<<START>> <<START>> <<START>>
Neural Ordinary Differential Equations
Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud
University of Toronto, Vector Institute
{rtqichen, rubanova, jessebett, duvenaud}@cs.toronto.edu
Abstract
We introduce a new family of deep neural network models. Instead of specifying a
discrete sequence of hidden layers, we parameterize the derivative of the hidden
state using a neural network. The output of the network is computed using a black-
box differential equation solver. These continuous-depth models have constant
memory cost, adapt their evaluation strategy to each input, and can explicitly trade
numerical precision for speed. We demonstrate these properties in continuous-depth
residual networks and continuous-time latent variable models. We also construct
continuous normalizing flows, a generative model that can train by maximum
likelihood, without partitioning or ordering the data dimensions. For training, we
show how to scalably backpropagate through any ODE solver, without access to its
internal operations. This allows end-to-end training of ODEs within larger models.
1 Introduction
Models such as residual networks, recurrent neural network decoders, and normalizing flows build
complicated transformations by composing a sequence of transformations to a hidden state:
<<FORMULA>> (1)
where t ∈ {0 . . . T } and ht ∈ R^D. These iterative updates can be seen as an Euler discretization of a
continuous transformation (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018).
What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous
dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:
<<FORMULA>> (2)
Starting from the input layer h(0), we can define the output layer h(T ) to be the solution to this
ODE initial value problem at some time T . This value can be computed by a black-box differential
equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the
solution with the desired accuracy. Figure 1 contrasts these two approaches.
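The limit described above can be illustrated with a minimal sketch. It assumes a toy scalar dynamics f(h) = -h in place of a neural network and scales each residual update by a step size dt = T / num_layers; both choices, and the names residual_forward and errors, are illustrative assumptions rather than anything specified in the paper.

```python
import math

def f(h):
    """Toy hidden-state dynamics (a stand-in for a neural network)."""
    return -h

def residual_forward(h0, num_layers, T=1.0):
    """Apply num_layers residual updates h <- h + dt * f(h), i.e. Euler steps."""
    dt = T / num_layers
    h = h0
    for _ in range(num_layers):
        h = h + dt * f(h)
    return h

# Exact solution of dh/dt = -h with h(0) = 1, evaluated at T = 1.
exact = math.exp(-1.0)

# The error of the "residual network" shrinks as layers are added.
errors = [abs(residual_forward(1.0, n) - exact) for n in (4, 16, 64, 256)]
```

With more layers and smaller steps the discrete updates approach the ODE solution h(T), which is the quantity a black-box solver computes directly.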
Defining and evaluating models using ODE solvers has several benefits:
Memory efficiency In Section 2, we show how to compute gradients of a scalar-valued loss with
respect to all inputs of any ODE solver, without backpropagating through the operations of the solver.
Not storing any intermediate quantities of the forward pass allows us to train our models with constant
memory cost as a function of depth, a major bottleneck of training deep models.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Adaptive computation Euler's method is perhaps the simplest method for solving ODEs. There
have since been more than 120 years of development of efficient and accurate ODE solvers (Runge,
1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth
of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to
achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with
problem complexity. After training, accuracy can be reduced for real-time or low-power applications.
Scalable and invertible normalizing flows An unexpected side-benefit of continuous transforma-
tions is that the change of variables formula becomes easier to compute. In Section 4, we derive
this result and use it to construct a new class of invertible density models that avoids the single-unit
bottleneck of normalizing flows, and can be trained directly by maximum likelihood.
Continuous time-series models Unlike recurrent neural networks, which require discretizing
observation and emission intervals, continuously-defined dynamics can naturally incorporate data
which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.
2 Reverse-mode automatic differentiation of ODE solutions
The main technical difficulty in training continuous-depth networks is performing reverse-mode
differentiation (also known as backpropagation) through the ODE solver. Differentiating through
the operations of the forward pass is straightforward, but incurs a high memory cost and introduces
additional numerical error.
We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity
method (Pontryagin et al., 1962). This approach computes gradients by solving a second, aug-
mented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly
with problem size, has low memory cost, and explicitly controls numerical error.
Consider optimizing a scalar-valued loss function L(), whose input is the result of an ODE solver:
<<FORMULA>> (3)
To optimize L, we require gradients with respect to θ. The first step is to determine how the gradient
of the loss depends on the hidden state z(t) at each instant. This quantity is called the adjoint a(t) = ∂L/∂z(t).
Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:
<<FORMULA>> (4)
We can compute ∂L/∂z(t0 ) by another call to an ODE solver. This solver must run backwards, starting from the initial
value of ∂L/∂z(t1 ). One complication is that solving this ODE requires knowing the value of z(t) along its entire
trajectory. However, we can simply recompute z(t) backwards in time together with the adjoint, starting from its final
value z(t1 ).
If the loss depends directly on the state at multiple observation times, the adjoint state must be
updated in the direction of the partial derivative of the loss with respect to each observation.
Computing the gradients with respect to the parameters θ requires evaluating a third integral,
which depends on both z(t) and a(t):
<<FORMULA>> (5)
The vector-Jacobian products <<FORMULA>> and <<FORMULA>> in (4) and (5) can be efficiently evaluated by
automatic differentiation, at a time cost similar to that of evaluating f . All integrals for solving z,
and <<FORMULA>> can be computed in a single call to an ODE solver, which concatenates the original state, the
adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the
necessary dynamics, and call an ODE solver to compute all gradients at once.
<<ALGORITHM>>
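The following is a minimal sketch of the adjoint computation on a problem small enough to check by hand. It assumes scalar dynamics f(z, θ) = θ z and the loss L = z(t1)^2 / 2, and it uses a hand-written fixed-step RK4 integrator (the helper rk4 is hypothetical, standing in for a black-box solver). The augmented state concatenates z, the adjoint a, and the accumulated parameter gradient, and is integrated backwards from t1 to t0.

```python
import math

def rk4(func, y, t_start, t_end, steps=200):
    """Fixed-step RK4 for dy/dt = func(t, y); also works when t_end < t_start."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = func(t, y)
        k2 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = func(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

theta, z0, t0, t1 = 0.5, 1.5, 0.0, 1.0   # assumed toy values

def forward(t, y):
    return [theta * y[0]]                  # f(z, theta) = theta * z

def augmented(t, s):
    z, a, _ = s
    df_dz = theta                          # partial derivative of f w.r.t. z
    df_dtheta = z                          # partial derivative of f w.r.t. theta
    # Dynamics of [z, a, dL/dtheta]: the state, the adjoint, the gradient integrand.
    return [theta * z, -a * df_dz, -a * df_dtheta]

# Forward pass: only the final state is kept.
z1 = rk4(forward, [z0], t0, t1)[0]
dL_dz1 = z1                                # gradient of L = z1^2 / 2

# Backward pass: one solve from t1 to t0 recovers z(t0) and both gradients.
z0_rec, dL_dz0, dL_dtheta = rk4(augmented, [z1, dL_dz1, 0.0], t1, t0)
```

For this problem z(t1) = z0 exp(θ (t1 - t0)), so the gradients have the closed forms dL/dθ = z(t1)^2 (t1 - t0) and dL/dz(t0) = z(t1)^2 / z0, which the backward solve reproduces up to integration error.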
Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends
on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate
solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint
must be adjusted in the direction of the corresponding partial derivative ∂L/∂z(ti ).
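A sketch of the multi-observation case, under the same toy assumptions as a scalar linear ODE f(z, θ) = θ z, with the loss taken to be the sum of z(ti)^2 / 2 over the observation times. The helper rk4, the observation times, and all parameter values are hypothetical. The backward pass is split into one solve per interval, and the adjoint is incremented by ∂L/∂z(ti) at each observation before continuing.

```python
import math

def rk4(func, y, t_start, t_end, steps=200):
    """Fixed-step RK4 for dy/dt = func(t, y); also works when t_end < t_start."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = func(t, y)
        k2 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = func(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

theta, z0, t0 = 0.5, 1.5, 0.0
obs_times = [0.3, 0.7, 1.0]                # assumed observation times

def forward(t, y):
    return [theta * y[0]]

def augmented(t, s):
    z, a, _ = s
    return [theta * z, -a * theta, -a * z]  # [dz/dt, da/dt, d(grad)/dt]

# Forward pass: record the state at each observation time.
zs, z, t_prev = [], z0, t0
for t in obs_times:
    z = rk4(forward, [z], t_prev, t)[0]
    zs.append(z)
    t_prev = t

# Backward pass: one solve per interval, adjusting the adjoint at each observation.
times = [t0] + obs_times
state = [zs[-1], 0.0, 0.0]                 # [z, adjoint, dL/dtheta]
for i in range(len(obs_times), 0, -1):
    state[1] += zs[i - 1]                  # add dL/dz(t_i) = z(t_i)
    state = rk4(augmented, state, times[i], times[i - 1])

grad_theta = state[2]
exact_grad = sum((z0 * math.exp(theta * (t - t0))) ** 2 * (t - t0) for t in obs_times)
```

The closed-form value exact_grad follows from differentiating each z(ti) = z0 exp(θ (ti - t0)) with respect to θ.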
The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of
Algorithm 1 including derivatives w.r.t. t0 and t1 can be found in Appendix C. Detailed derivations
are provided in Appendix B. Appendix D provides Python code which computes all derivatives for
scipy.integrate.odeint by extending the autograd automatic differentiation package. This
code also supports all higher-order derivatives. We have since released a PyTorch (Paszke et al.,
2017) implementation, including GPU-based implementations of several standard ODE solvers at
github.com/rtqichen/torchdiffeq.
3 Replacing residual networks with ODEs for supervised learning
In this section, we experimentally investigate the training of neural ODEs for supervised learning.
Software To solve ODE initial value problems numerically, we use the implicit Adams method
implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being
an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires
solving a nonlinear optimization problem at every step. This setup makes direct backpropagation
through the integrator difficult. We implement the adjoint sensitivity method in Python's autograd
framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden
state dynamics and their derivatives on the GPU using TensorFlow, which were then called from the
Fortran ODE solvers, which were called from Python autograd code.
Model Architectures We experiment with a small residual network which downsamples the input twice
then applies 6 standard residual blocks (He et al., 2016b), which are replaced by an ODESolve
module in the ODE-Net variant. We also test a network with the same architecture but where gradients
are backpropagated directly through a Runge-Kutta integrator, referred to as RK-Net. Table 1 shows
test error, number of parameters, and memory cost. L denotes the number of layers in the ResNet,
and L̃ is the number of function evaluations that the ODE solver requests in a single forward pass,
which can be interpreted as an implicit number of layers. We find that ODE-Nets and RK-Nets can
achieve around the same performance as the ResNet.
Error Control in ODE-Nets ODE solvers can approximately ensure that the output is within a
given tolerance of the true solution. Changing this tolerance changes the behavior of the network.
We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is
proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a
trade-off between accuracy and computational cost. One could train with high accuracy, but switch to
a lower accuracy at test time.
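The tolerance trade-off can be sketched with a small adaptive solver. This is not the solver used in the experiments: it assumes an embedded Euler/Heun pair with a textbook step-size controller, applied to the toy problem dz/dt = -z, and the function name adaptive_heun is hypothetical. It returns both the solution and the number of function evaluations (NFE).

```python
import math

def adaptive_heun(f, z0, t0, t1, tol):
    """Integrate dz/dt = f(t, z) from t0 to t1 with an embedded Euler/Heun pair.

    Returns (z(t1), number of function evaluations).
    """
    t, z, nfe = t0, z0, 0
    h = (t1 - t0) / 10
    while t1 - t > 1e-12:
        h = min(h, t1 - t)
        k1 = f(t, z)
        k2 = f(t + h, z + h * k1)
        nfe += 2
        err = abs(h * (k2 - k1) / 2)       # |Heun step - Euler step|
        if err <= tol:                     # accept the step
            t += h
            z += h * (k1 + k2) / 2
        # Grow or shrink the step based on the error estimate.
        h *= min(5.0, max(0.2, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return z, nfe

def decay(t, z):
    return -z

exact = math.exp(-2.0)
z_loose, nfe_loose = adaptive_heun(decay, 1.0, 0.0, 2.0, tol=1e-2)
z_tight, nfe_tight = adaptive_heun(decay, 1.0, 0.0, 2.0, tol=1e-6)
err_loose, err_tight = abs(z_loose - exact), abs(z_tight - exact)
```

Loosening the tolerance reduces the number of function evaluations at the price of a larger error, which is the knob one could turn between training and test time.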
Figure 3: Statistics of a trained ODE-Net. (NFE = number of function evaluations.)
Figure 3c shows a surprising result: the number of evaluations in the backward pass is roughly
half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory
efficient, but also more computationally efficient than directly backpropagating through the integrator,
because the latter approach will need to backprop through each function evaluation in the forward
pass.
Network Depth It's not clear how to define the depth of an ODE solution. A related quantity is
the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver
and dependent on the initial state or input. Figure 3d shows that the number of function evaluations
increases throughout training, presumably adapting to increasing complexity of the model.
4 Continuous Normalizing Flows
The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and
the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to
compute exact changes in probability if samples are transformed through a bijective function f :
<<FORMULA>> (6)
An example is the planar normalizing flow (Rezende and Mohamed, 2015):
<<FORMULA>> (7)
Generally, the main bottleneck to using the change of variables formula is computing the determinant
of the Jacobian ∂f/∂z, which has a cubic cost in either the dimension of z, or the number
of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow
layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).
Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the
computation of the change in normalizing constant:
Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable
with probability p(z(t)) dependent on time. Let dz/dt = f (z(t), t) be a differential equation describing
a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z
and continuous in t, then the change in log probability also follows a differential equation,
<<FORMULA>> (8)
Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation.
Also unlike standard finite flows, the differential equation f does not need to be bijective, since if
uniqueness is satisfied, then the entire transformation is automatically bijective.
As an example application of the instantaneous change of variables, we can examine the continuous
analog of the planar flow, and its change in normalization constant:
<<FORMULA>> (9)
Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving
this combined ODE.
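A minimal one-dimensional sketch of solving the combined ODE. It assumes planar-style dynamics dz/dt = u tanh(w z + b) with made-up parameters U, W, B, a standard normal initial density, and a hand-written fixed-step RK4 helper; none of these come from the paper's experiments. In one dimension the trace of the Jacobian is simply df/dz, so the log density obeys d log p / dt = -df/dz.

```python
import math

def rk4(func, y, t_start, t_end, steps=200):
    """Fixed-step RK4 for dy/dt = func(t, y)."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = func(t, y)
        k2 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = func(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

U, W, B = 0.8, 1.5, -0.3                   # hypothetical flow parameters

def cnf_dynamics(t, state):
    z, _ = state
    h = math.tanh(W * z + B)
    dz_dt = U * h
    trace = U * W * (1.0 - h * h)          # df/dz, the 1-D Jacobian trace
    return [dz_dt, -trace]                 # [dz/dt, d log p / dt]

def flow(z_init, T=1.0):
    """Push z_init through the flow and track its log density."""
    log_p0 = -0.5 * z_init * z_init - 0.5 * math.log(2.0 * math.pi)
    z_T, log_p_T = rk4(cnf_dynamics, [z_init, log_p0], 0.0, T)
    return z_T, log_p_T, log_p0

z_T, log_p_T, log_p0 = flow(0.4)
```

The result can be checked against the ordinary change of variables formula: the change in log density should equal minus the log of the derivative of the flow map, which can be estimated by finite differences.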
Using multiple hidden units with linear cost While det is not a linear function, the trace function
is, which implies tr(Σ_n J_n) = Σ_n tr(J_n). Thus if our dynamics is given by a sum of functions then
the differential equation for the log density is also a sum:
<<FORMULA>> (10)
This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in
the number of hidden units M . Evaluating such wide flow layers using standard normalizing flows
costs O(M^3), meaning that standard NF architectures use many layers of only a single hidden unit.
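The linearity of the trace can be checked numerically. The sketch below assumes dynamics of the form f(z) = Σ_n u_n tanh(w_n · z + b_n) with randomly drawn, purely illustrative parameters U, W, B, and it omits any gating. The analytic trace is a sum of M per-unit terms, each costing O(D), and it is compared with the trace of a finite-difference Jacobian.

```python
import math
import random

rng = random.Random(0)
D, M = 3, 5                                # state dimension, hidden units (assumed)
U = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
B = [rng.uniform(-1, 1) for _ in range(M)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dynamics(z):
    """f(z) = sum_n u_n * tanh(w_n . z + b_n)."""
    out = [0.0] * D
    for u, w, b in zip(U, W, B):
        h = math.tanh(dot(w, z) + b)
        for i in range(D):
            out[i] += u[i] * h
    return out

def trace_analytic(z):
    """Sum of per-unit traces: tr(u w^T h') = (u . w) * h'."""
    total = 0.0
    for u, w, b in zip(U, W, B):
        h = math.tanh(dot(w, z) + b)
        total += dot(u, w) * (1.0 - h * h)
    return total

def trace_finite_difference(z, eps=1e-5):
    """Trace of the Jacobian estimated by central differences."""
    total = 0.0
    for i in range(D):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        total += (dynamics(zp)[i] - dynamics(zm)[i]) / (2 * eps)
    return total

z = [0.2, -0.7, 1.1]
tr_exact = trace_analytic(z)
tr_fd = trace_finite_difference(z)
```

No determinant is ever formed, so the cost of the log-density term grows linearly with the number of hidden units M.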
Time-dependent dynamics We can specify the parameters of a flow as a function of t, making the
differential equation f (z(t), t) change with t. This parameterization is a kind of hypernetwork
(Ha et al., 2016). We also introduce a gating mechanism for each hidden unit,
<<FORMULA>>
where σn (t) ∈ (0, 1) is a neural network that learns when the dynamic fn (z) should be applied. We
call these models continuous normalizing flows (CNF).
4.1 Experiments with Continuous Normalizing Flows
We first compare continuous and discrete planar flows at learning to sample from a known distribution.
We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with
K = M layers, and sometimes much more expressive.
Density matching We configure the CNF as described above, and train for 10,000 iterations
using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using
RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we
minimize KL(q(x) ‖ p(x)) as the loss function where q is the flow model and the target density p(·)
can be evaluated. Figure 4 shows that CNF generally achieves lower loss.
Maximum Likelihood Training A useful property of continuous-time normalizing flows is that
we can compute the reverse transformation for about the same cost as the forward pass, which cannot
be said for normalizing flows. This lets us train the flow on a density estimation task by performing
maximum likelihood estimation, which maximizes Ep(x) [log q(x)] where q(·) is computed using
the appropriate change of variables theorem, then afterwards reverse the CNF to generate random
samples from q(x).
For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 5
shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the
transformed distribution after a small amount of time which shows the locations of the initial planar
flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that
the particles can be evenly spread into circles. While the CNF transformations are smooth and
interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting
the two moons dataset in Figure 5b.
5 A generative latent function time-series model
Applying neural networks to irregularly-sampled data such as medical records, network traffic, or
neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the
latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-
defined latent variables. Missing data can be addressed using generative time-series models (Álvarez
and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data
imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of
an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).
We present a continuous-time, generative approach to modeling time series. Our model represents
each time series by a latent trajectory. Each trajectory is determined from a local initial state, zt0 , and
a global set of latent dynamics shared across all time series. Given observation times t0 , t1 , . . . , tN
and an initial state zt0 , an ODE solver produces zt1 , . . . , ztN , which describe the latent state at each
observation. We define this generative model formally through a sampling procedure:
<<FORMULA>> (11)
<<FORMULA>> (12)
<<FORMULA>> (13)
Function f is a time-invariant function that takes the value z at the current time step and outputs the
gradient: ∂z(t)/∂t = f (z(t), θf ). We parametrize this function using a neural net. Because f is time-
invariant, given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating
this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
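The sampling procedure can be sketched as follows. Everything concrete here is an assumption made for illustration: the dynamics f is a fixed 2-D rotation standing in for a neural network, the prior on the initial state is a standard normal, and the decoder emits the first latent coordinate plus Gaussian noise. The helper ode_solve is a hypothetical fixed-step RK4 routine that returns the latent state at each requested time, which may be irregularly spaced or earlier than t0.

```python
import math
import random

def f(z):
    """Time-invariant latent dynamics (a rotation, standing in for a neural net)."""
    return [-z[1], z[0]]

def rk4_step(z, h):
    k1 = f(z)
    k2 = f([zi + h / 2 * k for zi, k in zip(z, k1)])
    k3 = f([zi + h / 2 * k for zi, k in zip(z, k2)])
    k4 = f([zi + h * k for zi, k in zip(z, k3)])
    return [zi + h / 6 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]

def ode_solve(z_t0, t0, times, max_step=0.01):
    """Return the latent state at each time in `times`, starting from (t0, z_t0)."""
    states = []
    z, t = list(z_t0), t0
    for t_next in times:
        n = max(1, math.ceil(abs(t_next - t) / max_step))
        h = (t_next - t) / n               # negative h integrates backwards in time
        for _ in range(n):
            z = rk4_step(z, h)
        t = t_next
        states.append(list(z))
    return states

def sample_series(t0, times, rng, noise_std=0.1):
    """Sample z_t0, solve for the latent trajectory, then emit noisy observations."""
    z_t0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    latents = ode_solve(z_t0, t0, times)
    xs = [z[0] + rng.gauss(0.0, noise_std) for z in latents]
    return z_t0, latents, xs

# Irregular observation times, including one before t0 (backward extrapolation).
query_times = [0.13, 0.9, 2.45, -1.0]
states = ode_solve([1.0, 0.0], 0.0, query_times)
z_t0, latents, xs = sample_series(0.0, [0.2, 0.5, 1.7], random.Random(0))
```

Because the dynamics are time-invariant, a single latent state determines the whole trajectory, so the same routine serves for reconstruction and for extrapolation in either direction.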
Training and Prediction We can train this latent-variable model as a variational autoen-
coder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our
recognition net is an RNN, which consumes the data sequentially backwards in time, and out-
puts qφ (z0 |x1 , x2 , . . . , xN ). A detailed algorithm can be found in Appendix E. Using ODEs as a
generative model allows us to make predictions for arbitrary time points t1 ...tM on a continuous
timeline.
Poisson Process likelihoods The fact that an observation occurred often tells us something about
the latent state. For example, a patient may be more likely to take a medical test if they are sick.
The rate of events can be parameterized by a function of the latent state:
p(event at time t| z(t)) = λ(z(t)). Given this rate function, the likelihood of a set of independent
observation times in the interval [tstart , tend ] is given by an inhomogeneous Poisson process
(Palm, 1943):
We can parameterize λ(·) using another neural network. Conveniently, we can evaluate both the
latent trajectory and the
Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate
learned by such a model on a toy dataset.
A Poisson process likelihood on observation
times can be combined with a data likelihood to
jointly model all observations and the times at
which they were made.
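A sketch of evaluating the latent trajectory and the Poisson process likelihood in one augmented solve. It relies on the standard inhomogeneous Poisson log-likelihood, log p(t1 ... tN | tstart, tend) = Σ_i log λ(z(ti)) − ∫ λ(z(t)) dt, with the integral taken over [tstart, tend]. The scalar latent state, the dynamics dz/dt = -z, the rate λ(z) = z, and the helper names are all illustrative assumptions.

```python
import math

def rk4(func, y, t_start, t_end, steps=200):
    """Fixed-step RK4 for dy/dt = func(t, y)."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = func(t, y)
        k2 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = func(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def poisson_log_likelihood(event_times, t_start, t_end, z_start, f, rate):
    """Sum of log rates at the events minus the integrated rate over the interval."""
    def augmented(t, s):
        z, _ = s
        return [f(z), rate(z)]             # [latent dynamics, d(cumulative rate)/dt]

    state = [z_start, 0.0]                 # [z, integral of the rate so far]
    log_rates, t_prev = 0.0, t_start
    for t in sorted(event_times):
        state = rk4(augmented, state, t_prev, t)
        log_rates += math.log(rate(state[0]))
        t_prev = t
    state = rk4(augmented, state, t_prev, t_end)
    return log_rates - state[1]

events = [0.5, 1.2]
log_lik = poisson_log_likelihood(
    events, 0.0, 2.0, 1.0,
    f=lambda z: -z,                        # z(t) = exp(-t)
    rate=lambda z: z,                      # lambda(z(t)) = exp(-t)
)
# Closed form for this toy case: -(0.5 + 1.2) - (1 - exp(-2)).
expected = -sum(events) - (1.0 - math.exp(-2.0))
```

The cumulative rate is carried as an extra component of the ODE state, so no separate quadrature routine is needed.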
5.1 Time-series Latent ODE Experiments
We investigate the ability of the latent ODE model to fit and extrapolate time series. The
recognition network is an RNN with 25 hidden units. We use a 4-dimensional latent space. We
parameterize the dynamics function f with a one-hidden-layer network with 20 hidden units.
The decoder computing p(xti |zti ) is another neural network with one hidden layer with 20
hidden units. Our baseline was a recurrent neural net with 25 hidden units trained to minimize
negative Gaussian log-likelihood. We trained a second version of this RNN whose inputs were
concatenated with the time difference to the next observation to aid the RNN with irregular
observations.
Bi-directional spiral dataset We generated a dataset of 1000 2-dimensional spirals, each starting at a
different point, sampled at 100 equally-spaced timesteps. The dataset contains two types of spirals:
half are clockwise while the other half counter-clockwise. To make the task more realistic, we add
Gaussian noise to the observations.
[Figure 8 caption, partially recovered: (b): Reconstructions and extrapolations by a latent neural ODE.
Blue curve shows model prediction. Red shows extrapolation. (c) A projection of inferred 4-dimensional
latent ODE trajectories onto their first two dimensions. Color indicates the direction of the
corresponding trajectory.]
[Figure caption fragment: progression through time, starting at purple and ending at red. Note that the
trajectories on the left are counter-clockwise, while the trajectories on the right are clockwise.]
Time series with irregular time points To generate irregular timestamps, we randomly sample
points from each trajectory without replacement (n = {30, 50, 100}). We report predictive root-
mean-squared error (RMSE) on 100 time points extending beyond those that were used for training.
Table 2 shows that the latent ODE has substantially lower predictive RMSE.
We observed that reconstructions and extrapolations are consistent with the ground truth
regardless of number of observed points and despite the noise.
Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimen-
sions of the latent space. The trajectories form two separate clusters of trajectories, one decoding to
clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change
smoothly as a function of the initial point z(t0 ), switching from a clockwise to a counter-clockwise
spiral.
6 Scope and Limitations
Minibatching The use of mini-batches is less straightforward than for standard neural networks.
One can still batch together evaluations through the ODE solver by concatenating the states of each
batch element together, creating a combined ODE with dimension D × K. In some cases, controlling
error on all batch elements together might require evaluating the combined system K times more
often than if each system was solved individually. However, in practice the number of evaluations did
not increase substantially when using minibatches.
Uniqueness When do continuous dynamics have a unique solution? Picard's existence theorem
(Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is
unique if the differential equation is uniformly Lipschitz continuous in z and continuous in t. This
theorem holds for our model if the neural network has finite weights and uses Lipschitz nonlinearities,
such as tanh or relu.
Setting tolerances Our framework allows the user to trade off speed for precision, but requires
the user to choose an error tolerance on both the forward and reverse passes during training. For
sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation
experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading
performance.
Reconstructing forward trajectories Reconstructing the state trajectory by running the dynamics
backwards can introduce extra numerical error if the reconstructed trajectory diverges from the
original. This problem can be addressed by checkpointing: storing intermediate values of z on the
forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We
did not find this to be a practical problem, and we informally checked that reversing many layers of
continuous normalizing flows with default tolerances recovered the initial states.
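The informal check described above can be sketched on a toy system. The 2-D tanh dynamics, the initial state, and the fixed-step RK4 helper are all assumptions made for illustration; the point is only that integrating forward and then backward recovers the initial state up to integration error when the dynamics are well behaved.

```python
import math

def rk4(func, y, t_start, t_end, steps=400):
    """Fixed-step RK4 for dy/dt = func(t, y); also works when t_end < t_start."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = func(t, y)
        k2 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = func(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = func(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def dynamics(t, z):
    """Toy smooth dynamics with hypothetical fixed weights."""
    return [math.tanh(0.5 * z[0] - 1.2 * z[1]),
            math.tanh(1.1 * z[0] + 0.3 * z[1])]

z_start = [0.7, -0.4]
z_end = rk4(dynamics, z_start, 0.0, 2.0)       # forward pass
z_back = rk4(dynamics, z_end, 2.0, 0.0)        # run the dynamics backwards
reconstruction_error = max(abs(a - b) for a, b in zip(z_start, z_back))
```

For stiff or strongly contracting dynamics the backward solve can drift, which is the case where storing checkpoints of z on the forward pass becomes useful.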
7 Related Work
The use of the adjoint method for training continuous-time neural networks was previously pro-
posed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The
interpretation of residual networks (He et al., 2016a) as approximate ODE solvers spurred research
into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al.,
2017). We demonstrate these same properties in more generality by directly using an ODE solver.
Adaptive computation One can adapt computation time by training secondary neural networks
to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al.,
2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training
and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied,
computationally cheap, and generalizable rules for adapting the amount of computation.
Constant memory backprop through reversibility Recent work developed reversible versions
of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives
the same constant memory advantage as our approach. However, these methods require restricted
architectures, which partition the hidden units. Our approach does not have these restrictions.
Learning differential equations Much recent work has proposed learning differential equations
from data. One can train feed-forward or recurrent neural networks to approximate a differential
equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applica-
tions such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting
Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit
differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and
interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic
variational inference to recover the solution of a given stochastic differential equation.
Differentiating through ODE solvers The dolfin library (Farrell et al., 2013) implements adjoint
computation for general ODE and PDE solutions, but only by backpropagating through the individual
operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient
estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity
analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is
linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint
method to train bespoke latent dynamic models.
In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained
end-to-end with any other differentiable model components. While use of vector-Jacobian products
for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson
et al., In Press, 2018), we highlight the potential of a general integration of black-box ODE solvers
into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling.
8 Conclusion
We investigated the use of black-box ODE solvers as a model component, developing new models
for time-series modeling, supervised learning, and density estimation. These models are evaluated
adaptively, and allow explicit control of the tradeoff between computation speed and accuracy.
Finally, we derived an instantaneous version of the change of variables formula, and developed
continuous-time normalizing flows, which can scale to large layer sizes.
9 Acknowledgements
We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya,
Hossein Soleimani, Eldad Haber, Ken Caluwaerts, Daniel Flam-Shepherd, and Harry Braviner for
feedback. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful
discussions. We also thank Yuval Frommer for pointing out an unsupported claim about parameter
efficiency.
References
Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output
Gaussian processes. Journal of Machine Learning Research, 12(May):1459-1500, 2011.
Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks.
In International Conference on Machine Learning, pages 136-145, 2017.
Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, 2013.
Joel A E Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi - A
software framework for nonlinear optimization and optimal control. Mathematical Programming
Computation, In Press, 2018.
Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind.
Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18
(153):1-153, 2018.
Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betan-
court. The Stan math library: Reverse-mode automatic differentiation in c++. arXiv preprint
arXiv:1509.07164, 2015.
Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible
architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.
Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks
from dynamical systems view. In International Conference on Learning Representations, 2018.
URL https://openreview.net/forum?id=SyJS-OgR-.
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural
networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.
URL https://doi.org/10.1038/s41598-018-24271-9.
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun.
Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st
Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning
Research, pages 301-318. PMLR, 18-19 Aug 2016. URL http://proceedings.mlr.press/
v56/Choi16.html.
Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw-
Hill Education, 1955.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.
Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song.
Recurrent marked temporal point processes: Embedding event history to vector. In International
Conference on Knowledge Discovery and Data Mining, pages 1555-1564. ACM, 2016.
Patrick Farrell, David Ham, Simon Funke, and Marie Rognes. Automated derivation of the adjoint of
high-level transient finite element programs. SIAM Journal on Scientific Computing, 2013.
Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and
Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint,
2017.
J. Futoma, S. Hariharan, and K. Heller. Learning to Detect Sepsis with a Multitask Gaussian Process
RNN Classifier. ArXiv e-prints, 2017.
Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network:
Backpropagation without storing activations. In Advances in Neural Information Processing
Systems, pages 2211-2221, 2017.
Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
arXiv:1603.08983, 2016.
David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34
(1):014004, 2017.
E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I - Nonstiff Problems.
Springer, 1987.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770-778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European conference on computer vision, pages 630-645. Springer, 2016b.
Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
6a overview of mini-batch gradient descent, 2012.
Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in
recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference
on Learning Representations, 2014.
Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.
Improved variational inference with inverse autoregressive flow. In Advances in Neural Information
Processing Systems, pages 4743-4751, 2016.
W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
Mathematik und Physik, 46:435-453, 1901.
Yann LeCun, D Touretzky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation.
In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21-28. CMU,
Pittsburgh, Pa: Morgan Kaufmann, 1988.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Yang Li. Time-dependent representation for neural event sequence prediction. arXiv preprint
arXiv:1708.00065, 2017.
Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with
RNNs: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning
for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253-270.
PMLR, 18-19 Aug 2016. URL http://proceedings.mlr.press/v56/Lipton16.html.
Z. Long, Y. Lu, X. Ma, and B. Dong. PDE-Net: Learning PDEs from Data. ArXiv e-prints, 2017.
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks:
Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121,
2017.
Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Reverse-mode differentiation of
native Python. In ICML workshop on Automatic Machine Learning, 2015.
Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating
multivariate point process. In Advances in Neural Information Processing Systems, pages 6757-6767,
2017.
Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for
ODE based models using adjoint-state method. Computational Statistics, 32(4):1621-1643, 2017.
Conny Palm. Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics, 1943.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
PyTorch. 2017.
Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE
Transactions on Neural Networks, 6(5):1212-1228, 1995.
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The
mathematical theory of optimal processes. 1962.
M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial
differential equations. Journal of Computational Physics, pages 125-141, 2018.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for
data-driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236, 2018a.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for
time-dependent and nonlinear partial differential equations. SIAM Journal on Scientific Computing,
40(1):A172-A198, 2018b.
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate
inference in deep generative models. In Proceedings of the 31st International Conference on
Machine Learning, pages 1278-1286, 2014.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv
preprint arXiv:1505.05770, 2015.
C. Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen,
46:167-178, 1895.
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations.
arXiv preprint arXiv:1804.04272, 2018.
T. Ryder, A. Golightly, A. S. McGough, and D. Prangle. Black-box Variational Inference for
Stochastic Differential Equations. ArXiv e-prints, 2018.
Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta
means. In Advances in Neural Information Processing Systems 25, 2014.
Peter Schulam and Suchi Saria. What-if reasoning with counterfactual Gaussian processes. arXiv
preprint arXiv:1703.10651, 2017.
Hossein Soleimani, James Hensman, and Suchi Saria. Scalable joint models for reliable uncertainty-
aware event prediction. IEEE transactions on pattern analysis and machine intelligence, 2017a.
Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatment-response models for
counterfactual reasoning with continuous-time, continuous-valued interventions. arXiv preprint
arXiv:1704.02038, 2017b.
Jos Stam. Stable fluids. In Proceedings of the 26th annual conference on Computer graphics and
interactive techniques, pages 121-128. ACM Press/Addison-Wesley Publishing Co., 1999.
Paul Stapor, Fabian Froehlich, and Jan Hasenauer. Optimization and uncertainty analysis of ODE
models using second order adjoint sensitivity analysis. bioRxiv, page 272005, 2018.
Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow.
arXiv preprint arXiv:1611.09630, 2016.
Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latent-space physics: Towards learning the
temporal evolution of fluid flow. arXiv preprint arXiv:1802.10123, 2018.
Hong Zhang and Adrian Sandu. FATODE: a library for forward, adjoint, and tangent linear integration
of ODEs. SIAM Journal on Scientific Computing, 36(5):C504-C523, 2014.
Appendix A Proof of the Instantaneous Change of Variables Theorem
Theorem (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability
p(z(t)) dependent on time. Let dz/dt = f (z(t), t) be a differential equation describing a continuous-in-time
transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the
change in log probability also follows a differential equation:
<<FORMULA>>
Proof. To prove this theorem, we take the infinitesimal limit of finite changes of log p(z(t)) through time. First
we denote the transformation of z over an ε change in time as
<<FORMULA>> (14)
We assume that f is Lipschitz continuous in z(t) and continuous in t, so every initial value problem has a unique
solution by Picard's existence theorem. We also assume z(t) is bounded. These conditions imply that f, T_ε, and
∂T_ε/∂z are all bounded. In the following, we use these conditions to exchange limits and products.
<<FORMULA>>
We can write the differential equation <<FORMULA>> using the discrete change of variables formula, and the
definition of the derivative:
<<FORMULA>> (15)
<<FORMULA>> (16)
<<FORMULA>> (by L'Hôpital's rule) (17)
<<FORMULA>> (18)
<<FORMULA>> (19)
<<FORMULA>> (20)
The derivative of the determinant can be expressed using Jacobi's formula, which gives
<<FORMULA>> (21)
<<FORMULA>> (22)
<<FORMULA>> (23)
Substituting T_ε with its Taylor series expansion and taking the limit, we complete the proof.
<<FORMULA>> (24)
<<FORMULA>> (25)
<<FORMULA>> (26)
<<FORMULA>> (27)
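The theorem can be checked numerically in one dimension, where the trace reduces to ∂f/∂z. The following is a minimal sketch for illustration, not the implementation used in the paper: the dynamics f(z) = tanh(z), the standard normal base density, the fixed-step RK4 integrator, and all function names are assumptions made for this example. It integrates z(t) jointly with log p(z(t)) and compares the result with the discrete change of variables formula, where dz(t1)/dz(t0) is estimated by central differences.

```python
import math

def f(z):
    # Assumed example dynamics: dz/dt = tanh(z)
    return math.tanh(z)

def df_dz(z):
    # In one dimension the trace of the Jacobian is just the derivative
    return 1.0 - math.tanh(z) ** 2

def augmented_rhs(state):
    # State is [z, log p]; the second entry follows d log p / dt = -tr(df/dz)
    z, _ = state
    return [f(z), -df_dz(z)]

def rk4_step(state, h, rhs):
    k1 = rhs(state)
    k2 = rhs([s + 0.5 * h * k for s, k in zip(state, k1)])
    k3 = rhs([s + 0.5 * h * k for s, k in zip(state, k2)])
    k4 = rhs([s + h * k for s, k in zip(state, k3)])
    return [s + h / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def flow(z0, logp0, t1=1.0, steps=200):
    # Integrate the augmented state from t = 0 to t = t1
    state, h = [z0, logp0], t1 / steps
    for _ in range(steps):
        state = rk4_step(state, h, augmented_rhs)
    return state

def log_normal(z):
    # Log-density of a standard normal, used as p(z(t0))
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

z0 = 0.7
z1, logp1 = flow(z0, log_normal(z0))

# Reference: discrete change of variables, with dz1/dz0 by central differences
eps = 1e-5
dz1_dz0 = (flow(z0 + eps, 0.0)[0] - flow(z0 - eps, 0.0)[0]) / (2 * eps)
logp1_ref = log_normal(z0) - math.log(abs(dz1_dz0))
print(logp1, logp1_ref)
```

The two printed values should agree up to integration and finite-difference error.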
A.1 Special Cases
Planar CNF. Let f(z) = u h(w^T z + b), then ∂f/∂z = u (∂h/∂z)^T. Since the trace of an outer product is the inner
product, we have
<<FORMULA>> (28)
This is the parameterization we use in all of our experiments.
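The trace identity for the planar case can be verified numerically. The sketch below is illustrative only: the choice h = tanh, the parameter values, and the function names are assumptions. It compares the closed form tr(u (∂h/∂z)^T) = h'(w^T z + b) u^T w against the sum of the diagonal Jacobian entries obtained by central differences.

```python
import math

def planar_f(z, u, w, b):
    # f(z) = u * h(w^T z + b), with h = tanh as an assumed nonlinearity
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    return [ui * math.tanh(pre) for ui in u]

def trace_closed_form(z, u, w, b):
    # tr(u (dh/dz)^T) = u^T dh/dz = h'(w^T z + b) * (u^T w)
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    return (1.0 - math.tanh(pre) ** 2) * sum(ui * wi for ui, wi in zip(u, w))

def trace_finite_diff(z, u, w, b, eps=1e-6):
    # Sum of the diagonal entries d f_i / d z_i, each by central differences
    total = 0.0
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        total += (planar_f(zp, u, w, b)[i] - planar_f(zm, u, w, b)[i]) / (2 * eps)
    return total

z = [0.3, -1.2, 0.8]
u = [0.5, 1.0, -0.7]
w = [1.1, 0.4, -0.9]
b = 0.2
tr_exact = trace_closed_form(z, u, w, b)
tr_numeric = trace_finite_diff(z, u, w, b)
print(tr_exact, tr_numeric)
```

The closed form needs a single inner product, whereas the finite-difference route needs two evaluations of f per dimension.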
Hamiltonian CNF. The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits
the data into two equal partitions and is a volume-preserving transformation, implying that ∂ log p(z(t))/∂t = 0. We
can verify this. Let
<<FORMULA>> (29)
Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volume-preserving flow.
A.2 Connection to Fokker-Planck and Liouville PDEs
The Fokker-Planck equation is a well-known partial differential equation (PDE) that describes the probability
density function of a stochastic differential equation as it changes with time. We relate the instantaneous change
of variables to the special case of Fokker-Planck with zero diffusion, the Liouville equation.
As with the instantaneous change of variables, let z(t) ∈ R^D evolve through time following dz(t)/dt = f(z(t), t).
Then the Liouville equation describes the change in density at z, a fixed point in space, as a PDE,
<<FORMULA>> (30)
However, (30) cannot be easily used as it requires the partial derivative ∂p(z, t)/∂z, which is typically approximated
using finite differences. This type of PDE has its own literature on efficient and accurate simulation (Stam, 1999).
Instead of evaluating p(·, t) at a fixed point, if we follow the trajectory of a particle z(t), we obtain
<<FORMULA>>
(first term: partial derivative from the first argument, z(t); second term: partial derivative from the second argument, t)
<<FORMULA>> (31)
We arrive at the instantaneous change of variables by taking the log,
<<FORMULA>> (32)
While still a PDE, (32) can be combined with z(t) to form an ODE of size D + 1,
<<FORMULA>> (33)
Compared to the Fokker-Planck and Liouville equations, the instantaneous change of variables is of more
practical impact as it can be numerically solved much more easily, requiring only an extra state of size D for
following the trajectory of z(t). In contrast, an approach based on a finite-difference approximation of the
Liouville equation would require a grid size that is exponential in D.
Appendix B A Modern Proof of the Adjoint Method
We present an alternative proof of the adjoint method (Pontryagin et al., 1962) that is short and easy to follow.
B.1 Continuous Backpropagation
Let z(t) follow the differential equation dz(t)/dt = f(z(t), t, θ), where θ are the parameters. We will prove that if
we define an adjoint state
<<FORMULA>> (34)
then it follows the differential equation
<<FORMULA>> (35)
For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors.
The adjoint state is the gradient with respect to the hidden state at a specified time t. In standard neural networks,
the gradient of a hidden layer h_t depends on the gradient from the next layer h_{t+1} by the chain rule
<<FORMULA>> (36)
With a continuous hidden state, we can write the transformation after an ε change in time as
<<FORMULA>> (37)
<<FORMULA>> (38)
The proof of (35) follows from the definition of derivative:
<<FORMULA>> (39)
<<FORMULA>> (by Eq 38) (40)
<<FORMULA>> (Taylor series around z(t)) (41)
<<FORMULA>> (42)
<<FORMULA>> (43)
<<FORMULA>> (44)
<<FORMULA>> (45)
We pointed out the similarity between the adjoint method and backpropagation (eq. 38). As in backpropagation,
the ODE for the adjoint state needs to be solved backwards in time. We specify the constraint at the last time
point, which is simply the gradient of the loss wrt. the last time point, and can obtain the gradients with respect to
the hidden state at any time, including the initial value.
<<FORMULA>> (46)
Here we assumed that the loss function L depends only on the last time point t_N. If L also depends on
intermediate time points t_1, t_2, ..., t_{N-1}, we can repeat the adjoint step for each of the intervals
[t_{N-1}, t_N], [t_{N-2}, t_{N-1}], ... in backward order and sum up the obtained gradients.
B.2 Gradients wrt. θ and t
We can generalize (35) to obtain gradients with respect to θ (a constant wrt. t) and the initial and end times,
t_0 and t_N. We view θ and t as states with constant differential equations and write
<<FORMULA>> (47)
We can then combine these with z to form an augmented state^1 with corresponding differential equation and
adjoint state,
<<FORMULA>> (48)
Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the
previous section still hold as this is a special case of a time-variant ODE. The Jacobian of f has the form
<<FORMULA>> (49)
where each 0 is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain
<<FORMULA>> (50)
The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain
the total gradient with respect to the parameters, by integrating over the full interval and setting a_θ(t_N) = 0.
<<FORMULA>> (51)
Finally, we also get gradients with respect to t_0 and t_N, the start and end of the integration interval.
<<FORMULA>> (52)
Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver.
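As a concrete check of (35), (46), (51), and the end-time gradient in (52), the following sketch computes the gradients of a scalar problem by solving the augmented adjoint system backwards in time. It is an illustration under stated assumptions, not the authors' code: the dynamics f(z, θ) = tanh(θz), the loss L = z(t1)^2 / 2, the fixed-step RK4 integrator, and every function name are hypothetical choices made for this example.

```python
import math

def f(z, theta):
    return math.tanh(theta * z)

def df_dz(z, theta):
    return theta * (1.0 - math.tanh(theta * z) ** 2)

def df_dtheta(z, theta):
    return z * (1.0 - math.tanh(theta * z) ** 2)

def rk4(state, t_start, t_end, rhs, steps=400):
    h = (t_end - t_start) / steps  # negative when integrating backwards
    for _ in range(steps):
        k1 = rhs(state)
        k2 = rhs([s + 0.5 * h * k for s, k in zip(state, k1)])
        k3 = rhs([s + 0.5 * h * k for s, k in zip(state, k2)])
        k4 = rhs([s + h * k for s, k in zip(state, k3)])
        state = [s + h / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
    return state

def forward(z0, theta, t0=0.0, t1=1.0):
    return rk4([z0], t0, t1, lambda s: [f(s[0], theta)])[0]

def loss(z0, theta, t0=0.0, t1=1.0):
    return 0.5 * forward(z0, theta, t0, t1) ** 2

def adjoint_gradients(z0, theta, t0=0.0, t1=1.0):
    z1 = forward(z0, theta, t0, t1)
    a1 = z1  # dL/dz(t1) for L = z(t1)^2 / 2

    def rhs(s):
        # Augmented state [z, a, a_theta]: da/dt = -a df/dz, da_theta/dt = -a df/dtheta
        z, a, _ = s
        return [f(z, theta), -a * df_dz(z, theta), -a * df_dtheta(z, theta)]

    # Solve backwards from t1 to t0, starting with a_theta(t1) = 0
    _, a0, a_theta0 = rk4([z1, a1, 0.0], t1, t0, rhs)
    dL_dt1 = a1 * f(z1, theta)  # gradient wrt. the end time
    return a0, a_theta0, dL_dt1

z0, theta = 0.5, 1.3
dL_dz0, dL_dtheta, dL_dt1 = adjoint_gradients(z0, theta)

# Central finite differences through the forward solve, for comparison
eps = 1e-5
fd_z0 = (loss(z0 + eps, theta) - loss(z0 - eps, theta)) / (2 * eps)
fd_theta = (loss(z0, theta + eps) - loss(z0, theta - eps)) / (2 * eps)
fd_t1 = (loss(z0, theta, 0.0, 1.0 + eps) - loss(z0, theta, 0.0, 1.0 - eps)) / (2 * eps)
print(dL_dz0, fd_z0, dL_dtheta, fd_theta, dL_dt1, fd_t1)
```

The backward solve recomputes z(t) alongside the adjoint, so no intermediate forward states need to be stored.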
Appendix C Full Adjoint sensitivities algorithm
This more detailed version of Algorithm 1 includes gradients with respect to the start and end times of integration.
Algorithm 2 Complete reverse-mode derivative of an ODE initial value problem
Input: dynamics parameters θ, start time t_0, stop time t_1, final state z(t_1), loss gradient ∂L/∂z(t_1)
<<ALGORITHM>>
^1 Note that we've overloaded t to be both a part of the state and the (dummy) independent variable. The
distinction is clear given context, so we keep t as the independent variable for consistency with the rest of the
text.
Appendix D Autograd Implementation
<<ALGORITHM>>
Appendix E Algorithm for training the latent ODE model
To obtain the latent representation z_{t_0}, we traverse the sequence using an RNN and obtain the parameters of the
distribution q(z_{t_0} | {x_{t_i}, t_i}_i, θ_enc). The algorithm follows a standard VAE algorithm with an RNN
variational posterior and an ODESolve model:
<<ALGORITHM>>
<<FORMULA>> (53)
<<ALGORITHM>>
Appendix F Extra Figures
<<FIGURE>>
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Learning differential equations that are easy to solve
Jacob Kelly, University of Toronto, Vector Institute, jkelly@cs.toronto.edu
Jesse Bettencourt, University of Toronto, Vector Institute, jessebett@cs.toronto.edu
Matthew James Johnson, Google Brain, mattjj@google.com
David Duvenaud, University of Toronto, Vector Institute, duvenaud@cs.toronto.edu
Abstract
Differential equations parameterized by neural networks become expensive to solve
numerically as training progresses. We propose a remedy that encourages learned
dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate
for the time cost of standard numerical solvers, using higher-order derivatives
of solution trajectories. These derivatives are efficient to compute with Taylor-
mode automatic differentiation. Optimizing this additional objective trades model
performance against the time cost of solving the learned dynamics. We demonstrate
our approach by training substantially faster, while nearly as accurate, models in
supervised classification, density estimation, and time-series modelling tasks.
1 Introduction
Differential equations describe a system's behavior by specifying its instantaneous dynamics.
Historically, differential equations have been derived from theory, such as Newtonian mechanics,
Maxwell's equations, or epidemiological models of infectious disease, with parameters inferred
from observations. Solutions to these equations usually cannot be expressed in closed-form,
requiring numerical approximation. Recently, ordinary differential equations parameterized by
millions of learned parameters, called neural ODEs, have been fit for latent time series models,
density models, or as a replacement for very deep neural networks (Rubanova et al., 2019;
Grathwohl et al., 2019; Chen et al., 2018). These models are not constrained to match a theoretical
model, and sometimes substantially different dynamics can give nearly indistinguishable predictions.
This raises the possibility that we can find nearly equivalent models that are substantially easier
and faster to solve. Yet standard training methods have no way to penalize the complexity of the
dynamics being learned.
<<FIGURE>>
Equal Contribution. Code available at: github.com/jacobjinkelly/easy-neural-ode
How can we learn dynamics that are faster to solve numerically without substantially changing their
predictions? Many of the computational advantages of a continuous-time formulation come from
using adaptive solvers, and most of the time cost of these solvers comes from repeatedly evaluating
the dynamics function, which in our settings is a moderately-sized neural network. So, we'd like to
reduce the number of function evaluations (NFE) required for these solvers to reach a given error
tolerance. Ideally, we would add a term penalizing the NFE to the training objective, and let a
gradient-based optimizer trade off between solver cost and predictive performance. But because NFE
is integer-valued, we need to find a differentiable surrogate.
The NFE taken by an adaptive solver depends on how far it can extrapolate the trajectory forward
without introducing too much error. For example, for a standard adaptive-step Runge-Kutta solver
with order m, the step size is approximately inversely proportional to the norm of the local mth total
derivative of the solution trajectory with respect to time. That is, a larger mth derivative leads to a
smaller step size and thus more function evaluations. Thus, we propose to minimize the norm of this
total derivative during training, as a way to control the time required to solve the learned dynamics.
In this paper, we investigate the effect of this speed regularization in various models and solvers.
We examine the relationship between the solver order and the regularization order, and characterize
the tradeoff between speed and performance. In most instances, we find that solver speed can be
approximately doubled without a substantial increase in training loss. We also provide an extension
to the JAX program transformation framework that provides Taylor-mode automatic differentiation,
which is asymptotically more efficient for computing the required total derivatives than standard
nested gradients.
Our work compares against and generalizes that of Finlay et al. (2020), who proposed regularizing
dynamics in the FFJORD density estimation model, and showed that it stabilized dynamics enough
in that setting to allow the use of fixed-step solvers during training.
2 Background
An ordinary differential equation (ODE) specifies the instantaneous change of a vector-valued state
<<FORMULA>>, computing the state at a later time:
<<FORMULA>>
is called an initial value problem (IVP). For example, f could describe the equations of motion for a
particle, or the transmission and recovery rates for a virus across a population. Usually, the required
integral has no analytic solution, and must be approximated numerically.
Adaptive-step Runge-Kutta ODE Solvers Runge-Kutta methods (Runge, 1895; Kutta, 1901)
approximate the solution trajectories of ODEs through a series of small steps, starting at time t0 .
At each step, they choose a step size h, and fit a local approximation to the solution, ẑ(t), using
several evaluations of f. When h is sufficiently small, the numerical error of an mth-order method
is bounded by ‖ẑ(t + h) − z(t + h)‖ ≤ c h^{m+1} for some constant c (Hairer et al., 1993). So, for an
mth-order method, the local error grows approximately in proportion to the size of the mth coefficient
in the Taylor expansion of the true solution. All else being equal, controlling this coefficient for all
dimensions of z(t) will allow larger steps to be taken without surpassing the error tolerance.
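The link between higher derivatives of the trajectory and solver cost can be seen with a small adaptive solver. The sketch below is a simplified stand-in, not the dopri5 solver used in the experiments: it uses the embedded Heun-Euler 2(1) pair, a scalar state, and an assumed step-size controller with safety factor 0.9; the function names and the test dynamics dz/dt = cos(ωt) are also assumptions. A larger ω gives larger higher-order derivatives of the solution, so the solver must take smaller steps and spends more function evaluations (NFE).

```python
import math

def adaptive_heun(f, z0, t0, t1, tol=1e-6, h=0.1):
    """Adaptive Heun-Euler 2(1) pair; returns (z(t1), number of f evaluations)."""
    t, z, nfe = t0, z0, 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)                 # do not step past the end time
        k1 = f(t, z)
        k2 = f(t + h, z + h * k1)
        nfe += 2
        z_low = z + h * k1                 # Euler step, order 1
        z_high = z + 0.5 * h * (k1 + k2)   # Heun step, order 2
        err = abs(z_high - z_low)          # local error estimate
        if err <= tol:                     # accept the step, keep the better value
            t, z = t + h, z_high
        # The error estimate scales like h^2, hence the square root
        factor = 0.9 * math.sqrt(tol / err) if err > 0.0 else 5.0
        h *= min(5.0, max(0.2, factor))
    return z, nfe

def run(omega):
    # Solve dz/dt = cos(omega * t), z(0) = 0, on [0, 1]; exact z(1) = sin(omega) / omega
    return adaptive_heun(lambda t, z: math.cos(omega * t), 0.0, 0.0, 1.0)

z_slow, nfe_slow = run(1.0)
z_fast, nfe_fast = run(10.0)
print(nfe_slow, nfe_fast)
```

Both runs meet the same tolerance; only the number of evaluations needed to get there differs.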
Neural Ordinary Differential Equations The dynamics function f can be a moderately-sized
neural network, and its parameters θ trained by gradient descent. Solving the resulting IVP is
analogous to evaluating a very deep residual network in which the number of layers corresponds
to the number of function evaluations of the solver (Chang et al., 2017; Ruthotto & Haber, 2018;
Chen et al., 2018). Solving such continuous-depth models using adaptive numerical solvers has
several computational advantages over standard discrete-depth network architectures. However, this
approach is often slower than using a fixed-depth network, due to an inability to control the number
of steps required by an adaptive-step solver.
3 Regularizing Higher-Order Derivatives for Speed
The ability of Runge-Kutta methods to take large and accurate steps is limited by the Kth-order
Taylor coefficients of the solution trajectory. We would like these coefficients to be small. Specifically,
we propose to regularize the squared norm of the Kth-order total derivatives of the state with respect
to time, integrated along the entire solution trajectory:
<<FORMULA>> (1)
where ‖·‖₂² is the squared ℓ2 norm, and the dependence on the dynamics parameters θ is implicit
through the solution z(t) integrating dz(t)/dt = f(z(t), t, θ).
During training, we weight this regularization term by a hyperparameter λ and add it to our original loss
to get our regularized objective:
<<FORMULA>> (2)
What kind of solutions are allowed when R_K = 0? For K = 0,
<<FORMULA>>
we have ‖z(t)‖² = 0, so the only possible solution is z(t) = 0. For K = 1, we have ‖f(z(t), t)‖² = 0,
so all solutions are constant, flat trajectories. For K = 2 solutions are straight-line trajectories.
Higher values of K shrink higher derivatives, but don't penalize lower-order dynamics. For instance,
a quadratic trajectory will have R_3 = 0. Setting the Kth order dynamics to exactly zero everywhere
automatically makes all higher orders zero as well. Figure 1 shows that regularizing R_3 on a toy 1D
neural ODE reduces NFE.
<<FIGURE>>
Which orders should we regularize? We propose matching the order of the regularizer to that of the
solver being used. We conjecture that regularizing dynamics of lower orders than that of the solver
restricts the model unnecessarily, and that letting the lower orders remain unregularized should not
increase NFE very much. Figure 2 shows empirically which orders of Runge-Kutta solvers can
efficiently solve which orders of toy polynomial trajectories. We test these conjectures on real
models and datasets in section 6.2.
The solution trajectory and our regularization term can be computed in a single call to an ODE solver
by augmenting the system with the integrand in eq. (1).
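The augmentation can be sketched on a problem with a known answer. The example below is illustrative only: the linear dynamics dz/dt = Az, the fixed-step RK4 integrator, and the function names are assumptions, and the second total derivative is written by hand rather than obtained by automatic differentiation. The state [z, r] is integrated jointly, where dr/dt is the integrand of R_2, so r(t1) equals the regularizer after a single solve.

```python
import math

A = -1.5  # assumed linear dynamics dz/dt = A * z, chosen so that R_2 has a closed form

def f(z):
    return A * z

def d2z_dt2(z):
    # Second total derivative along the trajectory: d/dt f(z(t)) = (df/dz) * f(z)
    return A * f(z)

def augmented_rhs(state):
    z, _ = state
    # The second component accumulates the integrand of R_2, |d^2 z / dt^2|^2
    return [f(z), d2z_dt2(z) ** 2]

def solve_augmented(z0, t1=1.0, steps=200):
    """Fixed-step RK4 on the augmented state [z, running value of R_2]."""
    state, h = [z0, 0.0], t1 / steps
    for _ in range(steps):
        k1 = augmented_rhs(state)
        k2 = augmented_rhs([s + 0.5 * h * k for s, k in zip(state, k1)])
        k3 = augmented_rhs([s + 0.5 * h * k for s, k in zip(state, k2)])
        k4 = augmented_rhs([s + h * k for s, k in zip(state, k3)])
        state = [s + h / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
    return state

z0, t1 = 2.0, 1.0
z1, r2 = solve_augmented(z0, t1)

# Closed form: z(t) = z0 exp(A t), so R_2 = A^4 z0^2 (exp(2 A t1) - 1) / (2 A)
r2_exact = A ** 4 * z0 ** 2 * (math.exp(2 * A * t1) - 1.0) / (2 * A)
print(r2, r2_exact)
```

With an adaptive solver the same augmented state would be passed to the solver unchanged; the training loss then adds λ times the final value of r.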
4 Efficient Higher Order Differentiation with Taylor Mode
The number of terms in higher-order forward derivatives grows exponentially in K, becoming
prohibitively expensive for K = 5, and causing substantial slowdowns even for K = 2 and K = 3.
Luckily, there exists a generalization of forward-mode automatic differentiation (AD), known as
Taylor mode, which can compute the total derivative exactly for a cost of only O(K^2). We found
that this asymptotic improvement reduced wall-clock time by an order of magnitude, even for K as
low as 3.
First-order forward-mode AD Standard forward-mode AD computes, for a function f(x) and
an input perturbation vector v, the product (∂f/∂x) v. This Jacobian-vector product, or JVP, can be
computed efficiently without explicitly instantiating the Jacobian. This implicit computation of JVPs
is straightforward whenever f is a composition of operations for which implicit JVP rules are
known.
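A JVP can be computed without forming the Jacobian by carrying a tangent alongside each value. The sketch below uses dual numbers in plain Python to show the idea; the `Dual` class, the `jvp` helper, and the example function are hypothetical names written for this illustration and are not the JAX API.

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the tangents
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)
    __rmul__ = __mul__

def sin(x):
    # JVP rule for sin: the tangent is scaled by cos of the primal
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)

def jvp(fun, x, v):
    """Return (fun(x), J(x) v) for fun mapping a list of inputs to a list of outputs."""
    out = fun([Dual(xi, vi) for xi, vi in zip(x, v)])
    return [o.primal for o in out], [o.tangent for o in out]

def example(x):
    # f(x0, x1) = (x0 * x1, sin(x0) + 3 * x1)
    return [x[0] * x[1], sin(x[0]) + 3.0 * x[1]]

primal, tangent = jvp(example, [0.5, 2.0], [1.0, -1.0])
print(primal, tangent)
```

Each primitive only needs a rule for how it transforms a (value, tangent) pair, which is what makes the computation compositional.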
Higher-order Jacobian-vector products Forward-mode AD can be generalized to higher orders
to compute Kth-order Jacobians contracted K times against the perturbation vector: (∂^K f/∂x^K) v^{⊗K}.
Similarly, this can also be computed without representing any Jacobian matrices explicitly.
A naïve approach to higher-order forward mode is to recursively apply first-order forward mode.
Specifically, nesting JVPs K times gives the right answer: <<FORMULA>> but
causes an unnecessary exponential slowdown, costing O(exp(K)). This is because expressions that
appear in lower derivatives also appear in higher derivatives, but the work to compute them is not
shared across orders.
Taylor Mode Taylor-mode AD generalizes first-order forward mode to compute the first K derivatives
exactly with a time cost of only O(K^2) or O(K log K), depending on the operations involved. Instead
of providing rules for propagating perturbation vectors, one provides rules for propagating truncated
Taylor series. Some example rules are shown in table 1. For more details see the Appendix and
Griewank & Walther (2008, Chapter 12). We provide an open source implementation of Taylor mode AD
in the JAX Python library (Bradbury et al., 2018).
Table 1: Rules for propagating Taylor polynomial coefficients through standard functions. These rules
generalize standard first-order derivatives. Notation: z_[i] = (1/i!) z_i and ỹ_[i] = (i/i!) y_i.
Function        Taylor propagation rule
y = z + cw      y_[k] = z_[k] + c w_[k]
y = z * w       y_[k] = Σ_{j=0}^{k} z_[j] w_[k-j]
y = z / w       y_[k] = (1/w_[0]) [ z_[k] − Σ_{j=0}^{k-1} y_[j] w_[k-j] ]
y = exp(z)      ỹ_[k] = Σ_{j=1}^{k} y_[k-j] z̃_[j]
s = sin(z)      s̃_[k] = Σ_{j=1}^{k} z̃_[j] c_[k-j]
c = cos(z)      c̃_[k] = Σ_{j=1}^{k} −z̃_[j] s_[k-j]
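The propagation rules can be exercised directly on lists of normalized Taylor coefficients. The following sketch is a plain-Python illustration and not the JAX implementation mentioned above; the function names and the example ODE dz/dt = z * z are assumptions. It implements the product and exponential rules, then uses the product rule to build the Taylor coefficients of an ODE solution one order at a time through the recurrence z_[k+1] = f(z)_[k] / (k + 1).

```python
import math

def mul_coeff(z, w, k):
    """k-th normalized Taylor coefficient of y = z * w (Cauchy product)."""
    return sum(z[j] * w[k - j] for j in range(k + 1))

def taylor_mul(z, w):
    return [mul_coeff(z, w, k) for k in range(len(z))]

def taylor_exp(z):
    """Coefficients of y = exp(z), from k * y[k] = sum_{j=1..k} j * z[j] * y[k-j]."""
    y = [math.exp(z[0])]
    for k in range(1, len(z)):
        y.append(sum(j * z[j] * y[k - j] for j in range(1, k + 1)) / k)
    return y

def ode_taylor_coeffs(z0, order):
    """Normalized coefficients of the solution of dz/dt = z * z with z(0) = z0."""
    z = [z0]
    for k in range(order):
        # Only the k-th coefficient of f(z(t)) is needed to get z[k+1],
        # so the total work over all orders is O(order^2)
        z.append(mul_coeff(z, z, k) / (k + 1))
    return z

# exp(t) has normalized coefficients 1 / k!
exp_coeffs = taylor_exp([0.0, 1.0, 0.0, 0.0, 0.0])

# dz/dt = z^2 with z(0) = 1 has solution 1 / (1 - t), whose coefficients are all 1
sol_coeffs = ode_taylor_coeffs(1.0, 5)
print(exp_coeffs, sol_coeffs)
```

The Kth total derivative needed by the regularizer is recovered from the normalized coefficient as K! times z_[K].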
5 Experiments
We consider three different tasks in which continuous-depth or continuous-time models might have
computational advantages over standard discrete-depth models: supervised learning, continuous
generative modeling of time-series (Rubanova et al., 2019), and density estimation using continuous
normalizing flows (Grathwohl et al., 2019). Unless specified otherwise, we use the standard dopri5
Runge-Kutta 4(5) solver (Dormand & Prince, 1980; Shampine, 1986).
5.1 Supervised Learning
We construct a model for MNIST classification: it takes as input a flattened MNIST image and
integrates it through dynamics given by a simple MLP, then applies a linear classification layer. In
fig. 3 we compare the NFE and training error of a model with and without regularizing R_3.
<<FIGURE>>
Figure 3: Number of function evaluations (NFE) and training error during training. Speed regularization
(solid) decreases the NFE throughout training without substantially changing the training error.
5.2 Continuous Generative Time Series Models
As in Rubanova et al. (2019), we use the Latent ODE architecture for modelling trajectories of ICU
patients using the PhysioNet Challenge 2012 dataset (Silva et al., 2012). This variational autoencoder
architecture uses an RNN recognition network, and models the state dynamics using an ODE in a
latent space.
In the supervised learning setting described in the previous section only the final state affects model
predictions. In contrast, time-series models' predictions also depend on the value of the trajectory at
all intermediate times when observations were made. So, we might expect speed regularization to be
ineffective due to these extra constraints on the dynamics. However, fig. 4 shows that, without
changing their overall shape, the latent dynamics can be adjusted to reduce their NFE by a factor of 3.
Figure 4: Regularizing dynamics in a latent ODE modeling PhysioNet clinical data. Shown is a
representative 2-dimensional slice of 20-dimensional dynamics. We reduce average NFE from 281 to 90
while only incurring an 8% increase in loss.
5.3 Density Estimation with Continuous Normalizing Flows
Our third task is unsupervised density estimation, using a scalable variant of continuous normalizing
flows called FFJORD (Grathwohl et al., 2019). We fit the MINIBOONE tabular dataset from
Papamakarios et al. (2017) and the MNIST image dataset (LeCun et al., 2010). We use the respective
single-flow architectures from Grathwohl et al. (2019).
Grathwohl et al. (2019) noted that the NFE required to numerically integrate their dynamics could
become prohibitively expensive throughout training. Table 2 shows that we can reduce NFE by 38%
for only a 0.6% increase in log-likelihood measured in bits/dim.
How to train your Neural ODE We compare against the approach of Finlay et al. (2020), who
design two regularization terms specifically for stabilizing the dynamics of FFJORD models:
<<FORMULA>>
The first term is designed to encourage straight-line paths, and the second, stochastic, term is designed
to reduce overfitting. Finlay et al. (2020) used fixed-step solvers during training for some datasets.
We compare these two regularizers when training with each of adaptive and fixed-step solvers,
evaluating with an adaptive solver in both cases, in section 6.3.
6 Analysis and Discussion
6.1 Trading off function evaluations for loss
What does the trade-off between accuracy and speed look like? Ideally, we could reduce the solver
time a lot without substantially reducing model performance. Indeed, this is demonstrated in all three
settings we explored. Figure 5 shows that generally, model performance starts getting substantially
worse only after a 50% reduction in solver time when controlling R_2.
<<FIGURE>>
Figure 5: Tuning the regularization of R_2 trades off between training loss and solver speed in three
different applications of neural ODEs. Horizontal axes show average number of function evaluations,
and vertical axes show unregularized training loss, both at the end of training.
6.2 Order of regularization vs. order of solver
Which order of total derivatives should we regularize for a particular solver? As mentioned earlier,
we conjecture that the best choice would be to match the order of the solver being used. Regularizing
too low an order might needlessly constrain the dynamics and make it harder to fit the data, while
regularizing too high an order might leave the dynamics difficult to solve for a lower-order solver.
However, we also expect that optimizing higher-order derivatives might be challenging, since these
higher derivatives can change quickly even for small changes to the dynamics parameters.
Figures 6 and 7 investigate this question on the task of MNIST classification. Figure 6 compares the
effectiveness of regularizing different orders when using a solver of a particular order. For a 2nd
order solver, regularizing K = 2 produces a strictly better trade-off between performance and speed,
as expected. For higher-order solvers, including ones with adaptive order, we found that regularizing
orders above K = 3 gave little benefit.
<<FIGURE>>
Figure 7 investigates the relationship between R_K and the quantity it is meant to be a surrogate
for: NFE. We observe a clear monotonic relationship between the two, for all orders of solver and
regularization.
6.3 Do we reduce training time?
Our approach produces models that are fastest to evaluate at test time. However, when we train
with adaptive solvers we do not improve overall training time, due to the additional expense of
computing our regularizer. Training with a fixed-grid solver is faster, but can be unstable if dynamics
are unregularized. Finlay et al. (2020)'s regularization and ours allow us to use fixed-grid solvers and
reduce training time. However, ours is 2.4× slower than Finlay et al. (2020) for FFJORD because
their regularization re-uses terms already computed in the FFJORD training objective. For objectives
where these cannot be re-used, like MNIST classification, our method is 1.7× slower, but achieves
better test-time NFE.
6.4 Are we making the solver overconfident?
Because we optimize dynamics in a way specifically designed to make the solver take longer steps,
we might fear that we are “adversarially attacking” our solver, making it overconfident in its ability
to extrapolate. Figure 8c shows that this is not the case for MNIST classification.
6.5 Does speed regularization overfit?
Finlay et al. (2020) motivated one of their regularization terms by the possibility of overfitting: having
faster dynamics only for the examples in the training set, but still slow on the test set. However, they
did not check whether overfitting was occurring. In fig. 8b we confirm that our regularized dynamics
have nearly identical average solve time on a held-out test set, on MNIST classification.
7 Related Work
Although the field of numerical ODE solvers is extremely mature, as far as we know, there has
been almost no work specifically on tuning differential equations to be faster to solve. The closest
related work is Grathwohl et al. (2019), who mention attempting to use weight decay and spectral
normalization to reduce NFE, and of course Finlay et al. (2020), who, among other contributions,
introduced the use of fixed-step solvers for stable training.
<<FIGURE>>
Figure 8: Figure 8c: We observe that the actual solver error is about equally well-calibrated for
regularized dynamics as for random dynamics, indicating that regularization does not make the solver
overconfident. Figure 8b: There is negligible overfitting of solver speed. ??: Speed regularization
does not usefully improve generalization. For large λ, our method reduces overfitting, but increases
overall test error due to under-fitting.
Stabilizing dynamics Simard et al. (1991) regularized the dynamics of discrete-time recurrent
neural networks to improve their stability, by constraining the norm of the Jacobian of the dynamics
function in the direction of its largest eigenvalue. However, this approach has an O(D^3) time cost.
De Brouwer et al. (2019) introduced a parameterization of neural ODEs analogous to instantaneous
Gated Recurrent Unit (GRU) recurrent neural network architectures in order to stabilize training
dynamics. Dupont et al. (2019) provided theoretical arguments that adding extra dimensions to the
state of a neural ODE should make training easier, and showed that this helped reduce NFE during
training.
Gradually increasing depth Chang et al. (2017) noted the connection between residual networks
and ODEs, and took advantage of this connection to gradually make resnets deeper during training,
in order to save time. One can view the increase in NFE while training neural ODEs as an automatic, but
uncontrolled, version of their method. Their results suggest we might benefit from introducing a
speed regularization schedule that gradually tapers off during training.
Gradient Regularization Novak et al. (2018); Drucker & LeCun (1992) regularized the gradients
of neural networks to improve generalization.
Table 2: Density Estimation on MNIST using FFJORD. For adaptive solvers, indicated by ∞ Steps,
our approach is slowest to train, but requires the fewest NFE once trained. For fixed-step solvers our
approach achieves lower bits/dim and NFE when comparing across fixed-grid solvers using the same
number of steps. Fixed step solvers that diverged due to instability are indicated by NaN bits/dim.
8 Scope
The initial speedups obtained in this paper are not yet enough to make neural ODEs competitive with
standard fixed-depth architectures in terms of speed for standard supervised learning. However, there
are many applications where continuous-depth architectures provide a unique advantage. Besides
density models such as FFJORD and time series models, continuous-depth architectures have been
applied in solving mean-field games (Ruthotto et al., 2019), image segmentation (Pinckaers & Litjens,
2019), image super-resolution (Scao, 2020), and molecular simulations (Wang et al., 2020). These
applications, which already use continuous-time models, could benefit from the speed regularization
proposed in this paper.
While we investigated only ODEs in this paper, this approach could presumably be extended straight-
forwardly to neural stochastic differential equations fit by adaptive solvers (Li et al., 2020) and other
flavors of parametric differential equations fit by gradient descent (Rackauckas et al., 2019).
9 Limitations
Hyperparameters The hyperparameter λ needs to be chosen to balance speed and training loss.
On the other hand, neural ODEs don't require choosing the outer number of layers, which needs to
be chosen separately for each stack of layers in standard architectures.
One also needs to choose solver order and tolerances, and these can substantially affect solver speed.
We did not investigate loosening tolerances, or modifying other parameters of the solver. The default
tolerance of 1.4e-8 for both atol and rtol behaved well in all our experiments.
One also needs to choose K. Higher K seems to generally work better, but is slower per step at
training time. In principle, if one can express their utility explicitly in terms of training loss and NFE,
it may be possible to tune λ automatically during training using the predictable relationship between
RK and NFE shown in fig. 7.
Slower overall training Although speed regularization reduces the overall NFE during training, it
makes each step more expensive. In our density estimation experiments (table 2), the overall effect
was about 70% slower training, compared to no regularization, when using adaptive solvers.
However, test-time evaluation is much faster, since there is no slowdown per step.
10 Conclusions
This paper is an initial attempt at controlling the integration time of differential equations by regular-
izing their dynamics. This is an almost unexplored problem, and there are almost certainly better
quantities to optimize than the ones examined in this paper.
Based on these initial experiments, we propose three practical takeaways:
1. Across all tasks, tuning the regularization usually gave at least a 2x speedup without
substantially hurting model performance.
2. Overall training time with speed regularization is in general about 30% to 50% slower with
adaptive solvers.
3. For standard solvers, regularizing orders higher than R2 or R3 provided little additional
benefit.
Future work It may be possible to adapt solver architectures to take advantage of flexibility in
choosing the dynamics. Standard solver design has focused on robustly and accurately solving a
given set of differential equations. However, in a learning setting, we could consider simply rejecting
some kinds of dynamics as being too difficult to solve, analogous to other kinds of constraints we put
on models to encourage statistical regularization.
Acknowledgements
We thank Barak Perlmutter, Ken Jackson, Ricky T.Q. Chen, Will Grathwohl, Chris Finlay, and
Chris Rackauckas for feedback and helpful discussions. Resources used in preparing this research
were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and
companies sponsoring the Vector Institute.
References
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and
Wanderman-Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL
http://github.com/google/jax.
Chang, B., Meng, L., Haber, E., Tung, F., and Begert, D. Multi-level residual networks from
dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential
equations. In Advances in neural information processing systems, pp. 6571-6583, 2018.
De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. GRU-ODE-Bayes: Continuous modeling of
sporadically-observed time series. In Advances in Neural Information Processing Systems, pp.
7377-7388, 2019.
Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of
computational and applied mathematics, 6(1):19-26, 1980.
Drucker, H. and LeCun, Y. Improving generalization performance using double backpropagation.
IEEE Trans. Neural Networks, 3(6):991-997, 1992. doi: 10.1109/72.165600. URL
https://doi.org/10.1109/72.165600.
Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Advances in Neural Information
Processing Systems, pp. 3134-3144, 2019.
Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. M. How to train your neural ODE.
arXiv preprint arXiv:2002.02798, 2020.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form
continuous dynamics for scalable reversible generative models. International Conference on
Learning Representations, 2019.
Griewank, A. and Walther, A. Evaluating derivatives. 2008.
Hairer, E., Norsett, S., and Wanner, G. Solving Ordinary Differential Equations I: Nonstiff Problems,
volume 8. 01 1993. doi: 10.1007/978-3-540-78862-1.
Kutta, W. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
Mathematik und Physik, 46:435-453, 1901.
LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online].
Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Li, X., Chen, R. T. Q., Wong, T.-K. L., and Duvenaud, D. Scalable gradients for stochastic differential
equations. In Artificial Intelligence and Statistics, 2020.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and
generalization in neural networks: an empirical study. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation.
Advances in Neural Information Processing Systems, 2017.
Pinckaers, H. and Litjens, G. Neural ordinary differential equations for semantic segmentation of
individual colon glands. arXiv preprint arXiv:1910.10470, 2019.
Rackauckas, C., Innes, M., Ma, Y., Bettencourt, J., White, L., and Dixit, V. Diffeqflux.jl-a Julia
library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent ordinary differential equations for irregularly-
sampled time series. In Advances in Neural Information Processing Systems, pp. 5321-5331,
2019.
Runge, C. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
167-178, 1895.
Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations. Journal
of Mathematical Imaging and Vision, pp. 1-13, 2018.
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and Fung, S. W. A machine learning framework for
solving high-dimensional mean field game and mean field control problems. CoRR, abs/1912.01825,
2019. URL http://arxiv.org/abs/1912.01825.
Scao, T. L. Neural differential equations for single image super-resolution. arXiv preprint
arXiv:2005.00865, 2020.
Shampine, L. F. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):
135-150, 1986. ISSN 00255718, 10886842. URL http://www.jstor.org/stable/2008219.
Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. Predicting in-hospital mortality of
ICU patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in
Cardiology, pp. 245-248, 2012.
Simard, P., Raysz, J. P., and Victorri, B. Shaping the state space landscape in recurrent networks. In
Advances in neural information processing systems, pp. 105-112, 1991.
Wang, W., Axelrod, S., and Gómez-Bombarelli, R. Differentiable molecular simulations for control
and learning. arXiv preprint arXiv:2003.00868, 2020.
Appendix A Taylor-mode Automatic Differentiation
A.1 Taylor Polynomials
To clarify the relationship between the presentation in Chapter 13 of Griewank & Walther (2008) and
our results we give the distinction between the Taylor coefficients and derivative coefficients, also
known, unhelpfully, as Tensor coefficients.
For a sufficiently smooth vector valued function f : R^n → R^m and the polynomial
<<x(t) = x_{[0]} + x_{[1]} t + x_{[2]} t^2 + x_{[3]} t^3 + \cdots + x_{[d]} t^d \in \mathbb{R}^n>> (5)
we are interested in the d-truncated Taylor expansion
<<y(t) = f(x(t)) + O(t^{d+1})>> (6)
<<\equiv y_{[0]} + y_{[1]} t + y_{[2]} t^2 + y_{[3]} t^3 + \cdots + y_{[d]} t^d \in \mathbb{R}^m>> (7)
with the notation that <<FORMULA>> is the Taylor coefficient, which is the normalized derivative coefficient.
The Taylor coefficients of the expansion, y_[j], are smooth functions of the i ≤ j coefficients x_[i],
<<FORMULA>> (8)
<<FORMULA>> (9)
<<FORMULA>> (10)
<<FORMULA>> (11)
These, as given in Griewank & Walther (2008), are written in terms of the normalized, Taylor
coefficients. This obscures their direct relationship with the derivatives, which we make explicit.
Consider the polynomial eq. (5) with Taylor coefficients expanded so their normalization is clear.
Further, let's use suggestive notation that these coefficients correspond to the higher derivatives of
x with respect to t, making x(t) a Taylor polynomial. That is <<FORMULA>>.
<<FORMULA>> (12)
<<FORMULA>> (13)
<<FORMULA>> (14)
Again, we are interested in the polynomial eq. (7), but with the normalization terms explicit
<<FORMULA>> (15)
Now we can expand the expressions for the Taylor coefficients y_[i] to expressions for derivative
coefficients y_i = i! y_[i].
The coefficients of the Taylor expansion, y_j, are smooth functions of the i ≤ j coefficients x_i,
<<FORMULA>> (16)
<<FORMULA>> (17)
<<FORMULA>> (18)
<<FORMULA>> (19)
<<FORMULA>> (20)
<<FORMULA>> (21)
Therefore, eqs. (16), (17), (19) and (21) show that the derivative coefficients y_i are exactly the ith
order higher derivatives of the composition f(x(t)) with respect to t. The key insight of this exercise
is that by writing the derivative coefficients explicitly we reveal that the expressions for the terms,
eqs. (16) to (18) and (20), involve terms previously computed for lower order terms.
In general, it will be useful to consider that the derivative coefficient y_k is a function of all lower
order input derivatives
<<y_k = y_k(x_0, \dots, x_k)>>. (22)
We provide the API to compute this in JAX by indexing the k-th output of jet
<<y_k = jet(f, x_0, (x_1, \dots, x_k))[k]>>.
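The distinction between Taylor coefficients and derivative coefficients can be checked on a small case. The sketch below does not call jet; it is a plain-Python illustration that assumes f = exp and uses the standard recurrence obtained from y' = x' y. The function names are hypothetical and chosen only for this example.

```python
from math import exp, factorial

def exp_taylor_coefficients(x_taylor):
    """Normalized Taylor coefficients y_[k] of y(t) = exp(x(t)).

    x_taylor[k] is the normalized coefficient x_[k] of t**k in x(t).
    Differentiating gives y' = x' * y, and matching powers of t yields
    k * y_[k] = sum_{j=1..k} j * x_[j] * y_[k-j].
    """
    order = len(x_taylor) - 1
    y = [exp(x_taylor[0])]
    for k in range(1, order + 1):
        acc = sum(j * x_taylor[j] * y[k - j] for j in range(1, k + 1))
        y.append(acc / k)
    return y

def to_derivative_coefficients(taylor):
    """Convert normalized coefficients y_[k] to derivatives y_k = k! * y_[k]."""
    return [factorial(k) * c for k, c in enumerate(taylor)]

# Input path x(t) = t, so y(t) = exp(t) and every derivative at t = 0 equals 1.
x_taylor = [0.0, 1.0, 0.0, 0.0, 0.0]
y_taylor = exp_taylor_coefficients(x_taylor)    # approximately 1, 1, 1/2, 1/6, 1/24
y_deriv = to_derivative_coefficients(y_taylor)  # approximately 1, 1, 1, 1, 1
print(y_taylor)
print(y_deriv)
```

Note that each y_[k] is built from previously computed lower-order terms, which is the reuse that the derivative-coefficient expressions above make visible.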
A.2 Relationship with Differential Equations
A.2.1 Autonomous Form
We can transform the initial value problem
<<FORMULA>> (23)
into an autonomous dynamical system by augmenting the system to include the independent variable
with trivial dynamics Hairer et al. (1993):
<<FORMULA>> (24)
We do this for notational convenience, and because it makes clear that derivatives with respect to t are
meant in the “total” sense. This alleviates the potential ambiguity of ∂_t f(x(t), t), which could mean
either the derivative with respect to the second argument or the derivative through x(t) by the chain
rule <<FORMULA>>.
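As a minimal sketch of the augmentation, assuming the state is stored as a plain Python list with t appended as its last entry, a time-dependent f(x, t) can be wrapped into an autonomous system as follows. The helper name make_autonomous and the example dynamics are hypothetical.

```python
def make_autonomous(f):
    """Wrap f(x, t) as g(state), where state = x + [t] and dt/dt = 1."""
    def g(state):
        *x, t = state                  # split the augmented state into (x, t)
        return list(f(x, t)) + [1.0]   # original dynamics, then trivial dynamics for t
    return g

def f(x, t):
    # Hypothetical non-autonomous scalar dynamics dx/dt = -x + t.
    return [-x[0] + t]

g = make_autonomous(f)
print(g([2.0, 0.5]))  # [-1.5, 1.0]
```

After this wrapping, every derivative of g with respect to time is a total derivative, since t is just another coordinate of the state.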
A.2.2 Taylor Coefficients for ODE Solution with jet
Recall that jet gives us the coefficients y_i as a function of f and the coefficients x_j, j ≤ i. We
can use jet and the relationship x_{k+1} = y_k to recursively compute the coefficients of the solution
polynomial.
Algorithm 1 Taylor Coefficients for ODE Solution by Recursive Jet
<<ALGORITHM>>
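The recursion can be illustrated without jet on a case where the answer is known in closed form. The sketch below assumes the scalar dynamics dz/dt = z^2 and replaces the call to jet with truncated polynomial multiplication, which plays the same role for this particular f. It works with normalized Taylor coefficients, for which the relationship x_{k+1} = y_k reads (k + 1) z_[k+1] = y_[k]. All names are hypothetical.

```python
def poly_mul_truncated(a, b, order):
    """Coefficients of a(t) * b(t) up to and including t**order."""
    out = [0.0] * (order + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= order:
                out[i + j] += ai * bj
    return out

def ode_taylor_coefficients(z0, order):
    """Normalized Taylor coefficients of the solution of dz/dt = z**2, z(0) = z0."""
    z = [float(z0)]
    for k in range(order):
        # Taylor coefficients of f(z(t)) = z(t)**2 up to t**k, from the known z_[0..k].
        y = poly_mul_truncated(z, z, k)
        # dz/dt = f(z) implies (k + 1) * z_[k+1] = y_[k].
        z.append(y[k] / (k + 1))
    return z

# The exact solution is z0 / (1 - z0 * t), whose t**k coefficient is z0**(k + 1).
print(ode_taylor_coefficients(1.0, 5))  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(ode_taylor_coefficients(2.0, 3))  # [2.0, 4.0, 8.0, 16.0]
```

Each pass reuses all previously computed coefficients, so the cost of extending the expansion by one order depends only on evaluating the next coefficient of f along the current polynomial.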
A.3 Regularizing Taylor Terms
Computing the Taylor coefficients for the ODE solution as in algorithm 1 will give a local approx-
imation to the ODE solution. If infinitely many Taylor coefficients could be computed this would
give the exact solution. The order of the final Taylor coefficient, determining the truncation of the
polynomial, gives the order of the approximation.
If the higher order Taylor coefficients of the solution are large, then truncation will result in a local
approximation that quickly diverges from the solution. However, if the higher Taylor coefficients are
small then the local approximation will remain close to the solution. This motivates our regularization
method. The effect of our regularizer on the Taylor expansion of a solution to a neural ODE can be
seen in fig. 9.
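The effect of coefficient size on the truncated approximation can be seen on a toy problem. The sketch below assumes the linear dynamics dz/dt = a z, whose solution z0 exp(a t) has Taylor coefficients z0 a^k / k!, and compares a 3rd-order truncation for a small and a large value of a. This is an illustration only, not one of the paper's experiments.

```python
from math import exp, factorial

def truncated_solution(a, z0, t, order):
    """Taylor polynomial of z0 * exp(a * t) around t = 0, truncated at the given order."""
    return z0 * sum((a * t) ** k / factorial(k) for k in range(order + 1))

def truncation_error(a, z0, t, order):
    """Absolute gap between the exact solution and its truncated expansion."""
    return abs(z0 * exp(a * t) - truncated_solution(a, z0, t, order))

# Small a: higher-order coefficients a**k / k! decay quickly.
small = truncation_error(a=0.5, z0=1.0, t=1.0, order=3)
# Large a: higher-order coefficients are large, and the truncation drifts away.
large = truncation_error(a=4.0, z0=1.0, t=1.0, order=3)
print(small, large)
```

With the same order and the same time horizon, the error for the dynamics with large higher-order coefficients is several orders of magnitude greater, which is the behaviour the regularizer is meant to discourage.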
Appendix B Experimental Details
Experiments were conducted using GPU-based ODE solvers. Training gradients were computed
using the adjoint method, in which the trajectory is reconstructed backwards in time to save memory,
for backpropagation. As in Finlay et al. (2020), we normalize our regularization term in eq. (1) by
the dimension of the vector-valued trajectory z(t) so that we may choose λ free of scaling by the
dimension of the problem.
B.1 Efficient computation of the gradient of regularization term
To optimize our regularized objective, we must compute its gradient. We use the adjoint method
as described in Chen et al. (2018) to differentiate through the solution to the ODE. In particular, to
optimize our model we only need to compute the gradient of the regularization term. The adjoint
method gives the gradient of the ODE solution as a solution to an augmented ODE.
<<FIGURE>>
Figure 9: Left: The dynamics and a trajectory of a neural ODE trained on a toy supervised learning
problem. The dynamics are poorly approximated by a 6th-order local Taylor series, and require 92
NFE when solved by a 5th-order Runge-Kutta solver. Right: Regularizing the 6th-order derivatives of
trajectories gives dynamics that are easier to solve numerically, requiring only 68 NFE.
B.2 Supervised Learning
The dynamics function f : R^d × R → R^d is given by an MLP as follows
<<z_1 = σ(x)>>
<<h_1 = W_1 [z_1; t] + b_1>>
<<z_2 = σ(h_1)>>
<<y = W_2 [z_2; t] + b_2>>
where <<[·; ·]>> denotes concatenation of a scalar onto a column vector. The parameters are
<<W_1 ∈ R^{h×d}>>, <<b_1 ∈ R^h>> and <<W_2 ∈ R^{d×h}>>, <<b_2 ∈ R^d>>. Here we use 100 hidden units, i.e. <<h = 100>>. We have
<<d = 784>>, the dimension of an MNIST image.
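The four equations above can be sketched in plain Python as follows. Several details are assumptions made only for this illustration: σ is taken to be tanh, the weights are drawn from a small Gaussian, and W_1 and W_2 are given one extra column so that their shapes match the concatenation with t. The helper names are hypothetical, and a small d is used to keep the example fast.

```python
import math
import random

def affine(W, v, b):
    """Return W @ v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def init_params(d, h, seed=0, scale=0.1):
    """Placeholder initialisation; the extra column in each matrix multiplies t."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0.0, scale) for _ in range(d + 1)] for _ in range(h)]
    W2 = [[rng.gauss(0.0, scale) for _ in range(h + 1)] for _ in range(d)]
    return W1, [0.0] * h, W2, [0.0] * d

def dynamics(params, x, t, sigma=math.tanh):
    """Evaluate y = f(x, t) following the four equations of the MLP."""
    W1, b1, W2, b2 = params
    z1 = [sigma(xi) for xi in x]       # z_1 = sigma(x)
    h1 = affine(W1, z1 + [t], b1)      # h_1 = W_1 [z_1; t] + b_1
    z2 = [sigma(hi) for hi in h1]      # z_2 = sigma(h_1)
    return affine(W2, z2 + [t], b2)    # y = W_2 [z_2; t] + b_2

params = init_params(d=4, h=3)
dz_dt = dynamics(params, x=[0.1, -0.2, 0.3, 0.0], t=0.5)
print(len(dz_dt))  # 4
```

The output has the same dimension as the input state, as required for f to define an ODE on R^d.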
We train with a batch size of 100 for 160 epochs. We use the standard training set of 60,000 images,
and the standard test set of 10,000 images as a validation/test set. We optimize our model using SGD
with momentum with β = 0.9. Our learning rate schedule is 1e-1 for the first 60 epochs, 1e-2 until
epoch 100, 1e-3 until epoch 140, and 1e-4 for the final 20 epochs.
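The learning rate schedule above can be written as a small piecewise-constant function. The sketch assumes epochs are indexed from 0 and that each boundary epoch belongs to the later segment; the text does not state either convention.

```python
def learning_rate(epoch):
    """Piecewise-constant schedule over 160 epochs (0-indexed epochs assumed)."""
    if epoch < 60:
        return 1e-1
    if epoch < 100:
        return 1e-2
    if epoch < 140:
        return 1e-3
    return 1e-4

# With 60,000 training images and a batch size of 100 there are 600 updates per epoch.
steps_per_epoch = 60_000 // 100
total_steps = steps_per_epoch * 160
print(learning_rate(0), learning_rate(60), learning_rate(100), learning_rate(159))
print(total_steps)  # 96000
```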
B.3 Continuous Generative Modelling of Time-Series
The PhysioNet dataset consists of observations of 41 distinct traits over a time period of 48 hours.
We remove the parameters “Age”, “Gender”, “Height”, and “ICUType” as these attributes do not vary
in time. We also quantize the measurements for each attribute by the hour by averaging multiple
measurements within the same hour. This leaves 49 unique time stamps (the extra time stamp for
observations at exactly the endpoint of the 48 hour observation period). We report all our losses on
this quantized data. We performed this rather coarse quantization for computational reasons having
to do with our particular implementation of this model. The validation split was obtained by taking
a random split of 20% of the trajectories from the full dataset. In total there are 8000 trajectories.
Code is included for processing the dataset, and links to downloading the data may be found in the
code for Rubanova et al. (2019). All other experimental details may be found in the main body and
appendices of Rubanova et al. (2019).
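The hourly quantization described above can be sketched as follows. The record layout (time in hours, attribute name, value) and the choice to bin by flooring the time stamp are assumptions made for this illustration; the released preprocessing code may differ.

```python
from collections import defaultdict

def quantize_hourly(measurements, horizon_hours=48):
    """Average repeated measurements of one attribute that fall in the same hour.

    measurements: iterable of (time_in_hours, attribute, value) tuples.
    Returns a dict mapping (hour, attribute) to the mean value in that hour.
    """
    buckets = defaultdict(list)
    for t, attribute, value in measurements:
        if 0 <= t <= horizon_hours:
            # Floor to the hour; t == horizon_hours lands in the extra final bin.
            buckets[(int(t), attribute)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

records = [(0.2, "HR", 80.0), (0.7, "HR", 90.0), (48.0, "HR", 70.0)]
print(quantize_hourly(records))  # {(0, 'HR'): 85.0, (48, 'HR'): 70.0}
```

Under this binning rule the possible hour indices are 0 through 48, which gives the 49 unique time stamps mentioned above.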
B.4 Continuous Normalizing Flows
For the model trained on the MINIBOONE tabular dataset from Papamakarios et al. (2017), we used
the same architecture as in Table 4 in the appendix of Grathwohl et al. (2019). We chose the number
of epochs and a learning rate schedule based on manual tuning on the validation set, in contrast
to Grathwohl et al. (2019) who tuned these automatically using early stopping and an automatic
heuristic for the learning rate decay using evaluation on a validation set. In particular, we trained for
500 epochs with a learning rate of 1e-3 for the first 300 epochs, 1e-4 until epoch 425, and 1e-5
for the remaining 75 epochs. The number of epochs and learning rate schedule were determined by
evaluating the model on the validation set every 10 epochs, and decaying the learning rate by a factor
of 10 once the loss on the validation set stopped improving for several evaluations, with the goal of
matching or improving upon the log-likelihood reported in Grathwohl et al. (2019). The data was
obtained as made available from Papamakarios et al. (2017), which was already processed and split
into train/validation/test. In particular, the training set has 29556 examples, the validation set has
3284 examples, and the test set has 3648 examples, which consist of 43 features.
It is important to note that we implemented a single-flow model for the MNIST dataset, while the
original comparison in Finlay et al. (2020) was on a multi-flow model. This accounts for the discrepancy
in bits/dim and NFE reported in Finlay et al. (2020).
All other experimental details are as in Grathwohl et al. (2019).
B.5 Hardware
MNIST Supervised learning, Physionet Time-series, and MNIST FFJORD experiments were trained
and evaluated on an NVIDIA Tesla P100 GPU. Tabular data FFJORD experiments were evaluated on
an NVIDIA Tesla P100 GPU but trained on an NVIDIA Tesla T4 GPU. All experiments except for MNIST
FFJORD were trained with double precision for purposes of reproducibility.
Appendix C Additional Results
C.1 Overfitting of NFE
<<FIGURE>>
Figure 10: The difference in NFE is tracked by the variance of NFE.
In fig. 10 we note that there is a striking correspondence in the variance of NFE across individual
examples (in both the train set (dark red) and test set (light red)) and the absolute difference in NFE
between examples in the training set and test set. This suggests that any difference in the average
NFE between training examples and test examples is explained by noise in the estimate of the true
average NFE. It is also interesting that speed regularization does not have a monotonic relationship
with the variance of NFE; we speculate that this reflects an interaction between the NFE needed for
a particular example and how difficult that example is for the model to classify correctly.
C.2 Trading off function evaluations with a surrogate loss
In fig. 11 and fig. 12 we confirm that our method poses a suitable tradeoff not only on the loss being
optimized, but also on the potentially non-differentiable loss which we truly care about. On MNIST,
we get a similar Pareto curve when plotting classification error as opposed to cross-entropy loss, and
similarly on the time-series modelling task we see that we get a similar Pareto curve on MSE loss as
compared to IWAE loss. The Pareto curves are plotted for R3 and R2, respectively.
<<FIGURE>>
Figure 11: MNIST Classification
<<FIGURE>>
Figure 12: Physionet Time-Series
C.3 Wall-clock Time
We include additional tables with wall-clock time and training with fixed grid solvers in table 3 and
table 4.
Appendix D Comparison to How to Train Your Neural ODE
The terms from Finlay et al. (2020) are
<<FORMULA>>
and an estimate of
<<FORMULA>>
Table 3: Classification on MNIST
<<TABLE>>
These are combined with a weighted average and integrated along the solution trajectory.
These terms are motivated by the expansion
<<FORMULA>>
Namely, eq. (3) regularizes the first total derivative of the solution, f (z(t), t), along the trajectory, and
eq. (4) regularizes a stochastic estimate of the Frobenius norm of the spatial derivative, ∇z f (z(t), t),
along the solution trajectory.
In contrast, R2 regularizes the norm of the second total derivative directly. In particular, this takes
into account the ∂f/∂t term. In other words, this accounts for the explicit dependence of f on time,
while eq. (3) and eq. (4) capture only the implicit dependence on time through z(t).
Even in the case of an autonomous system, that is, where ∂f/∂t is identically 0 and the dynamics f only
depend implicitly on time, these terms still differ. Namely, R2 integrates the following along the
solution trajectory:
<<FORMULA>>
while Finlay et al. (2020) penalizes the respective norms of the matrix ∇z f (z(t), t) and vector
f (z(t), t) separately.
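A small numerical case makes the difference concrete. The sketch below assumes a hypothetical linear autonomous system f(z) = A z, for which the second total derivative is (∇z f) f = A (A z), and evaluates the integrands at a single point. The Frobenius norm is computed exactly here, whereas Finlay et al. (2020) use a stochastic estimate.

```python
def matvec(A, v):
    """Matrix-vector product with A stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

# Hypothetical autonomous linear dynamics dz/dt = A z with a nilpotent A.
A = [[0.0, 1.0],
     [0.0, 0.0]]
z = [0.0, 1.0]

f = matvec(A, z)                    # first total derivative: A z = [1, 0]
second_derivative = matvec(A, f)    # second total derivative: A (A z) = [0, 0]

r2_integrand = sq_norm(second_derivative)                       # 0.0
kinetic_term = sq_norm(f)                                       # 1.0
jacobian_frobenius_term = sum(a * a for row in A for a in row)  # 1.0
print(r2_integrand, kinetic_term, jacobian_frobenius_term)
```

At this point the trajectory moves with constant velocity, so the second total derivative vanishes, while the norms of f and of the Jacobian taken separately are both non-zero.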
Table 4: Density Estimation on Tabular Data (MINIBOONE)
<<TABLE>>
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
Chris Finlay 1 Jörn-Henrik Jacobsen 2 Levon Nurbekyan 3 Adam M Oberman 1
Abstract
Training neural ODEs on large datasets has not
been tractable due to the necessity of allowing
the adaptive numerical ODE solver to refine its
step size to very small values. In practice this
leads to dynamics equivalent to many hundreds
or even thousands of layers. In this paper, we
overcome this apparent difficulty by introducing
a theoretically-grounded combination of both op-
timal transport and stability regularizations which
encourage neural ODEs to prefer simpler dynam-
ics out of all the dynamics that solve a problem
well. Simpler dynamics lead to faster conver-
gence and to fewer discretizations of the solver,
considerably decreasing wall-clock time without
loss in performance. Our approach allows us to
train neural ODE-based generative models to the
same performance as the unregularized dynamics,
with significant reductions in training time. This
brings neural ODEs closer to practical relevance
in large-scale applications.
<<FIGURE>>
Figure 1. Optimal transport map and a generic normalizing flow.
1. Introduction
Recent research has bridged dynamical systems, a workhorse of mathematical modeling, with neural
networks, the de facto function approximator for high dimensional data. The great promise of this
pairing is that the vast mathematical machinery stemming from dynamical systems can be leveraged
for modelling high dimensional problems in a dimension-independent fashion.
Connections between neural networks and ordinary differential equations (ODEs) were almost
immediately noted after residual networks (He et al., 2016) were first proposed. Indeed, it was
observed that there is a striking similarity between ResNets and the numerical solution of ordinary
differential equations (E, 2017; Haber & Ruthotto, 2017; Ruthotto & Haber, 2018; Chen et al., 2018;
2019). In these works, deep networks are interpreted as discretizations of an underlying dynamical
system, where time indexes the “depth” of the network and the parameters of the discretized
dynamics are learned. An alternate viewpoint was taken by neural ODEs (Chen et al., 2018), where
the dynamics of the neural network are approximated by an adaptive ODE solver on the fly. This
latter approach is quite compelling as it does not require specifying the number of layers of the
network beforehand. Furthermore, it allows the learning of homeomorphisms without any structural
constraints on the function computed by the residual block.
Neural ODEs have shown great promise in the physical sciences
(Köhler et al., 2019), in modeling irregular time series
(Rubanova et al., 2019), mean field games (Ruthotto et al.,
2019), continuous-time modeling (Yildiz et al., 2019; Kanaa
et al., 2019), and for generative modeling through normaliz-
ing flows with free-form Jacobians (Grathwohl et al., 2019).
Recent work has even adapted neural ODEs to the stochas- based on (ODE) which abstain from a priori fixing step-size.
tic setting (Li et al., 2020). Despite these successes, some Chen et al.s method is a continuous-time generalization of
hurdles still remain. In particular, although neural ODEs are residual networks, where the dynamics are generated by an
memory efficient, they can take a prohibitively long time to adaptive ODE solver that chooses step-size on-the-fly.
train, which is arguably one of the main stumbling blocks
Because of their adaptive nature, neural ODEs can be more
towards their widespread adoption.
flexible than ResNets in certain scenarios, such as when
In this work we reduce the training time of neural ODEs trading between model speed and accuracy. Moreover given
by regularizing the learned dynamics, complementing other a fixed network depth, the memory footprint of neural ODEs
recent approaches to this end such as augmented neural is orders of magnitude smaller than a standard ResNet dur-
ODEs (Dupont et al., 2019). Without further constraints on ing training. They therefore show great potential on a host
their dynamics, high dimensional neural ODEs may learn of applications, including generative modeling and density
dynamics which minimize an objective function, but which estimation. An apparent drawback of neural ODEs is their
generate irregular solution trajectories. See for example long training time: although a learned function f (· ; θ) may
Figure 1b, where an unregularized flow exhibits undesirable generate a map that solves a problem particularly well, the
properties due to unnecessarily fluctuating dynamics. As computational cost of numerically integrating (ODE) may
a solution, we propose two theoretically motivated regular- be so prohibitive that it is not tractable in practice. In this
ization terms arising from an optimal transport viewpoint paper we demonstrate this need not be so: with proper reg-
of the learned map, which encourage well-behaved dynam- ularization, it is possible to learn f (· ; θ) so that (ODE) is
ics (see 1a left). We empirically demonstrate that proper easily and quickly solved.
regularization leads to significant speed-up in training time
without loss in performance, thus bringing neural ODEs 2.1. FFJORD
closer to deployment on large-scale datasets. Our methods
are validated on the problem of generative modelling and In density estimation and generative modeling, we wish
density estimation, as an example of where neural ODEs to estimate an unknown data distribution p(x) from which
have shown impressive results, but could easily be applied we have drawn N samples. Maximum likelihood seeks to
elsewhere. approximate p(x) with a parameterized distribution pθ (x)
by minimizing the Kullback-Leibler divergence between the
In summary, our proposed regularized neural ODE (RN- two, or equivalently minimizing
ODE) achieves the same performance as the baseline, while
reducing the wall-clock training time by many hours or even
days. <<FORMULA>> (1)
2. Neural ODEs & Continuous normalizing Continuous normalizing flows (Grathwohl et al., 2019; Chen
flows et al., 2018) parameterize pθ (x) using a vector field f :
Rd × R 7→ Rd as follows. Let z(x, T ) be the solution map
Neural ODEs simplify the design of deep neural networks given by running the dynamics (ODE) for fixed time T .
by formulating the forward pass of a deep network as the Suppose we are given a known distribution q at final time T ,
solution of a ordinary differential equation. Initial work such as the normal distribution. Change of variables tells us
along these lines was motivated by the similarity of the eval- that the distribution pθ (x) may be evaluated through
uation of one layer of a ResNet and the Euler discretization
of an ODE. Suppose the block in the t-th layer of a ResNet <<log pθ (x) = log q (z(x, T )) + log det | ∇ z(x, T )|>> (2)
is given by the function f (x, t; θ), where θ are the blocks
parameters. Then the evaluation of this layer of the ResNet Evaluating the log determinant of the Jacobian is difficult.
is simply xt+1 = xt + f (xt , t; θ). Now, instead consider Grathwohl et al. (2019) exploit the following identity from
the following ODE fluid mechanics (Villani, 2003, p 114)
<<FORMULA>> (ODE) <<log det | ∇ z(x, t)| = div (f ) (z(x, t), t))>> (3)
The Euler discretization of this ODE with step-size <<τ>> is where <<div(·)>> is the divergence operator, <<div(f ) (x) =
<<zt+1 = zt + τ f (zt , t; θ)>>, which is nearly identical to the i ∂xi fi (x)>>. By the fundamental theorem of calculus, we
forward evaluation of the ResNets layer (setting step-size 1
In the normalizing flow literature divergence is typically writ-
<<τ = 1>> gives equality). Armed with this insight, Chen et al. ten explicitly as the trace of the Jacobian, however we use div (·)
(2018) suggested a method for training neural networks which is more common elsewhere.
<<FIGURE>>
Figure 2. Log-likelihood (measured in bits/dim) on the validation set as a function of wall-clock time. Rolling average of three hours, with
90% confidence intervals.
may then rewrite (2) in integral form
<<\log p_\theta(x) = \log q(z(x, T)) + \int_0^T \mathrm{div}(f)(z(x, s), s) \, ds>> (4)
Remark 2.1 (Divergence trace estimate). In (Grathwohl
et al., 2019), the divergence is estimated using an unbiased
Monte-Carlo trace estimate (Hutchinson, 1990; Avron &
Toledo, 2011),
<<FORMULA>> (5)
By using the substitution (4), the task of maximizing log-likelihood
shifts from choosing p_θ to minimize (1), to learning
the flow generated by a vector field f. This results in a
normalizing flow with a free-form Jacobian and reversible
dynamics, and was named FFJORD by Grathwohl et al.
2.2. The need for regularity
The vector field learned through FFJORD that maximizes
the log-likelihood is not unique, and raises troubling problems
related to the regularity of the flow. For a simple
example, refer to Figure 1, where we plot two normalizing
flows, both mapping a toy one-dimensional distribution
to the unit Gaussian, and where both maximize the log-likelihood
of exactly the same sample of particles. Figure
1a presents a “regular” flow, where particles travel in straight
lines with constant speed. In contrast, Figure 1b
shows a flow that still maximizes the log-likelihood, but
that has undesirable properties, such as rapidly varying local
trajectories and non-constant speed.
From this simple motivating example, the need for regularity
of the vector field is apparent. Without placing demands
on the vector field f, it is entirely possible that the learned
dynamics will be poorly conditioned. This is not just a theoretical
exercise: because the dynamics must be solved with
a numerical integrator, poorly conditioned dynamics will
lead to difficulties during numerical integration of (ODE).
Indeed, later we present results demonstrating a clear correlation
between the number of time steps an adaptive solver
takes to solve (ODE), and the regularity of f.
How can the regularity of the vector field be measured? One
motivating approach is to measure the force experienced by
a particle z(t) under the dynamics generated by the vector
field f, which is given by the total derivative of f with
respect to time
<<FORMULA>> (6)
<<FORMULA>> (7)
Well conditioned flows will place constant, or nearly constant,
force on particles as they travel. Thus, in this work we
propose regularizing the dynamics with two penalty terms,
one term regularizing f and the other ∇f. The first penalty,
presented in Section 3, is a measure of the distance travelled
under the flow f, and can alternately be interpreted as the
kinetic energy of the flow. This penalty term is based on
numerical methods in optimal transport, and encourages
particles to travel in straight lines with constant speed. The
second penalty term, discussed in Section 4, performs regularization
on the Jacobian of the vector field. Taken together
the two terms ensure that the force experienced by a particle
under the flow is constant or nearly so.
These two regularizers will promote dynamics that follow
numerically easy-to-integrate paths, thus greatly speeding
up training time.
3. Optimal transport maps & Benamou-Brenier
There is a remarkable similarity between density estimation
using continuous time normalizing flows, and the calculation
of the optimal transport map between two densities
using the Benamou-Brenier formulation (Benamou & Brenier,
2000; Santambrogio, 2015). While a review of optimal
transport theory is far outside the scope of this paper, here
we provide an informal summary of key ideas relevant to
continuous normalizing flows. The quadratic-cost optimal
transport map between two densities p(x) and q(x) is a map
<<z : \mathbb{R}^d \mapsto \mathbb{R}^d>> minimizing the transport cost
<<FORMULA>> (8)
subject to the constraint that <<\int_A q(z) \, dz = \int_{z^{-1}(A)} p(x) \, dx>>,
in other words that the measure of any set A is preserved
under the map z. In a seminal work, Benamou & Brenier
(2000) showed that rather than solving for minimizers of (8)
directly, an indirect (but computationally efficient) method
is available by writing z(x, T) as the solution map of a
flow under a vector field f (as in (ODE)) for time T, by
minimizing
<<FORMULA>> (9a)
<<FORMULA>> (9b)
<<\rho_0(x) = p>>, (9c)
<<\rho_T(z) = q>>. (9d)
The objective function (9a) is a measure of the kinetic
energy of the flow. The constraint (9b) ensures probability
mass is conserved. The latter two constraints guarantee the
learned distribution agrees with the source p and target q.
Note that the kinetic energy (9a) is an upper bound on the
transport cost, with equality only at optimality.
The optimal flow f minimizing (9) has several particularly
appealing properties. First, particles induced by the optimal
flow f travel in straight lines. Second, particles travel
with constant speed. Moreover, under suitable conditions
on the source and target distributions, the optimal solution
map is unique (Villani, 2008). Therefore the solution map
z(x, t) is entirely characterized by the initial and final positions:
<<z(x, t) = (1 - \frac{t}{T}) z(x, 0) + \frac{t}{T} z(x, T)>>. Consequently,
given an optimal f it is extraordinarily easy to solve (ODE)
numerically with minimal computational effort.
3.1. Linking normalizing flows to optimal transport
Now suppose we wish to minimize (9a), with q(z) a unit
normal distribution, and p(x) a data distribution, unknown
to us, but from which we have drawn N samples, and which
we model as a discrete distribution of Dirac masses. Enforcing
the initial condition is trivial because we have sampled
from p directly. The continuity equation (9b) need not be
enforced because we are tracking a finite number of sampled
particles. However the final time condition ρ_T = q
cannot be implemented directly, since we do not have direct
control on the form ρ_T(z) takes. Instead, introduce
a Kullback-Leibler term to (9a) penalizing discrepancy
between ρ_T and q. This penalty term has an elegant simplification
when p(x) is modeled as a distribution of a finite
number of masses, as is done in generative modeling. Setting
ρ_0 = p_θ, a brief derivation yields
<<FORMULA>> (10)
With this simplification (9a) becomes
<<FORMULA>> (11)
For further details on this derivation consult the supplementary
materials.
The connection between the Benamou-Brenier formulation
of the optimal transport problem on a discrete set of points
and continuous normalizing flows is apparent: the optimal
transport problem (11) is a regularized form of the continuous
normalizing flow optimization problem (1). We therefore
expect that adding a kinetic energy regularization term
to FFJORD will encourage solution trajectories to prefer
straight lines with constant speed.
4. Unbiased Frobenius norm regularization of the Jacobian
Referring to equation (7), one can see that even if f is regularized
to be small, via a kinetic energy penalty term, if the
Jacobian is large then the force experienced by a particle
may also still be large. As a result, the error of the numerical
integrator can be large, which may lead an adaptive solver
to make many function evaluations. This relationship is
apparent in Figure 3, where we empirically demonstrate the
correlation between the number of function evaluations of
f taken by the adaptive solver, and the size of the Jacobian
norm of f. The correlation is remarkably strong: dynamics
governed by a poorly conditioned Jacobian matrix require
the adaptive solver to take many small time steps.
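The link between the size of the Jacobian and the work done by an adaptive solver can be illustrated on a toy problem. The sketch below is an illustration under stated assumptions, not the solver used in the experiments: `adaptive_heun` is a hypothetical, deliberately simple embedded Heun-Euler pair with a basic step-size controller (the paper uses a Runge-Kutta 4(5) solver), and the vector field is a linear rotation with frequency `omega`, whose Jacobian has Frobenius norm omega * sqrt(2).

```python
import math

def adaptive_heun(f, z0, t0, t1, tol=1e-5, h0=1e-2):
    """Integrate dz/dt = f(z, t) from t0 to t1 with an embedded Heun-Euler pair.

    Returns the final state and the number of evaluations of f.
    """
    z, t, h, nfe = list(z0), t0, h0, 0
    while t < t1:
        h = min(h, t1 - t)
        k1 = f(z, t)
        k2 = f([zi + h * ki for zi, ki in zip(z, k1)], t + h)
        nfe += 2
        # Heun (order 2) update; its gap to the Euler (order 1) update
        # equals 0.5 * h * (k2 - k1) and serves as the local error estimate.
        z_new = [zi + 0.5 * h * (a + b) for zi, a, b in zip(z, k1, k2)]
        err = 0.5 * h * math.sqrt(sum((b - a) ** 2 for a, b in zip(k1, k2)))
        if err <= tol:  # accept the step
            z, t = z_new, t + h
        # Step-size update for a first-order error estimate, clipped for safety.
        scale = 0.9 * math.sqrt(tol / max(err, 1e-16))
        h *= min(2.0, max(0.2, scale))
    return z, nfe

def rotation(omega):
    """Linear field f(z) = omega * (-z2, z1)."""
    return lambda z, t: [-omega * z[1], omega * z[0]]

z_slow, nfe_slow = adaptive_heun(rotation(1.0), [1.0, 0.0], 0.0, 1.0)
z_fast, nfe_fast = adaptive_heun(rotation(10.0), [1.0, 0.0], 0.0, 1.0)
print(nfe_slow, nfe_fast)  # the field with the larger Jacobian needs more evaluations
```

In this toy setting both fields have unit speed-scaled trajectories of the same shape (circles), so the growth in function evaluations comes from the larger Jacobian alone.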
Algorithm 1 RNODE: regularized neural ODE training of
FFJORD
<<ALGORITHM>>
<<FIGURE>>
Figure 3. Number of function evaluations vs Jacobian Frobenius
norm of flows on CIFAR10 during training with vanilla FFJORD,
using an adaptive ODE solver.
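The statement in Section 3 that the kinetic energy bounds the transport cost from above, with equality for straight constant-speed motion, can be checked for a single particle. The sketch below assumes a unit time horizon T = 1, a one-dimensional particle, and a hypothetical helper `kinetic_energy` that applies the midpoint rule; under these assumptions the bound reads: the integral of the squared velocity over [0, 1] is at least the squared distance between the endpoints.

```python
def kinetic_energy(velocity, n=1000):
    """Midpoint-rule approximation of the integral of velocity(t)^2 over [0, 1]."""
    h = 1.0 / n
    return sum(velocity((k + 0.5) * h) ** 2 for k in range(n)) * h

a, b = 0.0, 2.0                  # start and end position of one particle
transport_cost = (b - a) ** 2    # squared distance between the endpoints

# Path 1: straight line at constant speed, z(t) = a + (b - a) * t.
straight = kinetic_energy(lambda t: b - a)
# Path 2: same endpoints but accelerating, z(t) = a + (b - a) * t**2.
accelerating = kinetic_energy(lambda t: 2.0 * (b - a) * t)

print(transport_cost, straight, accelerating)
```

The constant-speed path attains the transport cost, while the accelerating path with the same endpoints has kinetic energy 4/3 times larger, which is the kind of trajectory a kinetic energy penalty discourages.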
<<FORMULA>>
Moreover, in particle-based methods, the kinetic energy
term forces dynamics to travel in straight lines only on
data seen during training, and so the regularity of the map
is only guaranteed on trajectories taken by training data.
The issue here is one of generalization: the map may be
irregular on off-distribution or perturbed images, and cannot
be remedied by the kinetic energy term during training alone.
In the context of generalization, Jacobian regularization is
analogous to gradient regularization, which has been shown
to improve generalization (Drucker & LeCun, 1992; Novak
et al., 2018).
For these reasons, we also propose regularizing the Jacobian
through its Frobenius norm. The Frobenius norm ‖·‖_F of a
real matrix A can be thought of as the ℓ2 norm of the matrix
A vectorized
<<FORMULA>> (12)
Equivalently it may be computed as
<<\|A\|_F = \sqrt{\mathrm{tr}(A A^T)}>> (13)
and is the Euclidean norm of the singular values of a matrix.
In trace form, the Frobenius norm lends itself to estimation
using a Monte-Carlo trace estimator (Hutchinson, 1990;
Avron & Toledo, 2011). For real matrix B, an unbiased
estimate of the trace is given by
<<FORMULA>> (14)
where <<FORMULA>> is drawn from a unit normal distribution.
Thus the squared Frobenius norm can be easily estimated by
setting <<B = A A^T>>.
Turning to the Jacobian <<FORMULA>> of a vector valued function
<<f : \mathbb{R}^d \mapsto \mathbb{R}^d>>, recall that the vector-Jacobian product
<<FORMULA>> may be quickly computed through reverse-mode
automatic differentiation. Therefore an unbiased Monte-Carlo
estimate of the Frobenius norm of the Jacobian is
readily available
<<FORMULA>> (15)
<<FORMULA>> (16)
Conveniently, in the FFJORD framework the quantity
<<FORMULA>> must be computed during the estimate of the probability
distribution under the flow, in the Monte-Carlo estimate
of the divergence term (5). Thus Jacobian Frobenius
norm regularization is available with essentially no extra
computational cost.
5. Algorithm description
All together, we propose modifying the objective function
of the FFJORD continuous normalizing flow (Grathwohl
et al., 2019) with the two regularization penalties of Sections
3 & 4. The proposed method is called RNODE, short
for regularized neural ODE. Pseudo-code of the method is
<<TABLE>>
Table 1. Log-likelihood (in bits/dim) and training time (in hours) on validation images with uniform dequantization. Results on clean
images are found in the supplemental materials. For comparison we report both the results of the original FFJORD paper (Grathwohl
et al., 2019) and our own independent run of FFJORD (“vanilla”) on CIFAR10 and MNIST. Vanilla FFJORD did not train on ImageNet64
(denoted by “x”). Also reported are results for other flow-based generative modeling papers. Our method (FFJORD with RNODE) achieves
log-likelihood comparable to FFJORD but is significantly faster.
<<FIGURE>>
Figure 4. Quality of generated samples on 5bit CelebA-HQ64 with RNODE. Here temperature annealing (Kingma & Dhariwal,
2018) with T = 0.7 was used to generate visually appealing images. For full sized CelebA-HQ256 samples, consult the supplementary
materials.
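The trace-based estimate of the squared Frobenius norm described in Section 4 can be sketched in a few lines. This is an illustration under stated assumptions: a small fixed matrix `A` stands in for the Jacobian, the vector-Jacobian product is computed directly by the hypothetical helper `vjp` rather than by reverse-mode automatic differentiation, and the noise vector (called `eps` here, a name chosen for the sketch) is drawn from a unit normal distribution.

```python
import random

def vjp(eps, A):
    """Vector-Jacobian product eps^T A for a matrix stored as a list of rows."""
    n_rows, n_cols = len(A), len(A[0])
    return [sum(eps[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols)]

def frobenius_sq_estimate(A, n_samples, rng):
    """Monte-Carlo estimate of tr(A A^T) as the mean of ||eps^T A||^2."""
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(len(A))]
        v = vjp(eps, A)
        total += sum(x * x for x in v)
    return total / n_samples

A = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 3.0],
     [0.0, 0.25, -2.0]]
exact = sum(x * x for row in A for x in row)            # squared Frobenius norm
estimate = frobenius_sq_estimate(A, 20000, random.Random(0))
print(exact, estimate)
```

Many samples are drawn here only to make the agreement visible; the experimental setup in Section 6 reports one noise sample per solver time-step for the divergence estimate.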
presented in Algorithm 1. The optimization problem to be
solved is
<<FORMULA>>
<<FORMULA>>
<<FORMULA>> (17)
where z(x, t) is determined by numerically solving (ODE).
Note that we take the mean over number of samples and
input dimension. This is to ensure that the choice of regularization
strength λ_K and λ_J is independent of dimension
size and sample size.
To compute the three integrals and the log-probability under
q of z(x, T) at final time T, we augment the dynamics of
the ODE with three extra terms, so that the entire system
solved by the numerical integrator is
<<FORMULA>> (RNODE)
Here E, l, and n are respectively the kinetic energy, the
log determinant of the Jacobian, and the integral of the
Frobenius norm of the Jacobian.
Both the divergence term and the Jacobian Frobenius norm
are approximated with Monte-Carlo trace estimates. In our
implementation, the Jacobian Frobenius estimate reuses
the computation ε^T ∇f from the divergence estimate for
efficiency. We remark that the kinetic energy term only
requires the computation of a dot product. Thus just as
in FFJORD, our implementation scales linearly with the
number of time steps taken by the ODE solver.
Gradients of the objective function with respect to the network
parameters are computed using the adjoint sensitivity
method (Pontryagin et al., 1962; Chen et al., 2018).
6. Experimental design
Here we demonstrate the benefits of regularizing neural
ODEs on generative models, an application where neural
ODEs have shown strong empirical performance. We
use four datasets: CIFAR10 (Krizhevsky & Hinton, 2009),
MNIST (LeCun & Cortes, 1998), downsampled ImageNet
(64x64) (van den Oord et al., 2016), and 5bit CelebA-HQ
(256x256) (Karras et al., 2017). We use an identical neural
architecture to that of Grathwohl et al. (2019). The dynamics
(Kingma & Dhariwal, 2018) trained with 40 GPUs for a week;
in contrast we train with four GPUs in just under a week.
<<FIGURE>>
Figure 5. Ablation study of the effect of the two regularizers, comparing two measures of flow regularity during training with a fixed
step-size ODE solver. Figure 5a: mean Jacobian Frobenius norm as a function of training epoch. Figure 5b: mean kinetic energy of the
flow as a function of training epoch. Figure 5c: number of function evaluations.
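The augmented system (RNODE) of Section 5, integrated with a fixed-grid four stage Runge-Kutta solver of step size 0.25 as in Section 6, can be sketched on a toy problem. The assumptions are substantial and deliberate: a hypothetical autonomous linear field f(z) = Az replaces the network, so the divergence tr(A) and the squared Frobenius norm of the Jacobian are exact rather than Monte-Carlo estimates; the penalty integrand is taken to be the squared Frobenius norm; and the state layout (z1, z2, l, E, n) and function names are illustrative.

```python
import math

A = [[-0.5, 1.0],
     [-1.0, -0.5]]   # toy Jacobian: tr(A) = -1, squared Frobenius norm = 2.5

def f(z):
    """Toy linear vector field f(z) = A z, standing in for the network."""
    return [A[0][0] * z[0] + A[0][1] * z[1],
            A[1][0] * z[0] + A[1][1] * z[1]]

def augmented(s):
    """Right-hand side for the augmented state s = (z1, z2, l, E, n)."""
    fz = f(s[:2])
    divergence = A[0][0] + A[1][1]               # integrand of l
    kinetic = fz[0] ** 2 + fz[1] ** 2            # integrand of E (a dot product)
    frob_sq = sum(a * a for row in A for a in row)  # integrand of n
    return [fz[0], fz[1], divergence, kinetic, frob_sq]

def rk4_step(g, s, h):
    """One classical four stage Runge-Kutta step for ds/dt = g(s)."""
    k1 = g(s)
    k2 = g([si + 0.5 * h * ki for si, ki in zip(s, k1)])
    k3 = g([si + 0.5 * h * ki for si, ki in zip(s, k2)])
    k4 = g([si + h * ki for si, ki in zip(s, k3)])
    return [si + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for si, a, b, c, d in zip(s, k1, k2, k3, k4)]

state = [1.0, 0.0, 0.0, 0.0, 0.0]    # z(0) = (1, 0); l, E, n start at zero
for _ in range(4):                   # four steps of size 0.25 cover [0, 1]
    state = rk4_step(augmented, state, 0.25)
z1, z2, l, E, n = state

E_exact = 1.25 * (1.0 - math.exp(-1.0))   # closed form for this linear field
print(l, E, E_exact, n)
```

For this field the three accumulated scalars have closed forms, so the sketch doubles as a check that a coarse fixed grid is accurate when the dynamics are well conditioned.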
are defined by a neural network <<f(z, t; \theta(t)) : \mathbb{R}^d \times \mathbb{R}^+ \mapsto
\mathbb{R}^d>> where <<\theta(t)>> is piecewise constant in time. On MNIST we
use 10 pieces; CIFAR10 uses 14; downsampled ImageNet
uses 18; and CelebA-HQ uses 26 pieces. Each piece is a
4-layer deep convolutional network composed of 3x3 kernels
and softplus activation functions. Intermediary layers
have 64 hidden dimensions, and time t is concatenated to
the spatial input z. The integration time of each piece is
[0, 1]. Weight matrices are chosen to imitate the multi-scale
architecture of Real NVP (Dinh et al., 2017), in that images
are squeezed via a permutation to halve image height
and width but quadruple the number of channels. Divergence
of f is estimated using the Gaussian Monte-Carlo
trace estimator with one sample of fixed noise per solver
time-step.
On MNIST and CIFAR10 we train with a batch size of
200 and train for 100 epochs on a single GPU3, using the
Adam optimizer (Kingma & Ba, 2015) with a learning rate
of 1e-3. On the two larger datasets, we train with four
GPUs, using a per-GPU batch size of respectively 3 and 50
for CelebA-HQ and ImageNet. Data is preprocessed by perturbing
with uniform noise followed by the logit transform.
The reference implementation of FFJORD solves the dynamics
using a Runge-Kutta 4(5) adaptive solver (Dormand
& Prince, 1980) with error tolerances 1e-5 and initial step
size 1e-2. We have found that using less accurate solvers
on the reference implementation of FFJORD results in numerically
unstable training dynamics. In contrast, a simple
fixed-grid four stage Runge-Kutta solver suffices for RNODE
during training on MNIST and CIFAR10, using a
step size of 0.25. The step size was determined based on
a simple heuristic of starting with 0.5 and decreasing the
step size by a factor of two until the discrete dynamics were
stable and achieved good performance. The Runge-Kutta
4(5) adaptive solver was used on the two larger datasets. We
have also observed that RNODE improves the training time
of the adaptive solvers as well, requiring many fewer function
evaluations; however in Python we have found that the
fixed grid solver is typically quicker at a specified number
of function evaluations. At test time RNODE uses the same
adaptive solver as FFJORD.
We always initialize RNODE so that <<f(z, t) = 0>>; thus training
begins with an initial identity map. This is done by zeroing
the parameters of the last layer in each piece (block),
following Goyal et al. (2017). The identity map is an appropriate
choice because it has zero transport cost and zero
Frobenius norm. Moreover the identity map is trivially
solvable for any numerical solver, thus training begins
without any effort required on the solver's behalf.
On all datasets we set both the kinetic energy regularization
coefficient λ_K and the Jacobian norm coefficient λ_J to 0.01.
7. Results
A comparison of RNODE against FFJORD and other flow-based
generative models is presented in Table 1. We report
both our running of “vanilla” FFJORD and the results as
originally reported in (Grathwohl et al., 2019). We highlight
that RNODE runs roughly 2.8x faster than FFJORD on both
datasets, while achieving or surpassing the performance of
FFJORD. This can further be seen in Figure 2 where we plot
bits per dimension (<<-\frac{1}{d} \log_2 p(x)>>, a normalized measure
of log-likelihood) on the validation set as a function of
training epoch, for both datasets. Visual inspection of the
sample quality reveals no qualitative difference between
<<FIGURE>>
Figure 6. Quality of generated samples with and without regularization on MNIST, left, and CIFAR10, right.
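The bits per dimension measure used in Section 7 is a rescaling of the log-likelihood. The sketch below shows only that conversion, assuming the log-likelihood is given in nats; the helper name `bits_per_dim` and the example values are hypothetical, and any corrections for preprocessing such as dequantization or the logit transform are left out.

```python
import math

def bits_per_dim(log_px_nats, d):
    """Convert a log-likelihood in nats into bits per dimension: -(1/d) log2 p(x)."""
    return -log_px_nats / (d * math.log(2.0))

d = 28 * 28   # number of dimensions of an MNIST-sized image

# A model that spreads mass uniformly over 256 levels per dimension
# assigns log p(x) = -d * ln(256), which is 8 bits per dimension.
uniform_bpd = bits_per_dim(-d * math.log(256.0), d)

# A hypothetical log-likelihood chosen so that the result is 0.97 bits/dim.
example_bpd = bits_per_dim(-0.97 * d * math.log(2.0), d)

print(uniform_bpd, example_bpd)
```

Because the measure divides by the dimension d, values are comparable across datasets with different image sizes, which is why it is used for the comparisons across datasets.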
regularized and unregularized approaches; refer to Figure 6.
Generated images for downsampled ImageNet and CelebA-HQ
are deferred to the supplementary materials; we provide
smaller generated images for networks trained on CelebA-HQ
64x64 in Figure 4.
Surprisingly, our run of “vanilla” FFJORD achieved slightly
better performance than the results reported in (Grathwohl
et al., 2019). We suspect the discrepancy in performance
and run times between our implementation of FFJORD and
that of the original paper is due to batch size: Grathwohl
et al. use a batch size of 900 and train on six GPUs, whereas
on MNIST and CIFAR10 we use a batch size of 200 and
train on a single GPU.
We were not able to train vanilla FFJORD on ImageNet64,
due to numerical underflow in the adaptive solver's time step.
This issue cannot be remedied by increasing the solver's
error tolerance, for this would bias the log-likelihood estimates
on validation.
7.1. Ablation study on MNIST
In Figure 5, we compare the effect of each regularizer by
itself on the training dynamics with the fixed grid ODE
solver on the MNIST dataset. Without any regularization at
all, training dynamics are numerically unstable and fail after
just under 50 epochs. This is precisely when the Jacobian
norm grows large; refer to Figure 5a. Figure 5a demonstrates
that each regularizer by itself is able to control the Jacobian
norm. The Jacobian regularizer is better suited to this task,
although it is interesting that the kinetic energy regularizer
also improves the Jacobian norm. Unsurprisingly Figure 5b
demonstrates that the addition of the kinetic energy regularizer
encourages flows to travel a minimal distance. In addition,
we see that the Jacobian norm alone also has a beneficial
effect on the distance particles travel. Overall, the results
support our theoretical reasoning empirically.
8. Previous generative flows inspired by optimal transport
Zhang et al. (2018) define a neural ODE flow where the
dynamics are given as the gradient of a scalar potential function.
This interpretation has deep connections to optimal
transport: the optimal transport map is the gradient of a
convex potential function. Yang & Karniadakis (2019) continue
along these lines, and define an optimal transport map again
as a scalar potential gradient. Yang & Karniadakis (2019)
enforce that the learned map is in fact an optimal transport
map by penalizing their objective function with a term
measuring violations of the continuity equation. Ruthotto
et al. (2019) place generative flows within a broader context
of mean field games, and as an example consider a neural
ODE gradient potential flow solving the optimal transport
problem in up to 100 dimensions. We also note the recent
work of Twomey et al. (2019), who proposed regularizing
neural ODEs with an Euler-step discretization of the kinetic
energy term to enforce straightness, although connections
to optimal transport were not discussed.
When a flow is the gradient of a scalar potential, the change
of variables formula (4) simplifies so that the divergence
term is replaced by the Laplacian of the scalar potential.
Although mathematically parsimonious and theoretically
well-motivated, we chose not to implement our flow as the
gradient of a scalar potential function due to computational
How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
constraints: such an implementation would require triple
backprop (twice to compute or approximate the Laplacian,
and once more for the parameter gradient). Ruthotto et al.
(2019) circumvented this problem by utilizing special structural
properties of residual networks to efficiently compute
the Laplacian.
9. Discussion
In practice, RNODE is simple to implement, and only requires
augmenting the dynamics (ODE) with two extra
scalar equations (one for the kinetic energy term, and another
for the Jacobian penalty). In the setting of FFJORD,
because we may recycle intermediary terms used in the
divergence estimate, the computational cost of evaluating
these two extra equations is minimal. RNODE introduces
two extra hyperparameters related to the strength of the regularizers;
we have found these required almost no tuning.
Although the problem of classification was not considered
in this work, we believe RNODE may offer similar improvements
both in training time and the regularity of the
classifier learned. In the classification setting we expect the
computational overhead of calculating the two extra terms
should be marginal relative to gains made in training time.
10. Conclusion
We have presented RNODE, a regularized method for neural
ODEs. This regularization approach is theoretically
well-motivated, and encourages neural ODEs to learn well-behaved
dynamics. As a consequence, numerical integration
of the learned dynamics is straightforward and relatively
easy, which means fewer discretizations are needed to solve
the dynamics. In many circumstances, this allows for the replacement
of adaptive solvers with fixed grid solvers, which
can be more efficient during training. This leads to a substantial
speedup in training time, while still maintaining
the same empirical performance, opening the use of neural
ODEs to large-scale applications.
Acknowledgements
C. F. and A. O. were supported by a grant from the Innovative
Ideas Program of the Healthy Brains and Healthy Lives
initiative (HBHL) through McGill University.
L. N. was supported by AFOSR MURI FA9550-18-1-0502,
AFOSR Grant No. FA9550-18-1-0167, and ONR Grant No.
N00014-18-1-2527.
A. O. was supported by the Air Force Office of Scientific
Research under award number FA9550-18-1-0167.
Resources used in preparing this research were provided, in
part, by the Province of Ontario, the Government of Canada
through CIFAR, and companies sponsoring the Vector Institute
(www.vectorinstitute.ai/#partners).
References
Avron, H. and Toledo, S. Randomized algorithms for estimating
the trace of an implicit symmetric positive semi-definite
matrix. J. ACM, 58(2):8:1–8:34, 2011. doi:
10.1145/1944345.1944349. URL https://doi.org/10.1145/1944345.1944349.
Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud,
D., and Jacobsen, J. Invertible residual networks.
In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings
of the 36th International Conference on Machine
Learning, ICML 2019, 9-15 June 2019, Long
Beach, California, USA, volume 97 of Proceedings
of Machine Learning Research, pp. 573–582. PMLR,
2019. URL http://proceedings.mlr.press/v97/behrmann19a.html.
Benamou, J.-D. and Brenier, Y. A computational fluid mechanics
solution to the Monge-Kantorovich mass transfer
problem. Numerische Mathematik, 84(3):375–393, 2000.
Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud,
D. Neural Ordinary Differential Equations. In
Advances in Neural Information Processing Systems 31:
Annual Conference on Neural Information Processing
Systems 2018, NeurIPS 2018, 3-8 December 2018,
Montréal, Canada, pp. 6572–6583, 2018. URL
http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.
Chen, T. Q., Behrmann, J., Duvenaud, D., and Jacobsen,
J. Residual flows for invertible generative modeling.
In Wallach, H. M., Larochelle, H., Beygelzimer,
A., d'Alché-Buc, F., Fox, E. B., and Garnett, R.
(eds.), Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019, 8-14 December
2019, Vancouver, BC, Canada, pp. 9913–9923,
2019. URL http://papers.nips.cc/paper/9183-residual-flows-for-invertible-generative-modeling.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density
estimation using real NVP. In 5th International
Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings,
2017. URL https://openreview.net/forum?id=HkpbnH9lx.
Dormand, J. R. and Prince, P. J. A family of embedded
Runge-Kutta formulae. Journal of computational and
applied mathematics, 6(1):19–26, 1980.
Drucker, H. and LeCun, Y. Improving generalization performance
using double backpropagation. IEEE Trans.
Neural Networks, 3(6):991–997, 1992. doi: 10.1109/72.165600.
URL https://doi.org/10.1109/72.165600.
Dupont, E., Doucet, A., and Teh, Y. W. Augmented
neural ODEs. In Wallach, H. M., Larochelle, H.,
Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett,
R. (eds.), Advances in Neural Information Processing
Systems 32: Annual Conference on Neural
Information Processing Systems 2019, NeurIPS 2019,
8-14 December 2019, Vancouver, BC, Canada, pp.
3134–3144, 2019. URL http://papers.nips.cc/paper/8577-augmented-neural-odes.
E, W. A Proposal on Machine Learning via Dynamical
Systems. Communications in Mathematics and
Statistics, 5(1):1–11, March 2017. ISSN 2194-671X.
doi: 10.1007/s40304-017-0103-z. URL https://doi.org/10.1007/s40304-017-0103-z.
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P.,
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
He, K. Accurate, large minibatch SGD: training imagenet
in 1 hour. CoRR, abs/1706.02677, 2017. URL
http://arxiv.org/abs/1706.02677.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever,
I., and Duvenaud, D. FFJORD: free-form continuous
dynamics for scalable reversible generative models.
In 7th International Conference on Learning Representations,
ICLR 2019, New Orleans, LA, USA, May
6-9, 2019, 2019. URL https://openreview.net/forum?id=rJxgknCcK7.
Haber, E. and Ruthotto, L. Stable architectures for deep
neural networks. Inverse Problems, 34(1):014004, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual
learning for image recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR 2016, Las Vegas, NV, USA, June 27-30,
2016, pp. 770–778. IEEE Computer Society, 2016. doi:
10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P.
Flow++: Improving flow-based generative models with
variational dequantization and architecture design. In
Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings
of the 36th International Conference on Machine Learning,
ICML 2019, 9-15 June 2019, Long Beach, California,
USA, volume 97 of Proceedings of Machine Learning
Research, pp. 2722–2730. PMLR, 2019. URL
http://proceedings.mlr.press/v97/ho19a.html.
Hutchinson, M. F. A stochastic estimator of the trace of the
influence matrix for Laplacian smoothing splines. Communications
in Statistics-Simulation and Computation,
19(2):433–450, 1990.
Kanaa, D., Voleti, V., Kahou, S., and Pal, C. Simple video
generation using neural ODEs. Workshop on Learning
with Rich Experience, Advances in Neural Information
Processing Systems 32: Annual Conference on Neural
Information Processing Systems 2019, NeurIPS 2019,
8-14 December 2019, Vancouver, BC, Canada, 2019.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive
growing of gans for improved quality, stability,
and variation. CoRR, abs/1710.10196, 2017. URL
http://arxiv.org/abs/1710.10196.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization.
In 3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA, USA, May
7-9, 2015, Conference Track Proceedings, 2015. URL
http://arxiv.org/abs/1412.6980.
Kingma, D. P. and Dhariwal, P. Glow: Generative flow
with invertible 1x1 convolutions. In Bengio, S., Wallach,
H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N.,
and Garnett, R. (eds.), Advances in Neural Information
Processing Systems 31: Annual Conference on Neural
Information Processing Systems 2018, NeurIPS 2018,
3-8 December 2018, Montréal, Canada, pp. 10236–10245,
2018. URL http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.
Köhler, J., Klein, L., and Noé, F. Equivariant flows: sampling
configurations for multi-body systems with symmetric
energies. arXiv preprint arXiv:1910.00753, 2019.
Krizhevsky, A. and Hinton, G. Learning multiple
layers of features from tiny images. Technical report,
University of Toronto, 2009. URL
http://www.cs.toronto.edu/~kriz/cifar.html.
LeCun, Y. and Cortes, C. The MNIST database of handwritten
digits. 1998. URL http://yann.lecun.com/exdb/mnist/.
Li, X., Wong, T. L., Chen, R. T. Q., and Duvenaud, D. Scalable
gradients for stochastic differential equations. CoRR,
abs/2001.01328, 2020. URL http://arxiv.org/abs/2001.01328.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and
Sohl-Dickstein, J. Sensitivity and generalization in neural
networks: an empirical study. In 6th International Conference
on Learning Representations, ICLR 2018, Vancouver,
BC, Canada, April 30 - May 3, 2018, Conference
Track Proceedings. OpenReview.net, 2018. URL
https://openreview.net/forum?id=HJC2SzZCW.
Pontryagin, L. S., Mishchenko, E., Boltyanskii, V., and
Gamkrelidze, R. The mathematical theory of optimal
processes. 1962.
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent ordinary
differential equations for irregularly-sampled time
series. In Advances in Neural Information Processing
Systems, pp. 5321–5331, 2019.
Ruthotto, L. and Haber, E. Deep neural networks motivated
by partial differential equations. Journal of Mathematical
Imaging and Vision, pp. 1–13, 2018.
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and
Fung, S. W. A machine learning framework for solving
high-dimensional mean field game and mean field
control problems. CoRR, abs/1912.01825, 2019. URL
http://arxiv.org/abs/1912.01825.
Santambrogio, F. Benamou-Brenier and other continuous
numerical methods, pp. 219–248. Springer International
Publishing, Cham, 2015. ISBN 978-3-319-20828-2.
doi: 10.1007/978-3-319-20828-2_6. URL
https://doi.org/10.1007/978-3-319-20828-2_6.
Twomey, N., Kozlowski, M., and Santos-Rodríguez, R. Neural
ODEs with stochastic vector field mixtures. CoRR,
abs/1905.09905, 2019. URL http://arxiv.org/abs/1905.09905.
van den Oord, A., Kalchbrenner, N., and Kavukcuoglu,
K. Pixel recurrent neural networks. CoRR,
abs/1601.06759, 2016. URL http://arxiv.org/abs/1601.06759.
Villani, C. Topics in Optimal Transportation. Graduate
studies in mathematics. American Mathematical Society,
2003. ISBN 9780821833124.
Villani, C. Optimal Transport: Old and New. Grundlehren
der mathematischen Wissenschaften. Springer Berlin Heidelberg,
2008. ISBN 9783540710509. URL
https://books.google.ca/books?id=hV8o5R7_5tkC.
Yang, L. and Karniadakis, G. E. Potential flow generator
with L2 Optimal Transport regularity for generative
models. CoRR, abs/1908.11462, 2019. URL
http://arxiv.org/abs/1908.11462.
Yildiz, C., Heinonen, M., and Lähdesmäki, H. ODE2VAE:
deep generative second order ODEs with Bayesian neural
networks. In Wallach, H. M., Larochelle, H., Beygelzimer,
A., d'Alché-Buc, F., Fox, E. B., and Garnett,
R. (eds.), Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019, 8-14 December
2019, Vancouver, BC, Canada, pp. 13412–13421, 2019.
URL http://papers.nips.cc/paper/9497-ode2vae-deep-generative-second-order-odes-with-bayesian-neural-networks.
Zhang, L., E, W., and Wang, L. Monge-Ampère flow for
generative modeling. CoRR, abs/1809.10188, 2018. URL
http://arxiv.org/abs/1809.10188.
A. Details of Section 3.1: Benamou-Brenier formulation in Lagrangian coordinates
The Benamou-Brenier formulation of the optimal transportation
(OT) problem in Eulerian coordinates is
<<FORMULA>> (18a)
<<FORMULA>> (18b)
<<\rho_0(x) = p>>, (18c)
<<\rho_T(z) = q>>. (18d)
The connection between continuous normalizing flows
(CNF) and OT becomes transparent once we rewrite (18) in
Lagrangian coordinates. Indeed, for regular enough velocity
fields f one has that the solution of the continuity equation
(18b), (18c) is given by <<\rho_t = z(\cdot, t)_\sharp p>> where z is the flow
<<FORMULA>>
The relation <<\rho_t = z(\cdot, t)_\sharp p>> means that for arbitrary test
function φ we have that
<<\int \phi(x) \rho_t(x, t) \, dx = \int \phi(z(x, t)) p(x) \, dx>>
Therefore (18) can be rewritten as
<<\min_f \int_0^T \int \|f(z(x, t), t)\|^2 p(x) \, dx \, dt>> (19a)
<<\text{subject to } \dot{z}(x, t) = f(z(x, t), t)>>, (19b)
<<z(x, 0) = x>>, (19c)
<<z(\cdot, T)_\sharp p = q>>. (19d)
Note that ρ_t is eliminated in this formulation. The terminal
condition (18d) is trivial to implement in Eulerian coordinates
(grid-based methods) but not so simple in Lagrangian
ones (19d) (grid-free methods). To enforce (19d) we introduce
a penalty term in the objective function that measures
the deviation of <<z(\cdot, T)_\sharp p>> from q. Thus, the penalized objective
function is
<<FORMULA>> (20)
where λ > 0 is the penalization strength. Next, we observe
that this objective function can be written as an expectation
with respect to x ∼ p. Indeed, the Kullback-Leibler divergence
is invariant under coordinate transformations, and
therefore
<<FORMULA>>
Hence, multiplying the objective function in (20) by λ and
ignoring the f-independent term <<\mathbb{E}_{x \sim p} \log p(x)>> we obtain
an equivalent objective function
<<FORMULA>> (21)
Finally, if we assume that <<\{x_i\}_{i=1}^N>> are iid sampled from p,
we obtain the empirical objective function
<<FORMULA>> (22)
B. Additional results
Here we present additional generated samples on the two
larger datasets considered, CelebA-HQ and ImageNet64. In
addition bits/dim on clean images are reported in Table 2.
<<FIGURE>>
Figure 7. Quality of FFJORD RNODE generated images on ImageNet-64.
<<FIGURE>>
Figure 8. Quality of FFJORD RNODE generated images on CelebA-HQ. We use temperature annealing, as described in (Kingma &
Dhariwal, 2018), to generate visually appealing images, with T = 0.5, . . . , 1.
Table 2. Additional results and model statistics of FFJORD RNODE. Here we report validation bits/dim on both clean validation images and
validation images with uniform variational dequantization (i.e., perturbed by uniform noise). We also report the number of trainable model
parameters.
<<TABLE>>
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
A guide to convolution arithmetic for deep
learning
The authors of this guide would like to thank David Warde-Farley,
Guillaume Alain and Caglar Gulcehre for their valuable feedback. We
are likewise grateful to all those who helped improve this tutorial with
helpful comments, constructive criticisms and code contributions. Keep
them coming!
Special thanks to Ethan Schoonover, creator of the Solarized color
scheme, 1 whose colors were used for the figures.
Feedback
Your feedback is welcomed! We did our best to be as precise, informative
and to the point as possible, but should there be anything you
feel might be an error or could be rephrased to be more precise or
comprehensible, please don't refrain from contacting us. Likewise, drop us a
line if you think there is something that might fit this technical report
and you would like us to discuss; we will make our best effort to update
this document.
Source code and animations
The code used to generate this guide along with its figures is available
on GitHub. 2 There the reader can also find an animated version of the
figures.
1 Introduction
1.1 Discrete convolutions
1.2 Pooling
2 Convolution arithmetic
2.1 No zero padding, unit strides
2.2 Zero padding, unit strides
2.2.1 Half (same) padding
2.2.2 Full padding
2.3 No zero padding, non-unit strides
2.4 Zero padding, non-unit strides
3 Pooling arithmetic
4 Transposed convolution arithmetic
4.1 Convolution as a matrix operation
4.2 Transposed convolution
4.3 No zero padding, unit strides, transposed
4.4 Zero padding, unit strides, transposed
4.4.1 Half (same) padding, transposed
4.4.2 Full padding, transposed
4.5 No zero padding, non-unit strides, transposed
4.6 Zero padding, non-unit strides, transposed
5 Miscellaneous convolutions
5.1 Dilated convolutions
Chapter 1
Introduction
Deep convolutional neural networks (CNNs) have been at the heart of spectacular
advances in deep learning. Although CNNs have been used as early as the
nineties to solve character recognition tasks (Le Cun et al., 1997), their current
widespread application is due to much more recent work, when a deep CNN
was used to beat the state of the art in the ImageNet image classification challenge
(Krizhevsky et al., 2012).
Convolutional neural networks therefore constitute a very useful tool for
machine learning practitioners. However, learning to use CNNs for the first time
is generally an intimidating experience. A convolutional layer's output shape
is affected by the shape of its input as well as the choice of kernel shape, zero
padding and strides, and the relationship between these properties is not
trivial to infer. This contrasts with fully-connected layers, whose output size is
independent of the input size. Additionally, CNNs also usually feature a pooling
stage, adding yet another level of complexity with respect to fully-connected
networks. Finally, so-called transposed convolutional layers (also known as
fractionally strided convolutional layers) have been employed in more and more work
as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015;
Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with
convolutional layers has been explained with various degrees of clarity.
This guide's objective is twofold:
1. Explain the relationship between convolutional layers and transposed
convolutional layers.
2. Provide an intuitive understanding of the relationship between input shape,
kernel shape, zero padding, strides and output shape in convolutional,
pooling and transposed convolutional layers.
In order to remain broadly applicable, the results shown in this guide are
independent of implementation details and apply to all commonly used machine
learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012),
Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).
This chapter briefly reviews the main building blocks of CNNs, namely
discrete convolutions and pooling. For an in-depth treatment of the subject, see
Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).
1.1 Discrete convolutions
The bread and butter of neural networks is affine transformations: a vector
is received as input and is multiplied with a matrix to produce an output (to
which a bias vector is usually added before passing the result through a non-
linearity). This is applicable to any type of input, be it an image, a sound
clip or an unordered collection of features: whatever their dimensionality, their
representation can always be flattened into a vector before the transformation.
Images, sound clips and many other similar kinds of data have an intrinsic
structure. More formally, they share these important properties:
They are stored as multi-dimensional arrays.
They feature one or more axes for which ordering matters (e.g., width and
height axes for an image, time axis for a sound clip).
One axis, called the channel axis, is used to access different views of the
data (e.g., the red, green and blue channels of a color image, or the left
and right channels of a stereo audio track).
These properties are not exploited when an affine transformation is applied;
in fact, all the axes are treated in the same way and the topological information
is not taken into account. Still, taking advantage of the implicit structure of
the data may prove very handy in solving some tasks, like computer vision and
speech recognition, and in these cases it would be best to preserve it. This is
where discrete convolutions come into play.
A discrete convolution is a linear transformation that preserves this notion
of ordering. It is sparse (only a few input units contribute to a given output
unit) and reuses parameters (the same weights are applied to multiple locations
in the input).
Figure 1.1 provides an example of a discrete convolution. The light blue
grid is called the input feature map. To keep the drawing simple, a single input
feature map is represented, but it is not uncommon to have multiple feature
maps stacked one onto another. 1 A kernel (shaded area) of value
<<FIGURE>>
Figure 1.1: Computing the output values of a discrete convolution.
<<FIGURE>>
Figure 1.2: Computing the output values of a discrete convolution for N = 2, i1 = i2 = 5, k1 = k2 = 3, s1 = s2 = 2, and p1 = p2 = 1.
slides across the input feature map. At each location, the product between
each element of the kernel and the input element it overlaps is computed and
the results are summed up to obtain the output in the current location. The
procedure can be repeated using different kernels to form as many output feature
maps as desired (Figure 1.3). The final outputs of this procedure are called
output feature maps. 2 If there are multiple input feature maps, the kernel will
have to be 3-dimensional or, equivalently, each one of the feature maps will
be convolved with a distinct kernel, and the resulting feature maps will be
summed up elementwise to produce the output feature map.
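The sliding procedure described above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration for a single input feature map and a single kernel, not how any framework implements it; the function name `conv2d` and its `stride` and `pad` arguments are illustrative choices, and the input and kernel values are arbitrary.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Naive 2-D discrete convolution (cross-correlation form) of one feature map."""
    x = np.pad(x, pad)  # surround the input with a border of `pad` zeros
    k_h, k_w = w.shape
    # Number of kernel placements along each axis.
    o_h = (x.shape[0] - k_h) // stride + 1
    o_w = (x.shape[1] - k_w) // stride + 1
    out = np.zeros((o_h, o_w), dtype=np.result_type(x, w))
    for r in range(o_h):
        for c in range(o_w):
            top, left = r * stride, c * stride
            patch = x[top:top + k_h, left:left + k_w]
            # Multiply overlapping elements and sum them up.
            out[r, c] = np.sum(patch * w)
    return out

x = np.arange(25.0).reshape(5, 5)  # arbitrary 5x5 input feature map
w = np.ones((3, 3))                # arbitrary 3x3 kernel
y = conv2d(x, w)
print(y.shape)
```

With several input feature maps, the same routine would be applied once per input map with a distinct kernel and the results summed elementwise, as described above.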
The convolution depicted in Figure 1.1 is an instance of a 2-D convolution,
but it can be generalized to N-D convolutions. For instance, in a 3-D
convolution, the kernel would be a cuboid and would slide across the height, width and
depth of the input feature map.
The collection of kernels defining a discrete convolution has a shape
corresponding to some permutation of (n, m, k_1, ..., k_N), where
<<FORMULA>>
The following properties affect the output size o_j of a convolutional layer
along axis j:
<<FORMULA>>
For instance, Figure 1.2 shows a 3x3 kernel applied to a 5x5 input padded
with a 1x1 border of zeros using 2x2 strides.
Note that strides constitute a form of subsampling. As an alternative to
being interpreted as a measure of how much the kernel is translated, strides can
also be viewed as how much of the output is retained. For instance, moving
the kernel by hops of two is equivalent to moving the kernel by hops of one but
retaining only odd output elements (Figure 1.4).
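The subsampling view of strides can be checked numerically. The sketch below assumes a square input and kernel with no padding; `correlate2d_valid` is a hypothetical helper written for this illustration, and the random values carry no meaning. Note that the "odd output elements" (first, third, ...) correspond to indices 0, 2, ... in zero-based indexing.

```python
import numpy as np

def correlate2d_valid(x, w, stride=1):
    """Slide a square kernel over a square input without padding."""
    k = w.shape[0]
    o = (x.shape[0] - k) // stride + 1
    return np.array([[np.sum(x[r * stride:r * stride + k,
                               c * stride:c * stride + k] * w)
                      for c in range(o)]
                     for r in range(o)])

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
w = rng.standard_normal((3, 3))

strided = correlate2d_valid(x, w, stride=2)  # hops of two
dense = correlate2d_valid(x, w, stride=1)    # hops of one
subsampled = dense[::2, ::2]                 # keep the 1st, 3rd, ... outputs

print(np.allclose(strided, subsampled))
```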
1 An example of this is what was referred to earlier as channels for images and sound clips.
2 While there is a distinction between convolution and cross-correlation from a signal
processing perspective, the two become interchangeable when the kernel is learned. For the sake
of simplicity and to stay consistent with most of the machine learning literature, the term
convolution will be used in this guide.
<<FIGURE>>
Figure 1.3: A convolution mapping from two input feature maps to three output
feature maps using a 3x2x3x3 collection of kernels w. In the left pathway,
input feature map 1 is convolved with kernel w_{1,1} and input feature map 2 is
convolved with kernel w_{1,2}, and the results are summed together elementwise
to form the first output feature map. The same is repeated for the middle and
right pathways to form the second and third feature maps, and all three output
feature maps are grouped together to form the output.
<<FIGURE>>
Figure 1.4: An alternative way of viewing strides. Instead of translating the
3x3 kernel by increments of s = 2 (left), the kernel is translated by increments
of 1 and only one in s = 2 output elements is retained (right).
1.2 Pooling
In addition to discrete convolutions themselves, pooling operations make up
another important building block in CNNs. Pooling operations reduce the size
of feature maps by using some function to summarize subregions, such as taking
the average or the maximum value.
Pooling works by sliding a window across the input and feeding the content
of the window to a pooling function. In some sense, pooling works very much
like a discrete convolution, but replaces the linear combination described by the
kernel with some other function. Figure 1.5 provides an example for average
pooling, and Figure 1.6 does the same for max pooling.
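As a rough illustration of the sliding-window description above, the sketch below applies an arbitrary reduction function to each window. The name `pool2d` and its arguments are illustrative assumptions, and the 5x5 input with a 3x3 window and unit strides mirrors the setting of Figures 1.5 and 1.6 (the input values themselves are arbitrary).

```python
import numpy as np

def pool2d(x, k, stride=1, reduce=np.max):
    """Slide a k x k window over x and summarize each window with `reduce`."""
    o_h = (x.shape[0] - k) // stride + 1
    o_w = (x.shape[1] - k) // stride + 1
    out = np.empty((o_h, o_w), dtype=float)
    for r in range(o_h):
        for c in range(o_w):
            window = x[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = reduce(window)  # replaces the kernel's linear combination
    return out

x = np.arange(25.0).reshape(5, 5)
max_pooled = pool2d(x, k=3, reduce=np.max)
avg_pooled = pool2d(x, k=3, reduce=np.mean)
print(max_pooled.shape, avg_pooled.shape)
```

Passing the reduction as an argument makes the parallel with convolution explicit: only the function applied to each window changes.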
The following properties affect the output size o_j of a pooling layer along
axis j:
<<FORMULA>>
<<FIGURE>>
Figure 1.5: Computing the output values of a 3x3 average pooling operation on a 5x5 input using 1x1 strides.
<<FIGURE>>
Figure 1.6: Computing the output values of a 3x3 max pooling operation on a 5x5 input using 1x1 strides.
Chapter 2
Convolution arithmetic
The analysis of the relationship between convolutional layer properties is eased
by the fact that they don't interact across axes, i.e., the choice of kernel size,
stride and zero padding along axis j only affects the output size of axis j.
Because of that, this chapter will focus on the following simplified setting:
2-D discrete convolutions (N = 2),
square inputs (i1 = i2 = i),
square kernel size (k1 = k2 = k),
same strides along both axes (s1 = s2 = s),
same zero padding along both axes (p1 = p2 = p).
This facilitates the analysis and the visualization, but keep in mind that the
results outlined here also generalize to the N-D and non-square cases.
2.1 No zero padding, unit strides
The simplest case to analyze is when the kernel just slides across every position
of the input (i.e., s = 1 and p = 0). Figure 2.1 provides an example for i = 4
and k = 3.
One way of defining the output size in this case is by the number of possible
placements of the kernel on the input. Let's consider the width axis: the kernel
starts on the leftmost part of the input feature map and slides by steps of one
until it touches the right side of the input. The size of the output will be equal
to the number of steps made, plus one, accounting for the initial position of the
kernel (Figure 2.8a). The same logic applies for the height axis.
More formally, the following relationship can be inferred:
Relationship 1. For any i and k, and for s = 1 and p = 0,
<<FORMULA>>
2.2 Zero padding, unit strides
To factor in zero padding (i.e., only restricting to s = 1), let's consider its effect
on the effective input size: padding with p zeros changes the effective input size
from i to i + 2p. In the general case, Relationship 1 can then be used to infer
the following relationship:
Relationship 2. For any i, k and p, and for s = 1,
<<FORMULA>>
Figure 2.2 provides an example for i = 5, k = 4 and p = 2.
In practice, two specific instances of zero padding are used quite extensively
because of their respective properties. Let's discuss them in more detail.
2.2.1 Half (same) padding
Having the output size be the same as the input size (i.e., o = i) can be a
desirable property:
Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ N),
s = 1 and p = ⌊k/2⌋ = n,
<<FORMULA>>
This is sometimes referred to as half (or same) padding. Figure 2.3 provides an
example for i = 5, k = 3 and (therefore) p = 1.
2.2.2 Full padding
While convolving a kernel generally decreases the output size with respect to
the input size, sometimes the opposite is required. This can be achieved with
proper zero padding:
Relationship 4. For any i and k, and for p = k − 1 and s = 1,
<<FORMULA>>
<<FIGURE>>
Figure 2.1: (No padding, unit strides) Convolving a 3x3 kernel over a 4x4
input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0).
<<FIGURE>>
Figure 2.2: (Arbitrary padding, unit strides) Convolving a 4x4 kernel over a
5x5 input padded with a 2x2 border of zeros using unit strides (i.e., i = 5,
k = 4, s = 1 and p = 2).
<<FIGURE>>
Figure 2.3: (Half padding, unit strides) Convolving a 3x3 kernel over a 5x5
input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).
<<FIGURE>>
Figure 2.4: (Full padding, unit strides) Convolving a 3x3 kernel over a 5x5
input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).
This is sometimes referred to as full padding, because in this setting every
possible partial or complete superimposition of the kernel on the input feature
map is taken into account. Figure 2.4 provides an example for i = 5, k = 3 and
(therefore) p = 2.
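The unit-stride relationships of this section can be checked against the figure settings with a short sketch. It assumes the standard form of Relationship 2, o = (i − k) + 2p + 1, which follows from applying the "number of steps plus one" count to an effective input of size i + 2p; the function name is a hypothetical choice for this illustration.

```python
def output_size_unit_stride(i, k, p=0):
    """Output size along one axis for s = 1: o = (i - k) + 2p + 1."""
    return (i - k) + 2 * p + 1

# Settings taken from Figures 2.1 to 2.4.
no_padding = output_size_unit_stride(4, 3)              # p = 0
arbitrary = output_size_unit_stride(5, 4, p=2)
half = output_size_unit_stride(5, 3, p=3 // 2)          # p = floor(k / 2)
full = output_size_unit_stride(5, 3, p=3 - 1)           # p = k - 1

print(no_padding, arbitrary, half, full)
```

Half padding returns the input size itself, and full padding returns i + (k − 1), in line with the descriptions of Relationships 3 and 4.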
2.3 No zero padding, non-unit strides
All relationships derived so far only apply for unit-strided convolutions. Incorporating
non-unitary strides requires another inference leap. To facilitate
the analysis, let's momentarily ignore zero padding (i.e., s > 1 and p = 0).
Figure 2.5 provides an example for i = 5, k = 3 and s = 2.
Once again, the output size can be defined in terms of the number of possible
placements of the kernel on the input. Let's consider the width axis: the kernel
starts as usual on the leftmost part of the input, but this time it slides by steps
of size s until it touches the right side of the input. The size of the output is
again equal to the number of steps made, plus one, accounting for the initial
position of the kernel (Figure 2.8b). The same logic applies for the height axis.
From this, the following relationship can be inferred:
Relationship 5. For any i, k and s, and for p = 0,
<<FORMULA>>
The floor function accounts for the fact that sometimes the last possible step
does not coincide with the kernel reaching the end of the input, i.e., some input
units are left out (see Figure 2.7 for an example of such a case).
2.4 Zero padding, non-unit strides
The most general case (convolving over a zero padded input using non-unit
strides) can be derived by applying Relationship 5 on an effective input of size
i+ 2p, in analogy to what was done for Relationship 2:
Relationship 6. For any i, k, p and s,
<<FORMULA>>
As before, the floor function means that in some cases a convolution will produce
the same output size for multiple input sizes. More specifically, if i + 2p − k is
a multiple of s, then any input size j = i + a, a ∈ {0, ..., s − 1} will produce
the same output size. Note that this ambiguity applies only for s > 1.
Figure 2.6 shows an example with i = 5, k = 3, s = 2 and p = 1, while
Figure 2.7 provides an example for i = 6, k = 3, s = 2 and p = 1. Interestingly,
despite having different input sizes these convolutions share the same output
size. While this doesn't affect the analysis for convolutions, this will complicate
the analysis in the case of transposed convolutions.
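The general case and the ambiguity introduced by the floor can be illustrated with a small sketch. It assumes the usual statement of Relationship 6, o = ⌊(i + 2p − k)/s⌋ + 1; the function name `conv_output_size` is an illustrative choice.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis: o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

# Settings of Figures 2.5, 2.6 and 2.7.
fig_2_5 = conv_output_size(i=5, k=3, s=2, p=0)
fig_2_6 = conv_output_size(i=5, k=3, s=2, p=1)
fig_2_7 = conv_output_size(i=6, k=3, s=2, p=1)
print(fig_2_5, fig_2_6, fig_2_7)

# With s = 2, input sizes i and i + 1 collapse to the same output size
# whenever i + 2p - k is a multiple of s.
sizes = {i: conv_output_size(i, k=3, s=2, p=1) for i in range(5, 9)}
print(sizes)
```

The inputs of size 5 and 6 map to the same output size, which is exactly the information a transposed convolution cannot recover from the output shape alone.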
<<FIGURE>>
Figure 2.5: (No zero padding, arbitrary strides) Convolving a 3x3 kernel over
a 5x5 input using 2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0).
<<FIGURE>>
Figure 2.6: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
5x5 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 5,
k = 3, s = 2 and p = 1).
<<FIGURE>>
Figure 2.7: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
6x6 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 6,
k = 3, s = 2 and p = 1). In this case, the bottom row and right column of the
zero padded input are not covered by the kernel.
(a) The kernel has to slide two steps to the right to touch the right side of
the input (and equivalently downwards). Adding one to account for the initial
kernel position, the output size is 3x3.
(b) The kernel has to slide one step of size two to the right to touch the right
side of the input (and equivalently downwards). Adding one to account for the
initial kernel position, the output size is 2x2.
<<FIGURE>>
Figure 2.8: Counting kernel positions.
Chapter 3
Pooling arithmetic
In a neural network, pooling layers provide invariance to small translations of
the input. The most common kind of pooling is max pooling, which consists
in splitting the input in (usually non-overlapping) patches and outputting the
maximum value of each patch. Other kinds of pooling exist, e.g., mean or
average pooling, which all share the same idea of aggregating the input locally
by applying a non-linearity to the content of some patches (Boureau et al.,
2010a,b, 2011; Saxe et al., 2011).
Some readers may have noticed that the treatment of convolution arithmetic
only relies on the assumption that some function is repeatedly applied onto
subsets of the input. This means that the relationships derived in the previous
chapter can be reused in the case of pooling arithmetic. Since pooling does not
involve zero padding, the relationship describing the general case is as follows:
Relationship 7. For any i, k and s,
<<FORMULA>>
This relationship holds for any type of pooling.
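As a quick check, the sketch below compares the pooling output size, assumed here to be o = ⌊(i − k)/s⌋ + 1 (the convolution relationship with p = 0), with the shape produced by an actual max pooling. The pooling is written with NumPy's `sliding_window_view` (available in NumPy 1.20 and later); the function names and the 6x6 input are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pool_output_size(i, k, s):
    """Output size of a pooling layer along one axis (no zero padding)."""
    return (i - k) // s + 1

def max_pool2d(x, k, s):
    """Max pooling with a k x k window and stride s."""
    # All k x k windows at unit stride, then keep every s-th window.
    windows = sliding_window_view(x, (k, k))[::s, ::s]
    return windows.max(axis=(2, 3))

x = np.arange(36.0).reshape(6, 6)
pooled = max_pool2d(x, k=2, s=2)  # non-overlapping 2x2 patches
print(pooled.shape, pool_output_size(6, 2, 2))
```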
Chapter 4
Transposed convolution arithmetic
The need for transposed convolutions generally arises from the desire to use a
transformation going in the opposite direction of a normal convolution, i.e., from
something that has the shape of the output of some convolution to something
that has the shape of its input while maintaining a connectivity pattern that
is compatible with said convolution. For instance, one might use such a
transformation as the decoding layer of a convolutional autoencoder or to project
feature maps to a higher-dimensional space.
Once again, the convolutional case is considerably more complex than the
fully-connected case, which only requires using a weight matrix whose shape has
been transposed. However, since every convolution boils down to an efficient
implementation of a matrix operation, the insights gained from the fully-connected
case are useful in solving the convolutional case.
As with convolution arithmetic, the discussion of transposed convolution
arithmetic is simplified by the fact that transposed convolution properties
don't interact across axes.
The chapter will focus on the following setting:
2-D transposed convolutions (N = 2),
square inputs (i1 = i2 = i),
square kernel size (k1 = k2 = k),
same strides along both axes (s1 = s2 = s),
same zero padding along both axes (p1 = p2 = p).
Once again, the results outlined generalize to the N-D and non-square cases.
4.1 Convolution as a matrix operation
Take for example the convolution represented in Figure 2.1. If the input and
output were to be unrolled into vectors from left to right, top to bottom, the
convolution could be represented as a sparse matrix C where the non-zero elements
are the elements w_{i,j} of the kernel (with i and j being the row and column
of the kernel respectively):
<<FORMULA>>
This linear operation takes the input matrix flattened as a 16-dimensional
vector and produces a 4-dimensional vector that is later reshaped as the 2x2
output matrix.
Using this representation, the backward pass is easily obtained by
transposing C; in other words, the error is backpropagated by multiplying the loss
with C^T. This operation takes a 4-dimensional vector as input and produces
a 16-dimensional vector as output, and its connectivity pattern is compatible
with C by construction.
Notably, the kernel w defines both the matrices C and C^T used for the
forward and backward passes.
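The matrix view can be made concrete with a short sketch that builds C for the setting of Figure 2.1 (a 3x3 kernel on a 4x4 input, unit strides, no padding) and compares it with a direct sliding-window computation. The helper `conv_matrix` is a hypothetical name, a dense array stands in for the sparse matrix, and the kernel and input values are random placeholders.

```python
import numpy as np

def conv_matrix(w, i):
    """Build C for a k x k kernel on an i x i input (s = 1, p = 0)."""
    k = w.shape[0]
    o = i - k + 1
    C = np.zeros((o * o, i * i))
    for r in range(o):              # output row
        for c in range(o):          # output column
            for u in range(k):      # kernel row
                for v in range(k):  # kernel column
                    # Row-major unrolling: left to right, top to bottom.
                    C[r * o + c, (r + u) * i + (c + v)] = w[u, v]
    return C

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
x = rng.standard_normal((4, 4))

C = conv_matrix(w, i=4)
y = (C @ x.ravel()).reshape(2, 2)            # forward pass: 16 -> 4

# Direct sliding-window computation for comparison.
direct = np.array([[np.sum(x[r:r + 3, c:c + 3] * w) for c in range(2)]
                   for r in range(2)])

back = C.T @ y.ravel()                       # multiplying with C^T: 4 -> 16
print(C.shape, np.allclose(y, direct), back.shape)
```

Each row of C holds the nine kernel weights and seven zeros, which is the sparsity and parameter reuse mentioned in Section 1.1.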
4.2 Transposed convolution
Let's now consider what would be required to go the other way around, i.e.,
map from a 4-dimensional space to a 16-dimensional space, while keeping the
connectivity pattern of the convolution depicted in Figure 2.1. This operation
is known as a transposed convolution.
Transposed convolutions (also called fractionally strided convolutions or
deconvolutions 1) work by swapping the forward and backward passes of a
convolution. One way to put it is to note that the kernel defines a convolution, but
whether it's a direct convolution or a transposed convolution is determined by
how the forward and backward passes are computed.
For instance, although the kernel w defines a convolution whose forward and
backward passes are computed by multiplying with C and C^T respectively, it
also defines a transposed convolution whose forward and backward passes are
computed by multiplying with C^T and (C^T)^T = C respectively. 2
Finally note that it is always possible to emulate a transposed convolution
with a direct convolution. The disadvantage is that it usually involves adding
1 The term “deconvolution” is sometimes used in the literature, but we advocate against it
on the grounds that a deconvolution is mathematically defined as the inverse of a convolution,
which is different from a transposed convolution.
2 The transposed convolution operation can be thought of as the gradient of some convolution
with respect to its input, which is usually how transposed convolutions are implemented
in practice.
many columns and rows of zeros to the input, resulting in a much less efficient
implementation.
Building on what has been introduced so far, this chapter will proceed some-
what backwards with respect to the convolution arithmetic chapter, deriving the
properties of each transposed convolution by referring to the direct convolution
with which it shares the kernel, and defining the equivalent direct convolution.
4.3 No zero padding, unit strides, transposed
The simplest way to think about a transposed convolution on a given input is
to imagine such an input as being the result of a direct convolution applied on
some initial feature map. The transposed convolution can then be considered as
the operation that allows recovering the shape 3 of this initial feature map.
Let's consider the convolution of a 3x3 kernel on a 4x4 input with unitary
stride and no padding (i.e., i = 4, k = 3, s = 1 and p = 0). As depicted in
Figure 2.1, this produces a 2x2 output. The transpose of this convolution will
then have an output of shape 4x4 when applied on a 2x2 input.
Another way to obtain the result of a transposed convolution is to apply an
equivalent but much less efficient direct convolution. The example described
so far could be tackled by convolving a 3x3 kernel over a 2x2 input padded
with a 2x2 border of zeros using unit strides (i.e., i' = 2, k' = k, s' = 1 and
p' = 2), as shown in Figure 4.1. Notably, the kernel and stride sizes remain
the same, but the input of the transposed convolution is now zero padded. 4
One way to understand the logic behind zero padding is to consider the
connectivity pattern of the transposed convolution and use it to guide the design
of the equivalent convolution. For example, the top left pixel of the input of the
direct convolution only contributes to the top left pixel of the output, the top
right pixel is only connected to the top right output pixel, and so on.
To maintain the same connectivity pattern in the equivalent convolution it is
necessary to zero pad the input in such a way that the first (top-left) application
of the kernel only touches the top-left pixel, i.e., the padding has to be equal to
the size of the kernel minus one.
Proceeding in the same fashion it is possible to determine similar observa-
tions for the other elements of the image, giving rise to the following relationship:
3 Note that the transposed convolution does not guarantee to recover the input itself, as it
is not defined as the inverse of the convolution, but rather just returns a feature map that has
the same width and height.
4 Note that although equivalent to applying the transposed matrix, this visualization adds
a lot of zero multiplications in the form of zero padding. This is done here for illustration
purposes, but it is inefficient, and software implementations will normally not perform the
useless zero multiplications.
Relationship 8. A convolution described by s = 1, p = 0 and k
has an associated transposed convolution described by k' = k, s' = s
and p' = k − 1 and its output size is
<<FORMULA>>
Interestingly, this corresponds to a fully padded convolution with unit strides.
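The equivalence with a fully padded direct convolution can be sketched as follows for the running example (k = 3, a 2x2 input to the transposed convolution, a 4x4 output). The helper names are hypothetical. One detail is an addition to the text above: for exact numerical agreement with multiplication by C^T, the kernel is rotated by 180 degrees in the equivalent convolution, which does not change any of the shape arithmetic. The check relies on the defining property of a transpose, namely that the inner products <Cx, y> and <x, C^T y> coincide.

```python
import numpy as np

def correlate_valid(x, w):
    """Direct convolution (cross-correlation form), s = 1 and p = 0."""
    k = w.shape[0]
    o = x.shape[0] - k + 1
    return np.array([[np.sum(x[r:r + k, c:c + k] * w) for c in range(o)]
                     for r in range(o)])

def conv_transpose_unit_stride(y, w):
    """Transposed convolution emulated by a fully padded direct convolution."""
    k = w.shape[0]
    padded = np.pad(y, k - 1)                      # p' = k - 1 zeros on every side
    return correlate_valid(padded, w[::-1, ::-1])  # kernel rotated by 180 degrees

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
x = rng.standard_normal((4, 4))  # stands for the input of the direct convolution
y = rng.standard_normal((2, 2))  # stands for the input of the transposed one

x_like = conv_transpose_unit_stride(y, w)

# Adjoint identity: <conv(x), y> equals <x, conv_transpose(y)>.
lhs = np.sum(correlate_valid(x, w) * y)
rhs = np.sum(x * x_like)
print(x_like.shape, np.isclose(lhs, rhs))
```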
4.4 Zero padding, unit strides, transposed
Knowing that the transpose of a non-padded convolution is equivalent to
convolving a zero padded input, it would be reasonable to suppose that the
transpose of a zero padded convolution is equivalent to convolving an input padded
with less zeros.
It is indeed the case, as shown in Figure 4.2 for i = 5, k = 4 and p = 2.
Formally, the following relationship applies for zero padded convolutions:
Relationship 9. A convolution described by s = 1, k and p has an
associated transposed convolution described by k' = k, s' = s and
p' = k − p − 1 and its output size is
<<FORMULA>>
4.4.1 Half (same) padding, transposed
By applying the same inductive reasoning as before, it is reasonable to expect
that the equivalent convolution of the transpose of a half padded convolution
is itself a half padded convolution, given that the output size of a half padded
convolution is the same as its input size. Thus the following relation applies:
Relationship 10. A convolution described by k = 2n + 1, n ∈ N,
s = 1 and p = ⌊k/2⌋ = n has an associated transposed convolution
described by k' = k, s' = s and p' = p and its output size is
<<FORMULA>>
Figure 4.3 provides an example for i = 5, k = 3 and (therefore) p = 1.
4.4.2 Full padding, transposed
Knowing that the equivalent convolution of the transpose of a non-padded con-
volution involves full padding, it is unsurprising that the equivalent of the trans-
pose of a fully padded convolution is a non-padded convolution:
<<FIGURE>>
Figure 4.1: The transpose of convolving a 3x3 kernel over a 4x4 input using
unit strides (i.e., i = 4, k = 3, s = 1 and p = 0). It is equivalent to convolving
a 3x3 kernel over a 2x2 input padded with a 2x2 border of zeros using unit
strides (i.e., i' = 2, k' = k, s' = 1 and p' = 2).
<<FIGURE>>
Figure 4.2: The transpose of convolving a 4x4 kernel over a 5x5 input padded
with a 2x2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and
p = 2). It is equivalent to convolving a 4x4 kernel over a 6x6 input padded
with a 1x1 border of zeros using unit strides (i.e., i' = 6, k' = k, s' = 1 and
p' = 1).
<<FIGURE>>
Figure 4.3: The transpose of convolving a 3x3 kernel over a 5x5 input using
half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1). It is
equivalent to convolving a 3x3 kernel over a 5x5 input using half padding
and unit strides (i.e., i' = 5, k' = k, s' = 1 and p' = 1).
Relationship 11. A convolution described by s = 1, k and p = k − 1
has an associated transposed convolution described by k' = k, s' = s
and p' = 0 and its output size is
<<FORMULA>>
Figure 4.4 provides an example for i = 5, k = 3 and (therefore) p = 2.
4.5 No zero padding, non-unit strides, transposed
Using the same kind of inductive logic as for zero padded convolutions, one
might expect that the transpose of a convolution with s > 1 involves an
equivalent convolution with s < 1. As will be explained, this is a valid intuition,
which is why transposed convolutions are sometimes called fractionally strided
convolutions.
Figure 4.5 provides an example for i = 5, k = 3 and s = 2 which helps
understand what fractional strides involve: zeros are inserted between input
units, which makes the kernel move around at a slower pace than with unit
strides. 5
For the moment, it will be assumed that the convolution is non-padded
(p = 0) and that its input size i is such that i − k is a multiple of s. In that
case, the following relationship holds:
Relationship 12. A convolution described by p = 0, k and s and
whose input size is such that i − k is a multiple of s, has an associated
transposed convolution described by ĩ', k' = k, s' = 1 and p' = k − 1,
where ĩ' is the size of the stretched input obtained by adding s − 1
zeros between each input unit, and its output size is
<<FORMULA>>
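The zero-insertion step can be sketched for the setting of Figure 4.5 (i = 5, k = 3, s = 2, p = 0, so the transposed convolution takes a 2x2 input). The helper names `stretch` and `conv_transpose` are hypothetical. As an assumption not spelled out in the text, the kernel is rotated by 180 degrees so that the result matches multiplication by C^T exactly; the shapes are unaffected. The check again uses the identity <Cx, y> = <x, C^T y>.

```python
import numpy as np

def correlate_valid(x, w, s=1):
    """Direct convolution (cross-correlation form) with stride s, no padding."""
    k = w.shape[0]
    o = (x.shape[0] - k) // s + 1
    return np.array([[np.sum(x[r * s:r * s + k, c * s:c * s + k] * w)
                      for c in range(o)]
                     for r in range(o)])

def stretch(y, s):
    """Insert s - 1 zeros between neighbouring units of a square input."""
    size = s * (y.shape[0] - 1) + 1
    out = np.zeros((size, size), dtype=y.dtype)
    out[::s, ::s] = y
    return out

def conv_transpose(y, w, s):
    """Transpose of a non-padded, stride-s convolution, emulated directly."""
    k = w.shape[0]
    padded = np.pad(stretch(y, s), k - 1)          # p' = k - 1, s' = 1
    return correlate_valid(padded, w[::-1, ::-1])  # kernel rotated by 180 degrees

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))
x = rng.standard_normal((5, 5))
y = rng.standard_normal((2, 2))

stretched = stretch(y, s=2)
x_like = conv_transpose(y, w, s=2)
lhs = np.sum(correlate_valid(x, w, s=2) * y)
rhs = np.sum(x * x_like)
print(stretched.shape, x_like.shape, np.isclose(lhs, rhs))
```

The stretched input has size 3, matching the stretched size reported in the caption of Figure 4.5, and the output recovers the 5x5 shape.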
4.6 Zero padding, non-unit strides, transposed
When the convolution's input size i is such that i + 2p − k is a multiple of s,
the analysis can be extended to the zero padded case by combining Relationship 9
and Relationship 12:
5 Doing so is inefficient and real-world implementations avoid useless multiplications by
zero, but conceptually it is how the transpose of a strided convolution can be thought of.
<<FIGURE>>
Figure 4.4: The transpose of convolving a 3x3 kernel over a 5x5 input using
full padding and unit strides (i.e.,i= 5,k= 3,s= 1 and p= 2). It is equivalent
to convolving a 3x3 kernel over a77input using unit strides (i.e.,i0 = 7,
k0 =k,s0 = 1 and p0 = 0).
<<FIGURE>>
Figure 4.5: The transpose of convolving a 3x3 kernel over a 5x5 input using
2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0). It is equivalent to convolving
a 3x3 kernel over a 2x2 input (with 1 zero inserted between inputs) padded
with a 2x2 border of zeros using unit strides (i.e., i' = 2, ĩ' = 3, k' = k, s' = 1
and p' = 2).
<<FIGURE>>
Figure 4.6: The transpose of convolving a 3x3 kernel over a 5x5 input padded
with a 1x1 border of zeros using 2x2 strides (i.e., i = 5, k = 3, s = 2 and
p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
1 zero inserted between inputs) padded with a 1x1 border of zeros using unit
strides (i.e., i' = 3, ĩ' = 5, k' = k, s' = 1 and p' = 1).
Relationship 13. A convolution described by k, s and p and whose
input size i is such that i + 2p - k is a multiple of s has an associated
transposed convolution described by ĩ', k' = k, s' = 1 and p' =
k - p - 1, where ĩ' is the size of the stretched input obtained by
adding s - 1 zeros between each input unit, and its output size is
<<FORMULA>>
<<FIGURE>>
Figure 4.6 provides an example for i = 5, k = 3, s = 2 and p = 1.
The constraint on the size of the input i can be relaxed by introducing
another parameter a ∈ {0, ..., s - 1} that allows to distinguish between the s
different cases that all lead to the same i':
Relationship 14. A convolution described by k, s and p has an
associated transposed convolution described by a, ĩ', k' = k, s' = 1
and p' = k - p - 1, where ĩ' is the size of the stretched input obtained
by adding s - 1 zeros between each input unit, and a = (i + 2p - k)
mod s represents the number of zeros added to the bottom and right
edges of the input, and its output size is
<<FORMULA>>
<<FIGURE>>
Figure 4.7 provides an example for i = 6, k = 3, s = 2 and p = 1.
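The output-size bookkeeping of Relationships 12 to 14 can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the guide: it assumes the general expression o' = s(i' - 1) + a + k - 2p with a = (i + 2p - k) mod s, which reduces to the padded and non-padded cases when a = 0 and p = 0. The function names are hypothetical. The loop checks that transposing each of the configurations used in Figures 4.5, 4.6 and 4.7 recovers the original input size.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of a direct convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i_t, k, s=1, p=0, a=0):
    """Assumed general form: o' = s * (i' - 1) + a + k - 2p."""
    return s * (i_t - 1) + a + k - 2 * p

# (i, k, s, p) for Figures 4.5, 4.6 and 4.7 respectively.
configs = [(5, 3, 2, 0), (5, 3, 2, 1), (6, 3, 2, 1)]
recovered = []
for i, k, s, p in configs:
    o = conv_output_size(i, k, s, p)   # size of the transposed convolution's input
    a = (i + 2 * p - k) % s            # extra zeros on the bottom and right edges
    recovered.append(transposed_conv_output_size(o, k, s, p, a))

print(recovered)
```

The third configuration is the only one with a = 1, which is exactly the case the extra parameter a was introduced to disambiguate.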
<<FIGURE>>
Figure 4.7: The transpose of convolving a 3x3 kernel over a 6x6 input padded
with a 1x1 border of zeros using 2x2 strides (i.e., i = 6, k = 3, s = 2 and
p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
1 zero inserted between inputs) padded with a 1x1 border of zeros (with an
additional border of size 1 added to the bottom and right edges) using unit
strides (i.e., i' = 3, ĩ' = 5, a = 1, k' = k, s' = 1 and p' = 1).
Chapter 5
Miscellaneous convolutions
5.1 Dilated convolutions
Readers familiar with the deep learning literature may have noticed the term
“dilated convolutions” (or “atrous convolutions”, from the French expression
convolutions à trous) appear in recent papers. Here we attempt to provide an
intuitive understanding of dilated convolutions. For a more in-depth description
and to understand in what contexts they are applied, see Chen et al. (2014); Yu
and Koltun (2015).
Dilated convolutions “inflate” the kernel by inserting spaces between the kernel
elements. The dilation “rate” is controlled by an additional hyperparameter
d. Implementations may vary, but there are usually d - 1 spaces inserted between
kernel elements such that d = 1 corresponds to a regular convolution.
Dilated convolutions are used to cheaply increase the receptive field of output
units without increasing the kernel size, which is especially effective when multiple
dilated convolutions are stacked one after another. For a concrete example,
see Oord et al. (2016), in which the proposed WaveNet model implements an
autoregressive generative model for raw audio which uses dilated convolutions
to condition new audio frames on a large context of past audio frames.
To understand the relationship tying the dilation rate d and the output size
o, it is useful to think of the impact of d on the effective kernel size. A kernel
of size k dilated by a factor d has an effective size
<<FORMULA>>
This can be combined with Relationship 6 to form the following relationship for
dilated convolutions:
Relationship 15. For any i, k, p and s, and for a dilation rate d,
<<FORMULA>>
<<FIGURE>>
Figure 5.1: (Dilated convolution) Convolving a 3x3 kernel over a 7x7 input
with a dilation factor of 2 (i.e., i = 7, k = 3, d = 2, s = 1 and p = 0).
Figure 5.1 provides an example for i = 7, k = 3 and d = 2.
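The effect of dilation on the output size can be sketched as follows. This is a small illustrative helper, not code from the guide: it assumes that d - 1 spaces are inserted between kernel elements, so that the effective kernel size is k + (k - 1)(d - 1), and substitutes that size into the general output-size expression for direct convolutions. The function names are hypothetical.

```python
def effective_kernel_size(k, d):
    """Size of a kernel of size k after inserting d - 1 spaces between elements."""
    return k + (k - 1) * (d - 1)

def dilated_conv_output_size(i, k, s=1, p=0, d=1):
    """Output size of a dilated convolution, using the effective kernel size."""
    return (i + 2 * p - effective_kernel_size(k, d)) // s + 1

# Configuration of Figure 5.1: i = 7, k = 3, d = 2, s = 1, p = 0.
k_eff = effective_kernel_size(3, 2)
o = dilated_conv_output_size(7, 3, s=1, p=0, d=2)
print(k_eff, o)
```

With d = 1 the helper reduces to the regular convolution case, which is a quick sanity check on the assumed formula.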
Bibliography
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
G. S., Davis, A., Dean, J., Devin, M., et al. (2015). Tensorflow: Large-scale
machine learning on heterogeneous systems. Software available from
tensorflow.org.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron,
A., Bouchard, N., Warde-Farley, D., and Bengio, Y. (2012). Theano: new
features and speed improvements. arXiv preprint arXiv:1211.5590.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins,
G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: A cpu and
gpu math compiler in python. In Proc. 9th Python in Science Conf, pages
1-7.
Boureau, Y., Bach, F., LeCun, Y., and Ponce, J. (2010a). Learning mid-level
features for recognition. In Proc. International Conference on Computer Vision
and Pattern Recognition (CVPR'10). IEEE.
Boureau, Y., Ponce, J., and LeCun, Y. (2010b). A theoretical analysis of feature
pooling in vision algorithms. In Proc. International Conference on Machine
learning (ICML'10).
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the
locals: multi-way local pooling for image recognition. In Proc. International
Conference on Computer Vision (ICCV'11). IEEE.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014).
Semantic image segmentation with deep convolutional nets and fully connected
crfs. arXiv preprint arXiv:1412.7062.
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A matlab-like
environment for machine learning. In BigLearn, NIPS Workshop, number
EPFL-CONF-192376.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. Book in
preparation for MIT Press.
Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. (2016). Generating images
with recurrent adversarial networks. arXiv preprint arXiv:1602.05110.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.,
Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast
feature embedding. In Proceedings of the ACM International Conference on
Multimedia, pages 675-678. ACM.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification
with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097-1105.
Le Cun, Y., Bottou, L., and Bengio, Y. (1997). Reading checks with multilayer
graph transformer networks. In Acoustics, Speech, and Signal Processing,
1997. ICASSP-97., 1997 IEEE International Conference on, volume 1, pages
151-154. IEEE.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for
semantic segmentation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3431-3440.
Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A.,
Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A
generative model for raw audio. arXiv preprint arXiv:1609.03499.
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation
learning with deep convolutional generative adversarial networks. arXiv
preprint arXiv:1511.06434.
Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011).
On random weights and unsupervised feature learning. In L. Getoor and
T. Scheffer, editors, Proceedings of the 28th International Conference on Machine
Learning (ICML-11), ICML '11, pages 1089-1096, New York, NY, USA.
ACM.
Visin, F., Kastner, K., Courville, A. C., Bengio, Y., Matteucci, M., and Cho,
K. (2015). Reseg: A recurrent neural network for object segmentation.
Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated
convolutions. arXiv preprint arXiv:1511.07122.
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional
networks. In Computer Vision-ECCV 2014, pages 818-833. Springer.
Zeiler, M. D., Taylor, G. W., and Fergus, R. (2011). Adaptive deconvolutional
networks for mid and high level feature learning. In Computer Vision (ICCV),
2011 IEEE International Conference on, pages 2018-2025. IEEE.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
A Survey of Model Compression and Acceleration for Deep Neural Networks
Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION)
Yu Cheng is a Researcher from Microsoft AI & Research, One Microsoft
Way, Redmond, WA 98052, USA.
Duo Wang and Tao Zhang are with the Department of Automation,
Tsinghua University, Beijing 100084, China.
Pan Zhou is with the School of Electronic Information and Communications,
Huazhong University of Science and Technology, Wuhan 430074, China.
Abstract—Deep convolutional neural networks (CNNs) have
recently achieved great success in many visual recognition tasks.
However, existing deep neural network models are computationally
expensive and memory intensive, hindering their deployment
in devices with low memory resources or in applications with
strict latency requirements. Therefore, a natural thought is to
perform model compression and acceleration in deep networks
without significantly decreasing the model performance. During
the past few years, tremendous progress has been made in
this area. In this paper, we survey the recent advanced techniques
for compacting and accelerating CNN models.
These techniques are roughly categorized into four schemes:
parameter pruning and sharing, low-rank factorization,
transferred/compact convolutional filters, and knowledge distillation.
Methods of parameter pruning and sharing are described at
the beginning, after which the other techniques are introduced.
For each scheme, we provide insightful analysis regarding the
performance, related applications, advantages, and drawbacks.
Then we go through a few very recent additional
successful methods, for example, dynamic capacity networks and
stochastic depths networks. After that, we survey the evaluation
metrics, the main datasets used for evaluating the model
performance and recent benchmarking efforts. Finally, we conclude
this paper and discuss remaining challenges and possible directions
on this topic.
Index Terms—Deep Learning, Convolutional Neural Networks,
Model Compression and Acceleration.
I. INTRODUCTION
In recent years, deep neural networks have received
lots of attention, been applied to different applications and
achieved dramatic accuracy improvements in many tasks.
These works rely on deep networks with millions or even
billions of parameters, and the availability of GPUs with
very high computation capability plays a key role in their
success. For example, the work by Krizhevsky et al. [1]
achieved breakthrough results in the 2012 ImageNet Challenge
using a network containing 60 million parameters with five
convolutional layers and three fully-connected layers. Usually,
it takes two to three days to train the whole model on the
ImageNet dataset with an NVIDIA K40 machine. As another
example, the top face verification results on the Labeled
Faces in the Wild (LFW) dataset were obtained with networks
containing hundreds of millions of parameters, using a mix
of convolutional, locally-connected, and fully-connected layers
[2], [3]. It is also very time-consuming to train such a model
to get reasonable performance. In architectures that rely only
on fully-connected layers, the number of parameters can grow
to billions [4].
As larger neural networks with more layers and nodes
are considered, reducing their storage and computational cost
becomes critical, especially for some real-time applications
such as online learning and incremental learning. In addition,
recent years witnessed significant progress in virtual
reality, augmented reality, and smart wearable devices,
creating unprecedented opportunities for researchers to tackle
fundamental challenges in deploying deep learning systems to
portable devices with limited resources (e.g. memory, CPU,
energy, bandwidth). Efficient deep learning methods can have
significant impacts on distributed systems, embedded devices,
and FPGA for Artificial Intelligence. For example, the ResNet-50
[5] with 50 convolutional layers needs over 95MB memory
for storage and over 3.8 billion floating number multiplications
when processing an image. After discarding some redundant
weights, the network still works as usual but saves more than
75% of parameters and 50% computational time. For devices
like cell phones and FPGAs with only several megabytes of
resources, how to compact the models used on them is also
important.
Achieving these goals calls for joint solutions from many
disciplines, including but not limited to machine learning,
optimization, computer architecture, data compression, indexing,
and hardware design. In this paper, we review recent works
on compressing and accelerating deep neural networks, which
attracted a lot of attention from the deep learning community
and already achieved lots of progress in the past years.
We classify these approaches into four categories: parameter
pruning and sharing, low-rank factorization,
transferred/compact convolutional filters, and knowledge
distillation. The parameter pruning and sharing based methods
explore the redundancy in the model parameters and try to
remove the redundant and uncritical ones. Low-rank factorization
based techniques use matrix/tensor decomposition to
estimate the informative parameters of the deep CNNs. The
approaches based on transferred/compact convolutional filters
design special structural convolutional filters to reduce the
parameter space and save storage/computation. The knowledge
distillation methods learn a distilled model and train a more
compact neural network to reproduce the output of a larger
network.
In Table I, we briefly summarize these four types of
methods. Generally, the parameter pruning & sharing, low-rank
factorization and knowledge distillation approaches can
be used in DNN models with fully connected layers and
convolutional layers, achieving comparable performances. On
the other hand, methods using transferred/compact filters are
designed for models with convolutional layers only. Low-rank
factorization and transferred/compact filters based approaches
provide an end-to-end pipeline and can be easily implemented
in a CPU/GPU environment, which is straightforward, while
parameter pruning & sharing use different methods such as
vector quantization, binary coding and sparse constraints to
perform the task. Generally it will take several steps to achieve
the goal.
TABLE I
<<TABLE>>
<<FIGURE>>
Fig. 1. The three-stage compression method proposed in [10]: pruning,
quantization and encoding. The input is the original model and the output
is the compression model.
Regarding the training protocols, models based on parameter
pruning/sharing and low-rank factorization can be extracted
from pre-trained ones or trained from scratch, while the
transferred/compact filter and knowledge distillation models
only support training from scratch. These methods are
independently designed and complement each other. For example,
transferred layers and parameter pruning & sharing can be
used together, and model quantization & binarization can be
used together with low-rank approximations to achieve further
speedup. We will describe the details of each theme, their
properties, strengths and drawbacks in the following sections.
II. PARAMETER PRUNING AND SHARING
Early works showed that network pruning is effective in
reducing the network complexity and addressing the over-fitting
problem [6]. Pruning was originally introduced to reduce the
structure in neural networks and hence improve generalization;
it has since been widely studied as a way to compress DNN
models by removing parameters which are not crucial to the
model performance. These techniques can be further classified
into three sub-categories: quantization and binarization,
parameter sharing, and structural matrix.
A. Quantization and Binarization
Network quantization compresses the original network by
reducing the number of bits required to represent each weight.
Gong et al. [6] and Wu et al. [7] applied k-means scalar
quantization to the parameter values. Vanhoucke et al. [8]
showed that 8-bit quantization of the parameters can result
in significant speed-up with minimal loss of accuracy. The
work in [9] used 16-bit fixed-point representation in stochastic
rounding based CNN training, which significantly reduced
memory usage and float point operations with little loss in
classification accuracy.
The method proposed in [10] quantized the link weights
using weight sharing and then applied Huffman coding to the
quantized weights as well as the codebook to further reduce
the rate. As shown in Figure 1, it started by learning the
connectivity via normal network training, followed by pruning the
small-weight connections. Finally, the network was retrained
to learn the final weights for the remaining sparse connections.
This work achieved the state-of-the-art performance among all
parameter quantization based methods. It was shown in [11]
that Hessian weight could be used to measure the importance
of network parameters, and the authors proposed to minimize
the average Hessian-weighted quantization error when clustering
network parameters.
The extreme case is a 1-bit representation of each
weight, that is, binary weight neural networks. There are
many works that directly train CNNs with binary weights, for
instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks
[14]. The main idea is to directly learn binary weights or
activation during the model training. The systematic study in
[15] showed that networks trained with back propagation could
be resilient to specific weight distortions, including binary
weights.
Drawbacks: the accuracy of the binary nets is significantly
lowered when dealing with large CNNs such as GoogleNet.
Another drawback of such binary nets is that existing
binarization schemes are based on simple matrix approximations
and ignore the effect of binarization on the accuracy loss.
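As a rough illustration of the k-means scalar quantization idea mentioned in the Quantization and Binarization subsection, the sketch below clusters the entries of a weight matrix into a small codebook and stores only one index per weight. It is a minimal sketch under stated assumptions, not the procedure of [6], [7] or [10]: the function name kmeans_quantize, the uniform centroid initialisation, the fixed iteration count and the random test matrix are all illustrative choices.

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=16, n_iters=20):
    """Cluster scalar weights into a codebook with plain 1-D Lloyd iterations."""
    flat = weights.ravel()
    # Illustrative initialisation: centroids spread uniformly over the weight range.
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for c in range(n_clusters):
            members = flat[idx == c]
            if members.size:
                codebook[c] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    # uint8 indices are sufficient as long as n_clusters <= 256.
    return codebook, idx.reshape(weights.shape).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)).astype(np.float32)   # stand-in for a layer's weights
codebook, indices = kmeans_quantize(W, n_clusters=16)
W_hat = codebook[indices]                          # dequantized weights
print(np.mean((W - W_hat) ** 2))
```

With 16 clusters each weight needs only a 4-bit index plus a shared 16-entry codebook, instead of a 32-bit float. Whether the resulting error is acceptable depends on the layer and task, which the sketch does not evaluate.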
To account for the effect of binarization on the loss, the work in [16] proposed a proximal
Newton algorithm with diagonal Hessian approximation that
directly minimizes the loss with respect to the binary weights.
The work in [17] reduced the time on float point multiplication
in the training stage by stochastically binarizing weights and
converting multiplications in the hidden state computation to
sign changes.
B. Pruning and Sharing
Network pruning and sharing has been used both to reduce
network complexity and to address the over-fitting issue. An
early approach to pruning was the Biased Weight Decay
[18]. The Optimal Brain Damage [19] and the Optimal Brain
Surgeon [20] methods reduced the number of connections
based on the Hessian of the loss function, and their work
suggested that such pruning gave higher accuracy than
magnitude-based pruning such as the weight decay method. Those
methods were trained from scratch.
A recent trend in this direction is to prune redundant,
non-informative weights in a pre-trained CNN model. For
example, Srinivas and Babu [21] explored the redundancy
among neurons, and proposed a data-free pruning method to
remove redundant neurons. Han et al. [22] proposed to reduce
the total number of parameters and operations in the entire
network. Chen et al. [23] proposed a HashedNets model that
used a low-cost hash function to group weights into hash
buckets for parameter sharing. The deep compression method
in [10] removed the redundant connections and quantized the
weights, and then used Huffman coding to encode the
quantized weights. In [24], a simple regularization method based
on soft weight-sharing was proposed, which included both
quantization and pruning in one simple (re-)training procedure.
The above pruning schemes typically produce connections
pruning in CNNs.
There is also growing interest in training compact CNNs
with sparsity constraints. Those sparsity constraints are
typically introduced in the optimization problem as l0 or l1-norm
regularizers. The work in [25] imposed group sparsity
constraint on the convolutional filters to achieve structured
brain Damage, i.e., pruning entries of the convolution kernels
in a group-wise fashion. In [26], a group-sparse regularizer
on neurons was introduced during the training stage to learn
compact CNNs with reduced filters. Wen et al. [27] added a
structured sparsity regularizer on each layer to reduce trivial
filters, channels or even layers. In the filter-level pruning, all
the above works used l2-norm regularizers. The work in [28]
used l1-norm to select and prune unimportant filters.
Drawbacks: there are some potential issues of the pruning
and sharing. First, pruning with l1 or l2 regularization requires
more iterations to converge than usual. In addition, all
pruning criteria require manual setup of sensitivity for layers,
which demands fine-tuning of the parameters and could be
cumbersome for some applications.
C. Designing Structural Matrix
In architectures that contain fully-connected layers, it is
critical to explore this redundancy of parameters in fully-connected
layers, which is often the bottleneck in terms of
memory consumption. These network layers use the nonlinear
transforms f(x, M) = σ(Mx), where σ(·) is an element-wise
nonlinear operator, x is the input vector, and M is the m x n
matrix of parameters [29]. When M is a large general dense
matrix, the cost of storing the mn parameters and of computing
matrix-vector products is O(mn). Thus, an intuitive
way to prune parameters is to impose M as a parameterized
structural matrix. An m x n matrix that can be described
using much fewer parameters than mn is called a structured
matrix. Typically, the structure should not only reduce the
memory cost, but also dramatically accelerate the inference
and training stage via fast matrix-vector multiplication and
gradient computations.
Following this direction, the work in [30], [31] proposed a
simple and efficient approach based on circulant projections,
while maintaining competitive error rates. Given a vector r =
<<FORMULA>>, a circulant matrix R ∈ R^{d x d} is defined
as:
<<FORMULA>>
thus the memory cost becomes O(d) instead of O(d^2).
This circulant structure also enables the use of Fast Fourier
Transform (FFT) to speed up the computation. Given a d-dimensional
vector r, the above 1-layer circulant neural network
in Eq. 1 has time complexity of O(d log d).
In [32], a novel Adaptive Fastfood transform was introduced
to reparameterize the matrix-vector multiplication of fully
connected layers. The Adaptive Fastfood transform matrix
R ∈ R^{n x d} was defined as:
<<FORMULA>> (2)
where S, G and B are random diagonal matrices,
Π ∈ <<FORMULA>> is a random permutation matrix, and H denotes
the Walsh-Hadamard matrix. Reparameterizing a fully connected
layer with d inputs and n outputs using the Adaptive
Fastfood transform reduces the storage and the computational
costs from O(nd) to O(n) and from O(nd) to O(n log d),
respectively.
The work in [29] showed the effectiveness of the new
notion of parsimony in the theory of structured matrices. Their
proposed method can be extended to various other structured
matrix classes, including block and multi-level Toeplitz-like
[33] matrices related to multi-dimensional convolution [34].
Following this idea, [35] proposed a general structured
efficient linear layer for CNNs.
Drawbacks: one problem of this kind of approaches is that
the structural constraint will hurt the performance since the
constraint might bring bias to the model. On the other hand,
how to find a proper structural matrix is difficult. There is no
theoretical way to derive it out.
III. LOW-RANK FACTORIZATION AND SPARSITY
Convolution operations contribute the bulk of most
computations in deep CNNs, thus reducing the convolution layer
TABLE II
COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES
ON ILSVRC-2012.
<<TABLE>>
<<FIGURE>>
Fig. 2. A typical framework of the low-rank regularization method. The left
is the original convolutional layer and the right is the low-rank constraint
convolutional layer with rank-K.
would improve the compression rate as well as the overall
speedup. The convolution kernels can be viewed as a
4D tensor. Ideas based on tensor decomposition are derived from
the intuition that there is a significant amount of redundancy
in the 4D tensor; such decomposition is a particularly promising
way to remove that redundancy. Regarding the fully-connected layer,
it can be viewed as a 2D matrix and the low-rankness can also
help.
Low-rank filters have long been used to accelerate
convolution; for example, high dimensional DCT (discrete
cosine transform) and wavelet systems are constructed from
1D DCT transforms and 1D wavelets, respectively, using tensor
products. Learning separable 1D filters was introduced
by Rigamonti et al. [36], following the dictionary learning
idea. Regarding some simple DNN models, a few low-rank
approximation and clustering schemes for the convolutional
kernels were proposed in [37]. They achieved 2x speedup
for a single convolutional layer with 1% drop in classification
accuracy. The work in [38] proposed using different tensor
decomposition schemes, reporting a 4.5x speedup with 1%
drop in accuracy in text recognition.
The low-rank approximation was done layer by layer. The
parameters of one layer were fixed after it was done, and the
layers above were fine-tuned based on a reconstruction error
criterion. These are typical low-rank methods for compressing
2D convolutional layers, as described in Figure 2. Following
this direction, Canonical Polyadic (CP) decomposition
was proposed for the kernel tensors in [39]. Their work
used nonlinear least squares to compute the CP decomposition.
In [40], a new algorithm for computing the low-rank tensor
decomposition for training low-rank constrained CNNs from
scratch was proposed. It used Batch Normalization (BN) to
transform the activation of the internal hidden units. In general,
both the CP and the BN decomposition schemes in [40] (BN
Low-rank) can be used to train CNNs from scratch. However,
there are a few differences between them. For example, finding
the best low-rank approximation in CP decomposition is an
ill-posed problem, and the best rank-K (K is the rank number)
approximation may not exist sometimes, while for the BN
scheme, the decomposition always exists. We perform a simple
comparison of both methods shown in Table II. The actual
speedup and the compression rates are used to measure their
performances.
As we mentioned before, the fully connected layers can
be viewed as a 2D matrix and thus the above mentioned
methods can also be applied there. There are several classical
works on exploiting low-rankness in fully connected layers.
For instance, Misha et al. [41] reduced the number of dynamic
parameters in deep models using the low-rank method. [42]
explored a low-rank matrix factorization of the final weight
layer in a DNN for acoustic modeling. In [3], Lu et al. adopted
truncated SVD (singular value decomposition) to decompose
the fully connected layer for designing compact multi-task
deep learning architectures.
Drawbacks: low-rank approaches are straightforward for
model compression and acceleration. The idea complements
recent advances in deep learning, such as dropout, rectified
units and maxout. However, the implementation is not
that easy since it involves decomposition operation, which
is computationally expensive. Another issue is that current
methods perform low-rank approximation layer by layer, and
thus cannot perform global parameter compression, which
is important as different layers hold different information.
Finally, factorization requires extensive model retraining to
achieve convergence when compared to the original model.
IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS
CNNs are parameter efficient due to exploring the translation
invariant property of the representations to the input
image, which is the key to the success of training very deep
models without severe over-fitting. Although a strong theory
is currently missing, a large amount of empirical evidence
supports the notion that both the translation invariant property
and the convolutional weight sharing are important for good
predictive performance. The idea of using transferred convolutional
filters to compress CNN models is motivated by recent
works in [43], which introduced the equivariant group theory.
Let x be an input, Φ(·) be a network or layer and T(·) be the
transform matrix. The concept of equivariance is defined as:
<<FORMULA>> (3)
indicating that transforming the input x by the transform T(·)
and then passing it through the network or layer Φ(·) should
give the same result as first mapping x through the network
and then transforming the representation. Note that in Eq.
(3), the transforms T(·) and T'(·) are not necessarily the
same as they operate on different objects. According to this
theory, it is reasonable to apply transforms to layers or filters
Φ(·) to compress the whole network models. From empirical
observation, deep CNNs also benefit from using a large set of
convolutional filters by applying certain transform T(·) to a
small set of base filters since it acts as a regularizer for the TABLE III
model. A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND
Following this direction, there are many recent reworks
proposed to build a convolutional layer from a set of base <<TABLE>>
filters [43][46]. What they have in common is that the
transform T() lies in the family of functions that only operate
in the spatial domain of the convolutional filters. For example,
the work in [45] found that the lower convolution layers of
CNNs learned redundant filters to extract both positive and negative phase information of an input signal, and defined T(·) to be the simple negation function:
<<FORMULA>> (4)
where W_x is the basis convolutional filter and W_x^- is the filter consisting of the shifts whose activation is opposite to that of W_x and selected after the max-pooling operation. By doing this, the work in [45] can easily achieve a 2× compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer that improves the classification accuracy. The intuition is that a learning algorithm with a pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones.
In [46], it was observed that magnitudes of the responses from convolutional kernels had a wide diversity of pattern representations in the network, and it was not proper to discard weaker signals with a single threshold. Thus a multi-bias non-linearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform T(·) was defined as:
<<FORMULA>> (5)
where δ were the multi-bias factors. The work in [47] considered a combination of rotation by a multiple of 90° and horizontal/vertical flipping with:
<<FORMULA>> (6)
where W^{T_θ} was the transformation matrix which rotated the original filters with angle θ ∈ {90°, 180°, 270°}. In [43], the transform was generalized to any angle learned from data, and θ was directly obtained from data. Both works [47] and [43] can achieve good classification performance.
The work in [44] defined T(·) as the set of translation functions applied to 2D filters:
<<FORMULA>> (7)
where T(·, x, y) denoted the translation of the first operand by (x, y) along its spatial dimensions, with proper zero padding at borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy as a regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly varying their architectures to compress networks.
Table III briefly compares the performance of different methods with transferred convolutional filters, using VGGNet (16 layers) as the baseline model. The results are reported on CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is observed that they can achieve reduction in parameters with little or no drop in classification accuracy.
Drawbacks: there are a few issues to be addressed for approaches that apply transform constraints to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not thin/deep ones (like GoogleNet, Residual Net). Secondly, the transfer assumptions sometimes are too strong to guide the learning, making the results unstable in some cases.
Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace the loose and over-parametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing 3×3 convolution into two 1×1 convolutions was used in [48], which achieved significant acceleration on object recognition. SqueezeNet [49] was proposed to replace 3×3 convolution with 1×1 convolution, which created a compact neural network with about 50× fewer parameters and comparable accuracy when compared to AlexNet.
V. KNOWLEDGE DISTILLATION
To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first proposed by Caruana et al. [50]. They trained a compressed model on pseudo-data labeled by an ensemble of strong classifiers, so that it reproduced the output of the original larger network. But the work is limited to shallow models. The idea has been recently adopted in [51] as knowledge distillation (KD) to compress deep and wide networks into shallower ones, where the compressed model mimicked the function learned by the complex model. The main idea of KD based approaches is to shift knowledge from a large teacher model into a small one by learning the class distributions output via softmax.
The work in [52] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm, in which the student was penalized according to a softened version of the teacher's output. The framework compressed an ensemble of teacher networks into a student network of similar depth. The student was trained to predict the output and the classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [53] aimed to address the network compression problem by taking advantage of the depth of neural networks. It proposed an approach to train thin but deep networks, called FitNets, to compress wide and shallower (but still deep) networks. The method extended the idea to allow for thinner and deeper student models. In order to learn from the intermediate representations of the teacher network, FitNet made the student mimic the full feature maps
of the teacher. However, such assumptions are too strict since the capacities of teacher and student may differ greatly.
All the above approaches are validated on MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets, and experimental results show that these methods match or outperform the teacher's performance, while requiring notably fewer parameters and multiplications.
There are several extensions along this direction of distillation knowledge. The work in [54] trained a parametric student model to approximate a Monte Carlo teacher. The proposed framework used online training, and used deep neural networks for the student model. Different from previous works which represented the knowledge using the softened label probabilities, [55] represented the knowledge by using the neurons in the higher hidden layer, which preserved as much information as the label probabilities, but are more compact. The work in [56] accelerated the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. The techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [57] proposed Attention Transfer (AT) to relax the assumption of FitNet. They transferred the attention maps that are summaries of the full activations.
Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One of those is that KD can only be applied to classification tasks with softmax loss function, which hinders its usage. Another drawback is that the model assumptions sometimes are too strict to make the performance competitive with other types of approaches.
VI. OTHER TYPES OF APPROACHES
We first summarize the works utilizing attention-based methods. Note that the attention-based mechanism [58] can reduce computations significantly by learning to selectively focus or “attend” to a few, task-relevant input regions. The work in [59] introduced the dynamic capacity network (DCN) that combined two types of modules: the small sub-networks with low capacity, and the large ones with high capacity. The low-capacity sub-networks were active on the whole input to first find the task-relevant areas, and then the attention mechanism was used to direct the high-capacity sub-networks to focus on the task-relevant regions. By doing this, the size of the CNN model has been significantly reduced.
Following this direction, the work in [60] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a sparsely-gated mixture-of-experts layer (MoE). The MoE module consisted of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selected a sparse combination of the experts to process each input. In [61], dynamic deep neural networks (D2NN) were introduced, which were a type of feed-forward deep neural network that selected and executed a subset of D2NN neurons based on the input.
There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [44], [62]. Network architectures such as GoogleNet or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this idea. However, these architectures have not fully optimized the utilization of the computing resources inside the network. This problem was noted by Szegedy et al. [62] and motivated them to increase the depth and width of the network while keeping the computational budget constant.
The work in [63] targeted the Residual Network based model with a spatially varying computation time, called stochastic depth, which enabled the seemingly contradictory setup to train short networks and use deep networks at test time. It started with very deep networks, while during training, for each mini-batch, it randomly dropped a subset of layers and bypassed them with the identity function. Following this direction, the work in [64] proposed pyramidal residual networks with stochastic depth. In [65], Wu et al. proposed an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation. Veit et al. exploited convolutional networks with adaptive inference graphs to adaptively define their network topology conditioned on the input image [66].
Other approaches to reduce the convolutional overheads include using FFT based convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the pooling operations by a more general stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. Those works only aim to speed up the computation but not reduce the memory storage.
VII. BENCHMARKS, EVALUATION AND DATABASES
In the past five years the deep learning community has made great efforts in benchmark models. One of the most well-known models used in compression and acceleration for CNNs is Alexnet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures are used as baseline models in many works, including network in networks (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly used in several typical compression methods.
The standard criteria to measure the quality of model compression and acceleration are the compression and the speedup rates. Assume that a is the number of the parameters in the original model M and a* is that of the compressed model M*, then the compression rate α(M, M*) of M over M* is
$\alpha(M, M^*) = \frac{a}{a^*}$ (8)
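As a small worked illustration of the compression rate in Eq. (8), the sketch below counts the parameters of a fully connected network and divides the original count by the compressed count. The hidden sizes 300 and 100 come from the description of LeNet-300-100 above; the 784 inputs, the 10 outputs, and the narrower "compressed" layer sizes are assumptions made only for this example, and the helper names `dense_param_count` and `compression_rate` are hypothetical.

```python
# Sketch: parameter counting and the compression rate a / a* of Eq. (8).

def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def compression_rate(a, a_star):
    """Compression rate of the original model (a parameters) over the
    compressed model (a_star parameters)."""
    if a_star <= 0:
        raise ValueError("compressed model must have at least one parameter")
    return a / a_star


# LeNet-300-100: two hidden layers with 300 and 100 neurons (from the text).
# ASSUMPTION: 784 inputs and 10 outputs; the text does not state them.
original = dense_param_count([784, 300, 100, 10])

# HYPOTHETICAL compressed model with narrower hidden layers.
compressed = dense_param_count([784, 100, 30, 10])

rate = compression_rate(original, compressed)
print(f"a = {original}, a* = {compressed}, compression rate = {rate:.2f}")
```

Counting biases along with weights is a choice of this sketch; reported parameter counts sometimes include only the weights, which changes the rate slightly.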
TABLE IV
SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.
<<TABLE>>
low-rank factorization [40]
Network in network [73]: low-rank factorization [40]
Residual networks [75]: compact filters [49], stochastic depth [63], parameter sharing [24]
parameter pruning [20], [22]
Another widely used measurement is the index space saving defined in several papers [30], [35] as
<<FORMULA>> (9)
where a and a* are the number of the dimension of the index space in the original model and that of the compressed model, respectively.
Similarly, given the running time s of M and s* of M*, the speedup rate <<FORMULA>> is defined as:
<<FORMULA>> (10)
Most work used the average training time per epoch to measure the running time, while in [30], [35], the average testing time was used. Generally, the compression rate and speedup rate are highly correlated, as smaller models often result in faster computation for both the training and the testing stages.
Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computational time. However, for different applications with different CNN designs, the relation between parameter size and computational time may be different. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers; while for image classification tasks, floating-point operations are mainly in the first few convolutional layers since each filter is convolved with the whole image, which is usually very large at the beginning. Thus compression and acceleration of the network should focus on different types of layers for different applications.
VIII. DISCUSSION AND CHALLENGES
In this paper, we summarized recent efforts on compressing and accelerating deep neural networks (DNNs). Here we discuss more details about how to choose different compression approaches, and possible challenges/solutions in this area.
A. General Suggestions
There is no golden rule to measure which approach is the best. How to choose the proper method really depends on the applications and requirements. Here is some general guidance we can provide:
• If the applications need compacted models from pre-trained models, you can choose either pruning & sharing or low rank factorization based methods. If you need end-to-end solutions for your problem, the low rank and transferred convolutional filters approaches could be considered.
• For applications in some specific domains, methods with human prior (like the transferred convolutional filters, structural matrix) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters could work well as medical images (like organs) do have the rotation transformation property.
• Usually the approaches of pruning & sharing could give a reasonable compression rate while not hurting the accuracy. Thus for applications which require stable model accuracy, it is better to utilize pruning & sharing.
• If your problem involves small/medium size datasets, you can try the knowledge distillation approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust on datasets which are not large.
• As we mentioned before, techniques of the four groups are orthogonal. It is reasonable to combine two or three of them to maximize the performance. For some specific applications, like object detection, which requires both convolutional and fully connected layers, you can compress the convolutional layers with a low rank based method and the fully connected layers with a pruning technique.
B. Technique Challenges
Techniques for deep model compression and acceleration are still in the early stage and the following challenges still need to be addressed.
• Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, it should provide more plausible ways to configure the compressed models.
• Pruning is an effective way to compress and accelerate CNNs. The current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, channel pruning can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging because removing channels might dramatically change the input of the following layer.
• As we mentioned before, methods of structural matrix and transferred convolutional filters impose prior human knowledge to the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of that prior knowledge.
• The methods of knowledge distillation provide many benefits such as directly accelerating the model without special hardware or implementations. It is still worth developing KD-based approaches and exploring how to improve their performance.
• Hardware constraints in various small platforms (e.g., mobile, robotic, self-driving car) are still a major problem
to hinder the extension of deep CNNs. How to make full use of the limited computational resources and how to design special compression methods for such platforms are still challenges that need to be addressed.
• Despite the great achievements of these compression approaches, the black box mechanism is still the key barrier to the adoption. Exploring the knowledge interpretability is still an important problem.
C. Possible Solutions
To solve the hyper-parameters configuration problem, we can rely on the recent learning-to-learn strategies [76], [77]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement learning to efficiently sample the design space and improve the model compression has also been tried [78].
Channel pruning provides the efficiency benefit on both CPU and GPU because no special implementation is required. But it is also challenging to handle the input configuration. One possible solution is to use the training-based channel pruning methods [79], which focus on imposing sparse constraints on weights during training. However, training from scratch for such methods is costly for very deep CNNs. In [80], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.
Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the knowledge distillation (KD) approaches. Instead of directly reducing and transferring parameters, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task [81], [82]. The intuition is that if a neuron is activated in certain regions or samples, that implies these regions or samples share some common properties that may relate to the task.
For methods with the convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied on 2D filters or matrix, and 2) learn the transformation jointly with all the model parameters.
Regarding the use of CNNs in small platforms, proposing some general/unified approaches is one direction. Wang et al. [83] presented a feature map dimensionality reduction method by excavating and removing redundancy in feature maps generated from different filters, which could also preserve intrinsic information of the original network. The idea can be applied to make CNNs more applicable for different platforms. The work in [84] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work in mobile devices.
Besides the classification task, people are also adapting the compacted models in other tasks [85]-[87]. We would like to see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).
IX. ACKNOWLEDGMENTS
The authors would like to thank the reviewers and broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying the paper. This research is supported by National Science Foundation of China with Grant number 61401169.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, “Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification,” CoRR, vol. abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale distributed deep networks,” in NIPS, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep convolutional networks using vector quantization,” CoRR, vol. abs/1412.6115, 2014.
[7] Y. W. Q. H. Jiaxiang Wu, Cong Leng and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on cpus,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML'15, 2015, pp. 1737-1746.
[10] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” International Conference on Learning Representations (ICLR), 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network quantization,” CoRR, vol. abs/1612.01543, 2016.
[12] M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 3123-3131.
[13] M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1,” CoRR, vol. abs/1602.02830, 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in ECCV, 2016.
[15] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, “Deep neural networks are robust to weight binarization and other non-linear distortions,” CoRR, vol. abs/1606.01981, 2016.
[16] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks,” CoRR, vol. abs/1611.01600, 2016.
[17] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks with few multiplications,” CoRR, vol. abs/1510.03009, 2015.
[18] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network construction with back-propagation,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177-185.
[19] Y. L. Cun, J. S. Denker, and S. A. Solla, “Advances in neural information processing systems 2,” D. S. Touretzky, Ed., 1990, ch. Optimal Brain Damage, pp. 598-605.
[20] B. Hassibi, D. G. Stork, and S. C. R. Com, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993, pp. 164-171.
[21]S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural [43]T. S. Cohen and M. Welling, “Group equivariant convolutional net-
networks,” inProceedings of the British Machine Vision Conference works,”arXiv preprint arXiv:1602.07576, 2016.
2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. [44]S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural
31.131.12. networks,” inAdvances In Neural Information Processing Systems, 2016,
[22]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and pp. 10821090.
connections for efficient neural networks,” inProceedings of the 28th [45]W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and
International Conference on Neural Information Processing Systems, ser. improving convolutional neural networks via concatenated rectified
NIPS15, 2015. linear units,”arXiv preprint arXiv:1603.05201, 2016.
[23]W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Com- [46]H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in
pressing neural networks with the hashing trick.” JMLR Workshop and deep neural networks,”arXiv preprint arXiv:1604.00676, 2016.
Conference Proceedings, 2015. [47]S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic
[24]K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural symmetry in convolutional neural networks,” inProceedings of the
network compression,”CoRR, vol. abs/1702.04008, 2017. 33rd International Conference on International Conference on Machine
[25]V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain Learning - Volume 48, ser. ICML16, 2016.
damage,” in2016 IEEE Conference on Computer Vision and Pattern [48]C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, resnet and the impact of residual connections on learning.”CoRR, vol.
pp. 25542564. abs/1602.07261, 2016.
[26]H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact [49]B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified,
cnns,” inEuropean Conference on Computer Vision, Amsterdam, the small, low power fully convolutional neural networks for real-time object
Netherlands, October 2016, pp. 662677. detection for autonomous driving,”CoRR, vol. abs/1612.01051, 2016.
[27]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured [50]C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,”ˇ
sparsity in deep neural networks,” inAdvances in Neural Information inProceedings of the 12th ACM SIGKDD International Conference on
Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, Knowledge Discovery and Data Mining, ser. KDD 06, 2006, pp. 535
I. Guyon, and R. Garnett, Eds., 2016, pp. 20742082. 541.
[28]H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning [51]J. Ba and R. Caruana, “Do deep nets really need to be deep?” in
filters for efficient convnets,”CoRR, vol. abs/1608.08710, 2016. Advances in Neural Information Processing Systems 27: Annual Confer-
[29]V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for ence on Neural Information Processing Systems 2014, December 8-13
small-footprint deep learning,” inAdvances in Neural Information Pro- 2014, Montreal, Quebec, Canada, 2014, pp. 26542662.
cessing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, [52]G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
and R. Garnett, Eds., 2015, pp. 30883096. neural network,”CoRR, vol. abs/1503.02531, 2015.
[30]Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. [53]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
Chang, “An exploration of parameter redundancy in deep networks with Y. Bengio, “Fitnets: Hints for thin deep nets,”CoRR, vol. abs/1412.6550,
circulant projections,” inInternational Conference on Computer Vision 2014.
(ICCV), 2015. [54]A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling,
[31]Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and “Bayesian dark knowledge,” inAdvances in Neural Information Process-
S. Chang, “Fast neural networks with circulant projections,”CoRR, vol. ing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
abs/1502.03436, 2015. and R. Garnett, Eds., 2015, pp. 34203428.
[32]Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, [55]P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression
and Z. Wang, “Deep fried convnets,” inInternational Conference on by distilling knowledge from neurons,” inProceedings of the Thirtieth
Computer Vision (ICCV), 2015. AAAI Conference on Artificial Intelligence, February 12-17, 2016,
[33]J. Chun and T. Kailath,Generalized Displacement Structure for Block- Phoenix, Arizona, USA., 2016, pp. 35603566.
Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidel- [56]T. Chen, I. J. Goodfellow, and J. Shlens, “Net2net: Accelerating learning
berg: Springer Berlin Heidelberg, 1991, pp. 215236. via knowledge transfer,”CoRR, vol. abs/1511.05641, 2015.
[34]M. V. Rakhuba and I. V. Oseledets, “Fast multidimensional convolution [57]S. Zagoruyko and N. Komodakis, “Paying more attention to attention:
in low-rank tensor formats via cross approximation,”SIAM J. Scientific Improving the performance of convolutional neural networks via atten-
Computing, vol. 37, no. 2, 2015. tion transfer,”CoRR, vol. abs/1612.03928, 2016.
[35]M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, “Acdc: [58]D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
A structured efficient linear layer,” inInternational Conference on jointly learning to align and translate,”CoRR, vol. abs/1409.0473, 2014.
Learning Representations (ICLR), 2016. [59]A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and
[36]R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, “Learning separable A. C. Courville, “Dynamic capacity networks,” inProceedings of the
filters,” in2013 IEEE Conference on Computer Vision and Pattern 33nd International Conference on Machine Learning, ICML 2016, New
Recognition, Portland, OR, USA, June 23-28, 2013, 2013, pp. 2754 York City, NY, USA, June 19-24, 2016, 2016, pp. 25492558.
2761. [60]N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton,
[37]E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, and J. Dean, “Outrageously large neural networks: The sparsely-gated
“Exploiting linear structure within convolutional networks for efficient mixture-of-experts layer,” 2017.
evaluation,” inAdvances in Neural Information Processing Systems 27, [61]D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and
Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. J. Odobez, “Deep dynamic neural networks for multimodal gesture
Weinberger, Eds., 2014, pp. 12691277. segmentation and recognition,”IEEE Trans. Pattern Anal. Mach. Intell.,
[38]M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional vol. 38, no. 8, pp. 15831597, 2016.
neural networks with low rank expansions,” inProceedings of the British [62]C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
Machine Vision Conference. BMVA Press, 2014. V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
[39]V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempit- inComputer Vision and Pattern Recognition (CVPR), 2015.
sky, “Speeding-up convolutional neural networks using fine-tuned cp- [63]G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger,Deep
decomposition,”CoRR, vol. abs/1412.6553, 2014. Networks with Stochastic Depth, 2016.
[40]C. Tai, T. Xiao, X. Wang, and W. E, “Convolutional neural networks [64]Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal residual
with low-rank regularization,” vol. abs/1511.06067, 2015. networks with separated stochastic depth,”CoRR, vol. abs/1612.01230,
[41]M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, 2016.
“Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds., 2013, pp. 2148–2156. [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1053.pdf
[42] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2013.
[65] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, “Blockdrop: Dynamic inference paths in residual networks,” in CVPR, 2018.
[66] A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,” 2018.
[67] M. Mathieu, M. Henaff, and Y. Lecun, Fast training of convolutional networks through FFTs, 2014.
[68] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 4013–4021.
[69] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Feris, “S3pool: Pooling with stochastic spatial sampling,” CoRR, vol. abs/1611.05138, 2016.
[70] F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving pooling in deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[71] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, 1998, pp. 2278–2324.
[72] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, 2014.
[73] M. Lin, Q. Chen, and S. Yan, “Network in network,” in ICLR, 2014.
[74] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[75] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[76] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in Neural Information Processing Systems (NIPS), 2016.
[77] D. Ha, A. Dai, and Q. Le, “Hypernetworks,” in International Conference on Learning Representations 2016, 2016.
[78] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in The European Conference on Computer Vision (ECCV), September 2018.
[79] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,” pp. 2270–2278, 2016.
[80] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[81] Z. Huang and N. Wang, “Data-driven sparse structure selection for deep neural networks,” ECCV, 2018.
[82] Y. Chen, N. Wang, and Z. Zhang, “Darkrank: Accelerating deep metric learning via cross sample similarities transfer,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 2852–2859.
[83] Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond filters: Compact feature map for portable deep model,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 3703–3711.
[84] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” CoRR, vol. abs/1511.06530, 2015.
[85] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 742–751.
[86] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[87] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 3296–3297.
[88] Y. Cheng, Q. Fan, S. Pankanti, and A. Choudhary, “Temporal sequence modeling for video event detection,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[89] L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong, M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X. Yu, “Ibm research and columbia university trecvid-2012 multimedia event detection (med), multimedia event recounting (mer), and semantic indexing (sin) systems,” 2012.

Yu Cheng (yu.cheng@microsoft.com) currently is a Researcher at Microsoft. Before that, he was a Research Staff Member at IBM T.J. Watson Research Center. Yu got his Ph.D. from Northwestern University in 2015 and bachelor from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in the deep generative model, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. Currently he is pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. Currently his research interests are about deep learning, particularly in few-shot learning and deep generative models. He also works on a lot of applications in computer vision and robotics vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with School of Electronic Information and Communications, Wuhan, China. He received his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and a M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interest includes big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University. He serves as the Associate Dean, School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Analysis and Design of Echo State Networks
Mustafa C. Ozturk
can@cnel.ufl.edu
Dongming Xu
dmxu@cnel.ufl.edu
Jose C. Principe
principe@cnel.ufl.edu
Computational NeuroEngineering Laboratory, Department of Electrical and
Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.
The design of echo state network (ESN) parameters relies on the selec-
tion of the maximum eigenvalue of the linearized system around zero
(spectral radius). However, this procedure does not quantify in a sys-
tematic manner the performance of the ESN in terms of approximation
error. This article presents a functional space approximation framework
to better understand the operation of ESNs and proposes an information-
theoretic metric, the average entropy of echo states, to assess the richness
of the ESN dynamics. Furthermore, it provides an interpretation of the
ESN dynamics rooted in system theory as families of coupled linearized
systems whose poles move according to the input signal dynamics. With
this interpretation, a design methodology for functional approximation
is put forward where ESNs are designed with uniform pole distributions
covering the frequency spectrum to abide by the richness metric, irre-
spective of the spectral radius. A single bias parameter at the ESN input,
adapted with the modeling error, configures the ESN spectral radius to
the input-output joint space. Function approximation examples compare
the proposed design methodology versus the conventional design.
1 Introduction
Dynamic computational models require the ability to store and access the
time history of their inputs and outputs. The most common dynamic neural
architecture is the time-delay neural network (TDNN) that couples delay
lines with a nonlinear static architecture where all the parameters (weights)
are adapted with the backpropagation algorithm. The conventional delay
line utilizes ideal delay operators, but delay lines with local first-order
recursive filters have been proposed by Werbos (1992) and extensively studied
in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliveira, 1993).
Chains of first-order integrators are interesting because they effectively
decrease the number of delays necessary to create time embeddings
(Principe, 2001). Recurrent neural networks (RNNs) implement a differ-
ent type of embedding that is largely unexplored. RNNs are perhaps the
most biologically plausible of the artificial neural network (ANN) models
(Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990),
but are not well understood theoretically (Siegelmann & Sontag, 1991;
Siegelmann, 1993; Kremer, 1995). One of the main practical problems with
RNNs is the difficulty of adapting the system weights. Various algorithms,
such as backpropagation through time (Werbos, 1990) and real-time recur-
rent learning (Williams & Zipser, 1989), have been proposed to train RNNs;
however, these algorithms suffer from computational complexity, resulting
in slow training, complex performance surfaces, the possibility of instabil-
ity, and the decay of gradients through the topology and time (Haykin,
1998). The problem of decaying gradients has been addressed with spe-
cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alter-
native second-order training methods based on extended Kalman filtering
(Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov,
Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp
et al., 1998) provide more reliable performance and have enabled practical
applications in identification and control of dynamical systems (Kechriotis,
Zervas, & Manolakos, 1994; Puskorius & Feldkamp, 1994; Delgado,
Kambhampati, & Warwick, 1995).
Recently, two new recurrent network topologies have been proposed: the
echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and
the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram,
2002). ESNs possess a highly interconnected and recurrent topology of
nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001)
and contain information about the history of input and output patterns.
The outputs of these internal PEs (echo states) are fed to a memoryless but
adaptive readout network (generally linear) that produces the network out-
put. The interesting property of ESN is that only the memoryless readout is
trained, whereas the recurrent topology has fixed connection weights. This
reduces the complexity of RNN training to simple linear regression while
preserving a recurrent topology, but obviously places important constraints
in the overall architecture that have not yet been fully studied. Similar ideas
have been explored independently by Maass and formalized in the LSM
architecture. LSMs, although formulated quite generally, are mostly im-
plemented as neural microcircuits of spiking neurons (Maass et al., 2002),
whereas ESNs are dynamical ANN models. Both attempt to model biolog-
ical information processing using similar principles. We focus on the ESN
formulation in this letter.
The echo state condition is defined in terms of the spectral radius (the
largest among the absolute values of the eigenvalues of a matrix, denoted
by <<\|\cdot\|>>) of the reservoir's weight matrix (<<\|W\| < 1>>). This condition states
that the dynamics of the ESN is uniquely controlled by the input, and the
effect of the initial states vanishes. The current design of ESN parameters
relies on the selection of spectral radius. However, there are many possible
weight matrices with the same spectral radius, and unfortunately they do
not all perform at the same level of mean square error (MSE) for functional
approximation. A similar problem exists in the design of the LSM. LSMs
have been shown to possess universal approximation given the separation
property (SP) for the liquid (reservoir in ESNs) and the approximation
property (AP) for the readout (Maass et al., 2002). SP is quantified by a
kernel-quality measure proposed in Maass, Legenstein, and Bertschinger
(2005) that is based on the rank of a matrix formed by the system states
corresponding to different input signals. The kernel quality is a measure
for the complexity and diversity of nonlinear operations carried out by the
liquid on its input stream in order to boost the classification power of a
subsequent linear decision hyperplane (Maass et al., 2005). A variation of
SP has been proposed in Bertschinger and Natschläger (2004), and it has
been argued that complex calculations can be best carried out by networks
on the boundary between ordered and chaotic dynamics.
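The point that many different weight matrices share one spectral radius can be made concrete with a short sketch. The Python/NumPy code below is an illustration only: the helper names `spectral_radius` and `scale_to_spectral_radius`, the Gaussian random matrices, and the target value 0.9 are assumptions introduced here, not part of the design procedure described in the text.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def scale_to_spectral_radius(W, target):
    """Rescale W by a scalar so that its spectral radius equals `target`."""
    rho = spectral_radius(W)
    if rho == 0.0:
        raise ValueError("W has zero spectral radius and cannot be rescaled")
    return W * (target / rho)

# Two different random reservoirs (hypothetical example) with the same radius.
rng = np.random.default_rng(0)
W_a = scale_to_spectral_radius(rng.standard_normal((50, 50)), 0.9)
W_b = scale_to_spectral_radius(rng.standard_normal((50, 50)), 0.9)
print(spectral_radius(W_a), spectral_radius(W_b))
```

Both matrices satisfy the echo state condition in the sense used above, yet their entries and eigenvalue distributions differ, which is why the spectral radius alone underdetermines the reservoir.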
In this letter, we are interested in studying the ESN for functional
approximation (filters that map input functions u(·) of time onto output
functions y(·) of time). We see two major shortcomings with the current ESN
approach that uses the echo state condition as a design principle. First, the
impact of fixed
reservoir parameters for function approximation means that the informa-
tion about the desired response is conveyed only to the output projection.
This is not optimal, and strategies to select different reservoirs for different
applications have not been devised. Second, imposing a constraint only on
the spectral radius is a weak condition to properly set the parameters of
the reservoir, as experiments show (different randomizations with the same
spectral radius perform differently for the same problem; see Figure 2).
This letter aims to address these two problems by proposing a frame-
work, a metric, and a design principle for ESNs. The framework is a signal
processing interpretation of basis and projections in functional spaces to
describe and understand the ESN architecture. According to this interpre-
tation, the ESN states implement a set of basis functionals (representation
space) constructed dynamically by the input, while the readout simply
projects the desired response onto this representation space. The metric
to describe the richness of the ESN dynamics is an information-theoretic
quantity, the average state entropy (ASE). Entropy measures the amount of
information contained in a given random variable (Shannon, 1948). Here,
the random variable is the instantaneous echo state from which the en-
tropy for the overall state (vector) is estimated. The probability density
function (pdf) in a differential geometric framework should be thought of
as a volume form; that is, in our case, the pdf of the state vector describes
the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946)
established information as a coordinate free metric in the state manifold.
Therefore, entropy becomes a global descriptor of information that quanti-
fies the volume of the manifold defined by the random variable. Due to the
time dependency of the states, the state entropy averaged over time (ASE)
is an appropriate estimate of the volume of the state manifold.
The design principle specifies that one should consider independently
the correlation among the basis and the spectral radius. In the absence of any
information about the desired response, the ESN states should be designed
with the highest ASE, independent of the spectral radius. We interpret the
ESN dynamics as a combination of time-varying linear systems obtained
from the linearization of the ESN nonlinear PE in a small, local neighbor-
hood of the current state. The design principle means that the poles of the
linearized ESN reservoir should have uniform pole distributions to gener-
ate echo states with the most diverse pole locations (which correspond to
the uniformity of time constants). Effectively, this will create the least cor-
related bases for a given spectral radius, which corresponds to the largest
volume spanned by the basis set. When the designer has no other informa-
tion about the desired response to set the basis, this principle distributes
the system's degrees of freedom uniformly in space. It approximates for
ESNs the well-known property of orthogonal bases. The unresolved issue
that ASE does not quantify is how to set the spectral radius, which depends
again on the desired mapping. The concept of memory depth as explained
in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the
issues associated with the spectral radius. The correlation time of the de-
sired response (as estimated by the first zero of the autocorrelation function)
gives an indication of the type of spectral radius required (long correlation
time requires high spectral radius). Alternatively, a simple adaptive bias is
added at the ESN input to control the spectral radius integrating the infor-
mation from the input-output joint space in the ESN bases. For sigmoidal
PEs, the bias adjusts the operating points of the reservoir PEs, which has
the net effect of adjusting the volume of the state manifold as required to
approximate the desired response with a small error. This letter shows that
ESNs designed with this strategy obtain systematically better results in a
set of experiments when compared with the conventional ESN design.
2 Analysis of Echo State Networks
2.1 Echo States as Bases and Projections. Let us consider the architecture
and recursive update equation of a typical ESN more closely.
Consider the recurrent discrete-time neural network given in Figure 1
with M input units, N internal PEs, and L output units. The value of
the input unit at time n is <<u(n)=[u1 (n),u2 (n),...,uM (n)]^T>>, of internal
units is <<x(n)=[x1 (n),x2 (n),...,xN (n)]^T>>, and of output units is
<<y(n)=[y1 (n),y2 (n),...,yL (n)]^T>>. The connection weights are given in an N×M
weight matrix <<W_in =(w_in_ij)>> for connections between the input and the
internal PEs, in an N×N matrix <<W=(w_ij)>> for connections between the internal
PEs, in an L×N matrix <<W_out =(w_out_ij)>> for connections from PEs to the
output units, and in an N×L matrix <<W_back =(w_back_ij)>> for the connections
that project back from the output to the internal PEs (Jaeger, 2001).
<<FIGURE>>
Figure 1: An echo state network (ESN), showing the input layer, the dynamical
reservoir, and the readout. ESN is composed of two parts: a fixed-weight
(<<\|W\| < 1>>) recurrent network and a linear readout. The recurrent network
is a reservoir of highly interconnected dynamical components, states of
which are called echo states. The memoryless linear readout is trained to
produce the output.
The activation of the internal PEs (echo state) is updated according to
<<x(n+1)=f(W_in u(n+1)+Wx(n)+W_back y(n))>>, (2.1)
where <<f=(f1 ,f2 ,...,fN )>> are the internal PEs' activation functions. Here,
all f_i's are hyperbolic tangent functions <<(e^x - e^{-x})/(e^x + e^{-x})>>.
The output from the readout network is computed according to
<<y(n+1)=f_out (W_out x(n+1))>>, (2.2)
where <<f_out =(f_out_1 ,f_out_2 ,...,f_out_L )>> are the output units'
nonlinear functions (Jaeger, 2001, 2002a).
Generally, the readout is linear so f_out is identity.
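The update and readout equations 2.1 and 2.2 can be sketched in a few lines of Python/NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `esn_states` and `linear_readout`, the zero initial state, the network sizes, and the random example weights are all hypothetical choices made here.

```python
import numpy as np

def esn_states(W_in, W, u, W_back=None, y_fed_back=None):
    """Iterate x(n+1) = tanh(W_in u(n+1) + W x(n) + W_back y(n)).

    u has shape (T, M); the returned state matrix has shape (T, N).
    The feedback term is used only if W_back and y_fed_back are both given.
    """
    T, N = u.shape[0], W.shape[0]
    x = np.zeros(N)                      # assumed initial state x(0) = 0
    states = np.empty((T, N))
    for n in range(T):
        net = W_in @ u[n] + W @ x
        if W_back is not None and y_fed_back is not None and n > 0:
            net = net + W_back @ y_fed_back[n - 1]
        x = np.tanh(net)
        states[n] = x
    return states

def linear_readout(W_out, states):
    """y(n) = W_out x(n) for every stored state (f_out is the identity)."""
    return states @ W_out.T

# Hypothetical example: 1 input, 20 internal PEs, 1 output, 50 time steps.
rng = np.random.default_rng(1)
M, N, L, T = 1, 20, 1, 50
W_in = rng.choice([-1.0, 1.0], size=(N, M))
W = rng.standard_normal((N, N))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.8
u = np.sin(2.0 * np.pi * np.arange(T) / 25.0).reshape(T, M)
X = esn_states(W_in, W, u)
Y = linear_readout(rng.standard_normal((L, N)), X)
```

Only `W_out` would be adapted in training; `W_in`, `W`, and `W_back` stay fixed after they are drawn.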
ESNs resemble the RNN architecture proposed in Puskorius and
Feldkamp (1996) and also used by Sanchez (2004) in brain-machine
interfaces. The critical difference is the dimensionality of the hidden re-
current PE layer and the adaptation of the recurrent weights. We submit
that the ideas of approximation theory in functional spaces (bases and pro-
jections), so useful in adaptive signal processing (Principe, 2001), should
be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued
function of a real-valued vector
<<u(t)=[u1 (t),u2 (t),...,uM (t)]^T>>.
In functional approximation, the goal is to estimate the behavior of h(u(t))
as a combination of simpler functions ϕi (t), called the basis functionals,
such that its approximant, ĥ(u(t)), is given by
<<\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(t)>>.
Here, the a_i's are the projections of h(u(t)) onto each basis function. One of
the central questions in practical functional approximation is how to choose
the set of bases to approximate a given desired signal. In signal processing,
the choice normally goes for a complete set of orthogonal bases, independent
of the input. When the basis set is complete and can be made as large
as required, fixed bases work wonders (e.g., Fourier decompositions). In
neural computing, the basic idea is to derive the set of bases from the
input signal through a multilayered architecture. For instance, consider a
single hidden layer TDNN with N PEs and a linear output. The hidden-layer
PE outputs can be considered a set of nonorthogonal basis functionals
dependent on the input,
<<\varphi_i(u(t)) = g(\sum_j b_ij u_j(t))>>,
where the b_ij's are the input layer weights, and g is the PE nonlinearity.
The approximation produced by the TDNN is then
<<\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(u(t)) = \sum_{i=1}^{N} a_i g(\sum_j b_ij u_j(t))>>, (2.3)
where the a_i's are the weights of the output layer. Notice that the b_ij's adapt
the bases and the a_i's adapt the projection in the projection space. Here the
goal is to restrict the number of bases (number of hidden layer PEs) because
their number is coupled with the number of parameters to adapt, which
has an impact on generalization and training set size, for example. Usually,
since all of the parameters of the network are adapted, the best basis in the
joint (input and desired signals) space as well as the best projection can be
achieved and represents the optimal solution. The output of the TDNN is
a linear combination of its internal representations, but to achieve a basis
set (even if nonorthogonal), linear independence among the ϕi (u(t))'s must
be enforced. Ito, Shah and Poon, and others have shown that this is indeed
the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
the scope of this article.
The ESN (and the RNN) architecture can also be studied in this frame-
work. The states of equation 2.1 correspond to the basis set, which are
recursively computed from the input, output, and previous states through
W_in, W, and W_back. Notice, however, that none of these weight matrices is
adapted, that is, the functional bases in the ESN are uniquely defined by the
input and the initial selection of weights. In a sense, ESNs are trading the
adaptive connections in the RNN hidden layer for a brute force approach
of creating fixed diversified dynamics in the hidden layer.
For an ESN with a linear readout network, the output equation
(<<y(n+1)=W_out x(n+1)>>) has the same form as equation 2.3, where the ϕi's and
a_i's are replaced by the echo states and the readout weights, respectively.
The readout weights are adapted on the training data, which means that the
ESN is able to find the optimal projection in the projection space, just like
the RNN or the TDNN.
A similar perspective of basis and projections for information processing
in biological networks has been proposed by Pouget and Sejnowski (1997).
They explored the possibility that the response of neurons in parietal cortex
serves as basis functions for the transformations from the sensory input
to the motor responses. They proposed that “the role of spatial represen-
tations is to code the sensory inputs and posture signals in a format that
simplifies subsequent computation, particularly in the generation of motor
commands”.
The central issue in ESN design is exactly the nonadaptive nature of
the basis set. Parameter sets in the reservoir that provide linearly inde-
pendent states and possess a given spectral radius may define drastically
different projection spaces because the correlation among the bases is not
constrained. A simple experiment was designed to demonstrate that the se-
lection of the ESN parameters by constraining the spectral radius is not the
most suitable for function approximation. Consider a 100-unit ESN where
the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let
the ESN generate the seventh power of the input signal. Different realiza-
tions of a randomly connected 100-unit ESN were constructed where the
entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025,
and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input
weights are set to +1 or −1 with equal probabilities, and W_back is set to
zero. Input is applied for 300 time steps, and the echo states are calculated
using equation 2.1. The next step is to train the linear readout. One method
to determine the optimal output weight matrix, W_out, in the mean square
error (MSE) sense (where MSE is defined by <<FORMULA>>) is to use the Wiener
solution given by Haykin (2001):
<<FORMULA>>
Here, E[.] denotes the expected value operator, and d denotes the desired
signal. Figure 2 depicts the MSE values for 50 different realizations of
the ESNs. As observed, even though each ESN has the same sparseness
and spectral radius, the MSE values obtained vary greatly among different
realizations. The minimum MSE value obtained among the 50 realizations
is 5.9×10^−9, whereas the maximum MSE is 8.9×10^−5. This experiment
demonstrates that a design strategy that is based solely on the spectral
radius is not sufficient to specify the system architecture for function
approximation. This shows that for each set of random weights that provide
the same spectral radius, the correlation or degree of redundancy among the
bases will change, and different performances are encountered in practice.
<<FIGURE>>
Figure 2: Performances of ESNs for different realizations of W with the same
weight distribution. The weight values are set to 0.4, −0.4, and 0 with
probabilities of 0.025, 0.025, and 0.95. All realizations have the same
spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9×10^−9 to
8.9×10^−5. Results show that for each set of random weights that provide the
same spectral radius, the correlation or degree of redundancy among the bases
will change, and different performances are encountered in practice.
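The experiment above can be approximated with the following Python/NumPy sketch. It is a hedged reconstruction rather than the authors' code: the explicit rescaling of each realization to a spectral radius of 0.88, the use of `np.linalg.lstsq` as the sample least-squares counterpart of the Wiener solution, the plain mean squared error without any scaling factor, the zero initial state, and the small number of trials are all assumptions made here. No particular MSE values are claimed.

```python
import numpy as np

def sparse_reservoir(rng, n_units=100, value=0.4, p_nonzero=0.025, rho=0.88):
    """Entries are +value, -value, 0 with probabilities p, p, 1 - 2p.

    The matrix is then rescaled to spectral radius rho (an assumption made
    here so that every realization has exactly the same radius).
    """
    W = rng.choice([value, -value, 0.0], size=(n_units, n_units),
                   p=[p_nonzero, p_nonzero, 1.0 - 2.0 * p_nonzero])
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

def run_trial(seed, n_steps=300, n_units=100):
    """Train a linear readout on one random reservoir and return its MSE."""
    rng = np.random.default_rng(seed)
    W = sparse_reservoir(rng, n_units)
    w_in = rng.choice([-1.0, 1.0], size=n_units)
    n = np.arange(n_steps)
    u = np.sin(2.0 * np.pi * n / (10.0 * np.pi))   # input signal from the text
    d = u ** 7                                      # desired response
    x = np.zeros(n_units)
    X = np.empty((n_steps, n_units))
    for k in range(n_steps):
        x = np.tanh(w_in * u[k] + W @ x)            # equation 2.1, W_back = 0
        X[k] = x
    # Least-squares readout: sample counterpart of the Wiener solution.
    w_out, *_ = np.linalg.lstsq(X, d, rcond=None)
    return float(np.mean((d - X @ w_out) ** 2))

mses = [run_trial(seed) for seed in range(5)]
print(mses)
```

Comparing the entries of `mses` across seeds gives a small-scale view of the spread that Figure 2 reports for 50 realizations.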
2.2 ESN Dynamics as a Combination of Linear Systems.
It is well known that the dynamics of a nonlinear system can be approximated by
that of a linear system in a small neighborhood of an equilibrium point
(Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis
with hyperbolic tangent nonlinearities and approximate the ESN dynam-
ics by the dynamics of the linearized system in the neighborhood of the
current system state. Hence, when the system operating point varies over
time, the linear system approximating the ESN dynamics changes. We are
particularly interested in the movement of the poles of the linearized ESN.
Consider the update equation for the ESN without output feedback given
by
<<x(n+1)=f(Win u(n+1)+Wx(n))>>.
Linearizing the system around the current state x(n), one obtains the
Jacobian matrix, <<J(n+1)>>, defined by
<<J(n+1) = [f'(net_i(n)) w_ij] = F(n) W>>, (2.5)
where <<F(n) = diag(f'(net_1(n)), ..., f'(net_N(n)))>>.
Here, net_i(n) is the ith entry of the vector <<(W_in u(n+1)+Wx(n))>>, and w_ij
denotes the (i,j)th entry of W. The poles of the linearized system at time
n+1 are given by the eigenvalues of the Jacobian matrix J(n+1).¹
¹ The transfer function of a linear system <<x(n+1)=Ax(n)+Bu(n)>> is
<<X(z)/U(z) = (zI−A)^{-1} B = Adjoint(zI−A) B / det(zI−A)>>. The poles of the
transfer function can be obtained by solving <<det(zI−A)=0>>. The solution
corresponds to the eigenvalues of A.
As the amplitude of each PE changes, the local slope changes, and so the poles
of the linearized system are time varying, although the parameters of ESN are
fixed. In order to visualize the movement of the poles, consider an ESN with
100 states. The entries of the internal weight matrix are chosen to be 0,
0.4, and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a
spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with
equal probabilities. A sinusoidal signal with a period of 100 is fed to the
system, and the echo states are computed according to equation 2.1. Then
the Jacobian matrix and the eigenvalues are calculated using equation 2.5.
Figure 3 shows the pole tracks of the linearized ESN for different input
values. A single ESN with fixed parameters implements a combination of
many linear systems with varying pole locations, hence many different
time constants that modulate the richness of the reservoir of dynamics as a
function of input amplitude. Higher-amplitude portions of the signal tend
to saturate the nonlinear function and cause the poles to shrink toward
the origin of the z-plane (decreasing the spectral radius), which results in a
system with a large stability margin. When the input is close to zero, the
poles of the linearized ESN are close to the maximal spectral radius chosen,
decreasing the stability margin. Compared with a linear system with the
same number of states, an ESN provides a more detailed coverage of the
z-plane dynamics, which illustrates the power of nonlinear systems.
Similar results can be obtained using signals of different shapes at the ESN
input.
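The pole-tracking procedure described above can be sketched as follows, assuming tanh PEs so that the local slope is 1 − tanh²(net). The function name `linearized_poles`, the random seed, the zero initial state, and the run length of 300 steps are illustrative assumptions; the reservoir statistics (entries 0, 0.4, −0.4 with probabilities 0.9, 0.05, 0.05, spectral radius 0.95, input weights ±1, sinusoid of period 100) follow the text.

```python
import numpy as np

def linearized_poles(W_in, W, u_next, x):
    """Eigenvalues of J(n+1) = F(n) W for x(n+1) = tanh(W_in u(n+1) + W x(n))."""
    net = W_in @ u_next + W @ x
    slopes = 1.0 - np.tanh(net) ** 2      # derivative of tanh at each PE
    J = slopes[:, None] * W               # same as np.diag(slopes) @ W
    return np.linalg.eigvals(J)

rng = np.random.default_rng(2)
N = 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.95
W_in = rng.choice([-1.0, 1.0], size=(N, 1))

# Track the largest pole magnitude while a sinusoid of period 100 is applied.
x = np.zeros(N)
radii = []
for n in range(300):
    u_next = np.array([np.sin(2.0 * np.pi * (n + 1) / 100.0)])
    radii.append(np.max(np.abs(linearized_poles(W_in, W, u_next, x))))
    x = np.tanh(W_in @ u_next + W @ x)
radii = np.array(radii)
```

Plotting `radii` against the input would show how the largest pole magnitude moves with the input amplitude; the full eigenvalue sets returned by `linearized_poles` correspond to the pole tracks of Figure 3.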
A key corollary of the above analysis is that the spectral radius of an
ESN can be adjusted using a constant bias signal at the ESN input without
changing the recurrent connection matrix,W. The application of a nonzero
constant bias will move the operating point to regions of the sigmoid func-
tion closer to saturation and always decrease the spectral radius due to the
shape of the nonlinearity.² The relevance of bias in terms of overall system
performance has also been discussed in Jaeger (2002b) and Bertschinger
and Natschläger (2004), but here we approach it from a system theory
perspective and explain its effect on reservoir dynamics.
3 Average State Entropy as a Measure of the Richness of ESN Reservoir
Previous research was aware of the influence of diversity of the recurrent
layer outputs on the overall performance of ESNs and LSMs. Several met-
rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,
² Assume W has nondegenerate eigenvalues and corresponding linearly independent
eigenvectors. Then consider the eigendecomposition of W, where <<FORMULA>>, P is the
eigenvector matrix and D is the diagonal matrix of eigenvalues <<FORMULA>> of W. Since
F(n) and D are diagonal, <<FORMULA>> is the eigendecomposition
of <<J(n+1)>>. Here, each entry of <<FORMULA>>, is an eigenvalue of J. Therefore,
<<FORMULA>> since <<FORMULA>>.
<<FIGURE>>
Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input
goes through a cycle. An ESN with fixed parameters implements a combination
of linear systems with varying pole locations. (A) One cycle of sinusoidal signal
with a period of 100. (B–E) The positions of poles of the linearized systems
when the input values are at B, C, D, and E in Figure 3A. (F) The cumulative
pole locations show the movement of the poles as the input changes. Due to
the varying pole locations, different time constants modulate the richness of
the reservoir of dynamics as a function of input amplitude. Higher-amplitude
signals tend to saturate the nonlinear function and cause the poles to shrink
toward the origin of the z-plane (decreasing the spectral radius), which
results in a system with a large stability margin. When the input is close to
zero, the poles of the linearized ESN are close to the maximal spectral radius
chosen, decreasing the stability margin. An ESN with more states results in a
detailed coverage of the z-plane dynamics, which illustrates the power of
nonlinear systems, when compared to their linear counterpart.
Here, our approach of bases and projections leads to a new metric.
We propose the instantaneous state entropy to quantify the distribution of
instantaneous amplitudes across the ESN states. Entropy of the instanta-
neous ESN states is appropriate to quantify performance in function ap-
proximation because the ESN output is a mere weighted combination of
the instantaneous values of the ESN states. If the echo states' instantaneous
amplitudes are concentrated on only a few values across the ESN state dy-
namic range, the ability to approximate an arbitrary desired response by
weighting the states is limited (and wasteful due to redundancy between
the different states), and performance will suffer. On the other hand, if the
ESN states provide a diversity of instantaneous amplitudes, it is much eas-
ier to achieve the desired mapping. Hence, the instantaneous entropy of the
states appears as a good measure to quantify the richness of dynamics with
instantaneous mappers. Due to the time structure of signals, the average
state entropy (ASE), defined as the state entropy averaged over time, will be
the parameter used to quantify the diversity in the dynamical reservoir of
the ESN. Moreover, entropy has been proposed as an appropriate measure
of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE
measures the volume of the echo state manifold spanned by trajectories.
Renyi's quadratic entropy is employed here because it is a global measure
of information. In addition, an efficient nonparametric estimator of Renyi's
entropy, which avoids explicit pdf estimation, has been developed (Principe,
Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable
X with a pdf <<f_X(x)>> is given by Renyi (1970):
<<H_\gamma(X) = \frac{1}{1-\gamma} \log \int f_X^{\gamma}(x) \, dx>>
Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's
entropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the
unknown pdf to be estimated, Parzen windowing approximates the underlying
pdf by
<<\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i)>>,
where K_σ is the kernel function with kernel size σ. Renyi's quadratic
entropy can then be estimated by (Principe et al., 2000)
<<H_2(X) = -\log\left( \frac{1}{N^2} \sum_{j=1}^{N} \sum_{i=1}^{N} K_\sigma(x_j - x_i) \right)>>. (3.1)
The instantaneous state entropy is estimated using equation 3.1, where
the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T
of an ESN with N internal PEs. Results will be shown with a Gaussian kernel
with kernel size chosen to be 0.3 of the standard deviation of the entries
of the state vector. We will show that ASE is a more sensitive parameter to
quantify the approximation properties of ESNs by experimentally demon-
strating that ESNs with different spectral radius and even with the same
spectral radius display different ASEs.
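As an illustration, the instantaneous state entropy of a single state vector can be sketched in Python as the negative logarithm of the average Gaussian kernel evaluated over all pairwise differences of the state entries. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `renyi_quadratic_entropy` and its `kernel_scale` argument are hypothetical, and the kernel size is applied directly to the pairwise differences (some formulations use a kernel size of σ√2 at this step).

```python
import numpy as np


def renyi_quadratic_entropy(samples, kernel_scale=0.3):
    """Parzen-window estimate of Renyi's quadratic entropy for 1-D samples.

    kernel_scale is the kernel size expressed as a fraction of the sample
    standard deviation (0.3 in the text). The name is an assumption.
    """
    x = np.asarray(samples, dtype=float).ravel()
    n = x.size
    sigma = kernel_scale * x.std()
    if sigma == 0.0:
        raise ValueError("samples are constant; the kernel size would be zero")
    # All pairwise differences x_j - x_i.
    diffs = x[:, None] - x[None, :]
    # Gaussian kernel with kernel size sigma evaluated on every difference.
    kernel = np.exp(-diffs**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    # Average kernel value over all N^2 pairs, then negative log.
    return -np.log(kernel.sum() / n**2)
```

Amplitudes concentrated on a few values give a lower estimate than amplitudes spread evenly over the same range, which is the behavior the metric is meant to capture. Note that the estimate is a differential entropy and can be negative.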
Let us consider the same 100-unit ESN that we used in the previous
section built with three different spectral radii 0.2, 0.5, 0.8 with an input
signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks.
The instantaneous state entropy is also calculated at each time step using
equation 3.1 and plotted in Figure 4B. First, note that the instantaneous
state entropy changes over time with the distribution of the echo states as
we would expect, since state entropy is dependent on the input signal that
also changes in this case. Second, as the spectral radius increases in the
simulation, the diversity in the echo states increases. For the spectral radius
of 0.2, the echo states' instantaneous amplitudes are concentrated on only a
few values, which is wasteful due to redundancy between different states.
In practice, to quantify the overall representation ability over time, we will
use ASE, which takes the values -0.735, -0.007, and 0.335 for the spectral
radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral
radius, several ASEs are possible. Figure 4C shows ASEs from 50 different
realizations of ESNs with the same spectral radius of 0.5, which means that
ASE is a finer descriptor of the dynamics of the reservoir. Although we
have presented an experiment with sinusoidal signal, similar results are
obtained for other inputs as long as the input dynamic range is properly
selected.
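The ASE computation described above can be sketched as follows. This is an illustrative sketch rather than the setup used for Figure 4: the reservoir is a dense matrix with uniformly distributed weights, the input weights are ±1, the PE nonlinearity is tanh, and the names `quadratic_entropy`, `average_state_entropy`, and `washout` are assumptions introduced here. The numerical values it produces are therefore not expected to match those reported in the text.

```python
import numpy as np


def quadratic_entropy(x, kernel_scale=0.3):
    """Renyi quadratic entropy estimate of the entries of one state vector."""
    x = np.asarray(x, dtype=float)
    sigma = kernel_scale * x.std() + 1e-12  # guard against a zero kernel size
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return -np.log(k.mean())


def average_state_entropy(spectral_radius, n_units=100, n_steps=200,
                          washout=50, seed=0):
    """Average the instantaneous state entropy over time for one random ESN."""
    rng = np.random.default_rng(seed)
    # Assumed reservoir: uniform weights rescaled to the requested spectral radius.
    w = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    w_in = rng.choice([-1.0, 1.0], size=n_units)  # assumed input weights
    x = np.zeros(n_units)
    entropies = []
    for n in range(n_steps):
        u = np.sin(2.0 * np.pi * n / 20.0)  # sinusoidal input from the text
        x = np.tanh(w_in * u + w @ x)
        if n >= washout:  # skip the initial transient
            entropies.append(quadratic_entropy(x))
    return float(np.mean(entropies))
```

Calling `average_state_entropy` with several spectral radii, or with several seeds at a fixed spectral radius, gives a rough feel for how ASE varies across and within spectral radii.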
Maximizing ASE means that the diversity of the states over time is the
largest and should provide a basis set that is as uncorrelated as possible.
This condition is unfortunately not a guarantee that the ESN so designed
will perform the best, because the basis set in ESNs is created independent
of the desired response and the application may require a small spectral
radius. However, we maintain that when the desired response is not ac-
cessible for the design of the ESN bases or when the same reservoir is
to be used for a number of problems, the default strategy should be to
maximize the ASE of the state vector. The following section addresses
the design of ESNs with high ASE values and a simple mechanism to
adjust the reservoir dynamics without changing the recurrent connection
weights.
4 Designing Echo State Networks
4.1 Design of the Echo State Recurrent Connections.
According to the
interpretation of ESNs as coupled linear systems, the design of the internal
connection matrix, W, will be based on the distribution of the poles of the
linearized system around zero state. Our proposal is to design the ESN
such that the linearized system has uniform pole distribution inside the
unit circle of the z-plane. With this design scenario, the system dynamics
will include uniform coverage of time constants arising from the uniform
distribution of the poles, which also decorrelates as much as possible the
basis functionals. This principle was chosen by analogy to the identification
of linear systems using Kautz filters (Kautz, 1954), which shows that the best
approximation of a given transfer function by a linear system with finite
order is achieved when poles are placed in the neighborhood of the spectral
resonances. When no information is available about the desired response,
we should uniformly spread the poles to anticipate good approximation to
arbitrary mappings.
We again use a maximum entropy principle to distribute the poles inside
the unit circle uniformly. The constraints of a circle as boundary conditions
for discrete linear systems and complex conjugate locations are easy to
include for the pole distribution (Thogula, 2003). The poles are first
initialized at random locations; Renyi's quadratic entropy is calculated by
equation 3.1, and the poles are moved such that the entropy of the new
distribution is increased over iterations (Erdogmus & Principe, 2002). This
method is efficient to find uniform coverage of the unit circle with an arbi-
trary number of poles. The system with the uniform pole locations can be
interpreted using linear system theory. The poles that are close to the unit
circle correspond to many sharp bandpass filters specializing in different
frequency regions, whereas the inner poles realize filters of larger frequency
support. Moreover, different orientations (angles) of the poles create filters
of different center frequencies.
<<FIGURE>>
Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs
of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and 0.8,
from top to bottom, respectively. The diversity of echo states increases when the
spectral radius increases. Within the dynamic range of the echo states, systems
with smaller spectral radius can generate only uneven representations, while
for a spectral radius of 0.8, outputs of echo states almost uniformly distribute
within their dynamic range. (B) Instantaneous state entropy is calculated using
equation 3.1. Information contained in the echo states is changing over time
according to the input amplitude. Therefore, the richness of representation is
controlled by the input amplitude. Moreover, the value of ASE increases with
spectral radius. (C) ASEs from 50 different realizations of ESNs with the same
spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the
dynamics of the reservoir than the spectral radius.
Now the problem is to construct an internal weight matrix from the pole
locations (eigenvalues of W). In principle, we would like to create a sparse
matrix, so we started with the sparsest matrix (with an inverse), which is
the direct canonical structure given by (Kailath, 1980)
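<<W = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}>>. (4.1)
The characteristic polynomial of W is
<<l(s) = \det(sI - W) = s^N + a_1 s^{N-1} + \cdots + a_N = (s - p_1)(s - p_2) \cdots (s - p_N)>>, (4.2)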
where the p_i are the eigenvalues and the a_i are the coefficients of the
characteristic polynomial of W. Here, we know the pole locations of the linear system
obtained from the linearization of the ESN, so using equation 4.2, we can
obtain the characteristic polynomial and construct the W matrix in the
canonical form using equation 4.1. We will call the ESN constructed based on
the uniform pole principle ASE-ESN. All other possible solutions with the
same eigenvalues can be obtained by Q^-1 W Q, where Q is any nonsingular
matrix.
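The construction of W from a set of desired poles can be sketched as follows. The sketch assumes one common layout of the direct canonical (companion) form, with the negated characteristic-polynomial coefficients in the first row and ones on the subdiagonal; the function name `companion_from_poles` is hypothetical, and the pole set must be closed under complex conjugation so that the coefficients are real.

```python
import numpy as np


def companion_from_poles(poles):
    """Build a sparse real matrix whose eigenvalues are the given poles.

    Assumed layout: first row holds -a_1, ..., -a_N and the subdiagonal holds
    ones, where s^N + a_1 s^(N-1) + ... + a_N is the characteristic polynomial.
    """
    # np.poly returns [1, a_1, ..., a_N] for the monic polynomial prod (s - p_i).
    coeffs = np.poly(np.asarray(poles))
    if np.max(np.abs(np.imag(coeffs))) > 1e-9:
        raise ValueError("poles must come in complex-conjugate pairs")
    a = np.real(coeffs[1:])
    n = a.size
    w = np.zeros((n, n))
    w[0, :] = -a                    # coefficient row
    w[1:, :-1] = np.eye(n - 1)      # shift structure on the subdiagonal
    return w


# Example: two real poles and one complex-conjugate pair inside the unit circle.
example_poles = np.array([0.5, -0.3, 0.2 + 0.4j, 0.2 - 0.4j])
example_w = companion_from_poles(example_poles)
```

The resulting matrix has at most 2N - 1 nonzero entries. For large N, recovering eigenvalues from polynomial coefficients can become numerically ill-conditioned, so checking `np.linalg.eigvals` of the result against the intended poles is a sensible precaution.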
To corroborate our hypothesis, we would like to show that the linearized
ESN designed with the recurrent weight matrix having the eigenvalues
uniformly distributed inside the unit circle creates higher ASE values for a
given spectral radius compared to other ESNs with random internal
connection weight matrices. We will consider an ESN with 30 states and use our
procedure to create the W matrix for ASE-ESN for different spectral radii
in the range <<[0.1, 0.95]>>. Similarly, we constructed ESNs with sparse random W
matrices with different sparseness constraints. This corresponds to a weight
distribution having the values 0, c, and -c with probabilities <<p_1>>, <<(1-p_1)/2>>,
and <<(1-p_1)/2>>, where p_1 defines the sparseness of W and c is a constant
that takes a specific value depending on the spectral radius. We also created
W matrices with values uniformly distributed between -1 and 1 (U-ESN)
and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then,
for different W_in matrices, we run the ASE-ESNs with the sinusoidal input
given in section 3 and calculate ASE. Figure 5 compares the ASE values
averaged over 1000 realizations. As observed from the figure, the ASE-ESN
with uniform pole distribution generates higher ASE on average for all
spectral radii compared to ESNs with sparse and uniform random connections.
This approach is indeed conceptually similar to Jeffreys' maximum
entropy prior (Jeffreys, 1946): it will provide a consistently good response
for the largest class of problems. Concentrating the poles of the linearized
system in certain regions of the space provides good performance only if
the desired response has energy in this part of the space, as is well known
from the theory of Kautz filters (Kautz, 1954).
<<FIGURE>>
Figure 5: Comparison of ASE values obtained for ASE-ESN having W with
uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN
with uniformly distributed weights between -1 and 1. Randomly generated
weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the
networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole
distribution generates a higher ASE on average for all spectral radii compared
to ESNs with random connections.
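The two random baselines in this comparison can be generated along the following lines. This is a sketch under assumptions: the helper names are hypothetical, `p_zero` plays the role of p_1 (the probability of a zero weight), and the constant c is obtained implicitly by drawing weights from {0, 1, -1} and rescaling the matrix to the target spectral radius.

```python
import numpy as np


def scale_to_spectral_radius(w, target_radius):
    """Rescale w so that its largest eigenvalue magnitude equals target_radius."""
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    if radius == 0.0:
        raise ValueError("matrix has zero spectral radius and cannot be rescaled")
    return w * (target_radius / radius)


def sparse_random_reservoir(n_units, p_zero, target_radius, rng):
    """Weights 0, c, -c with probabilities p_zero, (1-p_zero)/2, (1-p_zero)/2."""
    p_nonzero = (1.0 - p_zero) / 2.0
    w = rng.choice([0.0, 1.0, -1.0], size=(n_units, n_units),
                   p=[p_zero, p_nonzero, p_nonzero])
    # After rescaling, the nonzero entries are +c and -c for a single constant c.
    return scale_to_spectral_radius(w, target_radius)


def uniform_random_reservoir(n_units, target_radius, rng):
    """Weights uniform in [-1, 1], rescaled to the target spectral radius."""
    w = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    return scale_to_spectral_radius(w, target_radius)
```

Passing a seeded `np.random.default_rng` makes each realization reproducible, which is convenient when averaging ASE over many realizations.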
4.2 Design of the Adaptive Bias.
In conventional ESNs, only the output weights are trained, optimizing the
projections of the desired response onto the basis functions (echo states).
Since the dynamical reservoir is fixed,
the basis functions are only input dependent. However, since function ap-
proximation is a problem in the joint space of the input and desired signals,
a penalty in performance will be incurred. From the linearization analysis
that shows the crucial importance of the operating point of the PE non-
linearity in defining the echo state dynamics, we propose to use a single
external adaptive bias to adjust the effective spectral radius of an ESN.
Notice that, according to the linearization analysis, the bias can only reduce
the spectral radius. The information for adaptation of the bias is the MSE in
training, which modulates the spectral radius of the system with the information
derived from the approximation error. With this simple mechanism, some information
from the input-output joint space is incorporated in the definition of the
projection space of the ESN. The beauty of this method is that the spectral
radius can be adjusted by a single parameter that is external to the system
without changing reservoir weights.
The training of bias can be easily accomplished. Indeed, since the pa-
rameter space is only one-dimensional, a simple line search method can be
efficiently employed to optimize the bias. Among different line search al-
gorithms, we will use a search that uses Fibonacci numbers in the selection
of points to be evaluated (Wilde, 1964). The Fibonacci search method min-
imizes the maximum number of evaluations needed to reduce the interval
of uncertainty to within the prescribed length. In our problem, a bias value
is picked according to Fibonacci search. For each value of bias, training
data are applied to the ESN, and the echo states are calculated. Then the
corresponding optimal output weights and the objective function (MSE)
are evaluated to pick the next bias value.
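The line search over the bias can be sketched with a generic Fibonacci search for a unimodal one-dimensional objective. This is a minimal sketch, not the authors' code: the function name `fibonacci_search` and its arguments are assumptions, and in the ESN setting `objective(bias)` would stand for a hypothetical routine that runs the training data through the ESN with that bias, solves for the optimal output weights, and returns the training MSE.

```python
def fibonacci_search(objective, lower, upper, n=25):
    """Minimize a unimodal 1-D objective on [lower, upper] by Fibonacci search.

    n is the order of the Fibonacci number that sets the final resolution;
    the search uses n - 1 objective evaluations (n must be at least 3).
    """
    fib = [1, 1]
    while len(fib) <= n:
        fib.append(fib[-1] + fib[-2])
    a, b = float(lower), float(upper)
    # Two interior points placed at Fibonacci ratios of the interval.
    x1 = a + fib[n - 2] / fib[n] * (b - a)
    x2 = a + fib[n - 1] / fib[n] * (b - a)
    f1, f2 = objective(x1), objective(x2)
    for k in range(1, n - 2):
        if f1 > f2:
            # Minimum lies in [x1, b]; the old x2 becomes the new x1.
            a, x1, f1 = x1, x2, f2
            x2 = a + fib[n - k - 1] / fib[n - k] * (b - a)
            f2 = objective(x2)
        else:
            # Minimum lies in [a, x2]; the old x1 becomes the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = a + fib[n - k - 2] / fib[n - k] * (b - a)
            f1 = objective(x1)
    # Final reduction using the two points already evaluated.
    if f1 > f2:
        a = x1
    else:
        b = x2
    return 0.5 * (a + b)


# Toy stand-in for the training MSE as a function of the bias.
best_bias = fibonacci_search(lambda bias: (bias - 2.7) ** 2, 0.0, 5.0)
```

Each iteration reuses one of the two previous evaluations, so only one new ESN run and one new output-weight solve would be needed per step.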
Alternatively, gradient-based methods can be utilized to optimize the
bias, due to their simplicity and low computational cost. The system update equation
with an external bias signal, b, is given by
<<x(n+1) = f(W_in u(n+1) + W_in b + W x(n))>>.
The update equation for b is given by
<<FORMULA>>
Here, O is the MSE defined previously. This algorithm may suffer from
similar problems observed in gradient-based methods in recurrent net-
works training. However, we observed that the performance surface is
rather simple. Moreover, since the search parameter is one-dimensional,
the gradient vector can assume only one of the two directions. Hence, im-
precision in the gradient estimation should affect the speed of convergence
but normally not change the correct gradient direction.
5 Experiments
This section presents a variety of experiments in order to test the validity
of the ESN design scheme proposed in the previous section.
5.1 Short-Term Memory Capacity.
This experiment compares the short-term memory (STM) capacity of ESNs
with the same spectral radius using the framework presented in Jaeger (2002a).
Consider an ESN with a single input signal, <<u(n)>>, optimally trained with
the desired signal <<u(n-k)>>, for a given delay k. Denoting the optimal output
signal by y_k(n), the k-delay STM capacity of a network, MC_k, is defined as
the squared correlation coefficient between <<u(n-k)>> and <<y_k(n)>>
(Jaeger, 2002a). The STM capacity, MC, of the network is defined as
<<\sum_{k=1}^{\infty} MC_k>>. STM capacity measures how accurately
the delayed versions of the input signal are recovered with optimally
trained output units. Jaeger (2002a) has shown that the memory capacity
for recalling an independent and identically distributed (i.i.d.) input by an
N-unit RNN with linear output units is bounded by N.
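The k-delay capacities can be estimated along the following lines. This is a sketch under assumptions, not the experimental code: the names `run_reservoir` and `delay_memory_capacities` are hypothetical, the readout is fitted by ordinary least squares with a constant term and without the direct input-to-output connection, the infinite sum is truncated at `max_delay`, and the correlation is computed on the same data used for fitting, whereas the experiment below evaluates on a separate test signal.

```python
import numpy as np


def run_reservoir(w, w_in, u):
    """Collect echo states x(n+1) = tanh(w_in * u(n+1) + w @ x(n)) from x = 0."""
    x = np.zeros(w.shape[0])
    states = np.empty((len(u), w.shape[0]))
    for n, u_n in enumerate(u):
        x = np.tanh(w_in * u_n + w @ x)
        states[n] = x
    return states


def delay_memory_capacities(states, u, max_delay, washout=0):
    """Return MC_k for k = 1..max_delay; their sum approximates MC.

    states has shape (T, N), with row n aligned with u[n].
    """
    u = np.asarray(u, dtype=float)
    capacities = np.empty(max_delay)
    for k in range(1, max_delay + 1):
        start = max(washout, k)
        x = states[start:]
        target = u[start - k: len(u) - k]          # u(n - k) for each row n
        design = np.column_stack([x, np.ones(len(x))])
        weights, *_ = np.linalg.lstsq(design, target, rcond=None)
        r = np.corrcoef(target, design @ weights)[0, 1]
        capacities[k - 1] = r**2                   # squared correlation coefficient
    return capacities
```

Summing the returned array gives the truncated STM capacity; for a reservoir that stores the last N inputs exactly, the first N entries are 1 and the sum is close to N, consistent with the bound stated above.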
We use ESNs with 20 PEs and a single input unit. ESNs are driven
by an i.i.d. random input signal, <<u(n)>>, that is uniformly distributed over
[-0.5, 0.5]. The goal is to train the ESN to generate the delayed versions
of the input, <<u(n-1),...,u(n-40)>>. We used four different ESNs: R-ESN,
U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN
used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47,
-0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a
sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of
U-ESN are uniformly distributed over [-1, 1] and scaled to obtain the spectral
radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed
with uniform poles. BASE-ESN has the same recurrent weight matrix as
ASE-ESN and an adaptive bias at its input. In each ESN, the input weights
are set to 0.1 or -0.1 with equal probability, and direct connections from the
input to the output are allowed, whereas W_back is set to 0 (Jaeger, 2002a).
The echo states are calculated using equation 2.1 for 200 samples of the
input signal, and the first 100 samples corresponding to initial transient
are eliminated. Then the output weight matrix is calculated using equation
2.4. For the BASE-ESN, the bias is trained for each task. All networks are
run with a test input signal, and the corresponding output and MC_k are
calculated. Figure 6 shows the k-delay STM capacity (averaged over 100
trials) of each ESN for delays 1,...,40 for the test signal. The STM capacities
of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70,
and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-ESN
and BASE-ESN) have MCs that are much larger than that of the randomly
generated ESN given in Jaeger (2002a), in spite of all having the same spectral
radius. In fact, the STM capacity of ASE-ESN is close to the theoretical
maximum value of N = 20. A closer look at the figure shows that R-ESN performs
slightly better than ASE-ESN for delays less than 9. In fact, for small
k, large ASE degrades the performance because the tasks do not need long
memory depth. However, the drawback of high ASE for small k is recovered
in BASE-ESN, which reduces the ASE to the appropriate level required
for the task. Overall, the addition of the bias to the ASE-ESN increases the
STM capacity from 16.70 to 16.90. On the other hand, U-ESN, although it has
many more distinct weight values, has only slightly better STM capacity than
R-ESN, which uses only three different weight values. It is also
significant to note that the MC will be very poor for an ESN with smaller
spectral radius even with an adaptive bias, since the problem requires large
ASE and bias can only reduce ASE. This experiment demonstrates the
suitability of maximizing ASE in tasks that require a substantial memory
length.
<<FIGURE>>
Figure 6: The k-delay STM capacity of each ESN for delays 1,...,40 computed
using the test signal. The results are averaged over 100 different realizations of
each ESN type with the specifications given in the text for different W and W_in
matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are
13.09, 13.55, 16.70, and 16.90, respectively.
5.2 Binary Parity Check.
The effect of the adaptive bias was marginal
in the previous experiment since the nature of the problem required large
ASE values. However, there are tasks in which the optimal solutions re-
quire smaller ASE values and smaller spectral radius. Those are the tasks
where the adaptive bias becomes a crucial design parameter in our design
methodology.
Consider an ESN with 100 internal units and a single input unit. The ESN is
driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal
is to train an ESN to generate the m-bit parity corresponding to the last m bits
received, where m is 3,...,8. Similar to the previous experiments, we used
the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly
connected ESN where the entries of the W matrix are set to 0, 0.06, -0.06
with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse
connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN
are designed with a spectral radius of 0.9. The input weights are set to 1 or -1
with equal probability, and direct connections from the input to the output
are allowed, whereas W_back is set to 0. The echo states are calculated using
equation 2.1 for 1000 samples of the input signal, and the first 100 samples
corresponding to the initial transient are eliminated. Then the output weight
matrix is calculated using equation 2.4. For the ESN with adaptive bias, the bias
is trained for each task. The binary decision is made by a threshold detector
that compares the output of the ESN to 0.5. Figure 7 shows the number of
wrong decisions (averaged over 100 different realizations) made by each
ESN for <<m=3,...,8>>.
<<FIGURE>>
Figure 7: The number of wrong decisions made by each ESN for m = 3,...,8
in the binary parity check problem. The results are averaged over 100 different
realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and W_in
matrices with the specifications given in the text. The total numbers of wrong
decisions for m = 3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and
699.
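The desired signal and the error count for this task can be sketched as follows. The helper names `parity_targets` and `count_wrong_decisions` are assumptions introduced for illustration, and the first m - 1 targets, for which fewer than m bits have been received, are set to 0 by convention here.

```python
import numpy as np


def parity_targets(bits, m):
    """m-bit parity (1 if the number of ones is odd) of the last m input bits."""
    bits = np.asarray(bits, dtype=int)
    targets = np.zeros_like(bits)
    for n in range(m - 1, len(bits)):
        targets[n] = bits[n - m + 1: n + 1].sum() % 2
    return targets


def count_wrong_decisions(outputs, targets, threshold=0.5):
    """Threshold the ESN output at 0.5 and count disagreements with the targets."""
    decisions = (np.asarray(outputs) > threshold).astype(int)
    return int(np.sum(decisions != np.asarray(targets)))
```

For example, `parity_targets([1, 1, 1, 0, 0, 1], 3)` yields `[0, 0, 1, 0, 1, 1]`, where the first two entries are the placeholder values.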
The total numbers of wrong decisions for <<m=3,...,8>> of R-ESN, ASE-
ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs
poorly since the nature of the problem requires a short time constant for
fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the
R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions.
BASE-ESN performs a lot better than ASE-ESN and slightly better than
the R-ESN since the adaptive bias reduces the spectral radius effectively.
Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN,
since the task requires access to longer input history, which compromises
the need for fast response. Indeed, the bias in the BASE-ESN takes effect
when there are errors (m > 4) and when the task benefits from smaller
spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and
2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide
range of bias values that result in similar MSE values (between 0 and 3). In
summary, this experiment clearly demonstrates the power of the bias signal
to configure the ESN reservoir according to the mapping task.
5.3 System Identification.
This section presents a function approxima-
tion task where the aim is to identify a nonlinear dynamical system. The
unknown system is defined by the difference equation
<<y(n+1)=0.3y(n)+0.6y(n-1)+f(u(n))>>,
where
<<f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu)>>.
The input to the system is chosen to be <<sin(2πn/25)>>.
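The unknown system can be simulated directly from its difference equation to produce the desired response for training. This is a minimal sketch: the function names are hypothetical, and zero initial conditions, y(0) = y(1) = 0, are assumed because the text does not state them.

```python
import numpy as np


def nonlinearity(u):
    """f(u) = 0.6 sin(pi u) + 0.3 sin(3 pi u) + 0.1 sin(5 pi u)."""
    return (0.6 * np.sin(np.pi * u)
            + 0.3 * np.sin(3.0 * np.pi * u)
            + 0.1 * np.sin(5.0 * np.pi * u))


def simulate_plant(n_steps):
    """Return the input u(n) = sin(2 pi n / 25) and the plant output y(n)."""
    n = np.arange(n_steps)
    u = np.sin(2.0 * np.pi * n / 25.0)
    y = np.zeros(n_steps + 1)  # assumed zero initial conditions
    for t in range(1, n_steps):
        # y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n))
        y[t + 1] = 0.3 * y[t] + 0.6 * y[t - 1] + nonlinearity(u[t])
    return u, y[:n_steps]
```

The linear part has both poles inside the unit circle, so the output stays bounded for this bounded input.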
We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with
30 internal units and a single input unit. The W matrix of each ESN is scaled
such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN
where the entries of the W matrix are set to 0, 0.35, -0.35 with probabilities 0.8,
0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or -1 with
equal probability, and direct connections from the input to the output are
allowed, whereas W_back is set to 0. The optimal output weights are calculated
using equation 2.4. The MSE values (averaged over 100 realizations) for R-ESN
and ASE-ESN are 1.23 x 10^-5 and 1.83 x 10^-6, respectively. The addition
of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83 x 10^-6
to 3.27 x 10^-9.
6 Discussion
The great appeal of echo state networks (ESNs) and liquid state machines
(LSMs) is their ability to construct arbitrary mappings of signals with rich
and time-varying temporal structures without requiring adaptation of the
free parameters of the recurrent layer. The echo state condition allows the
recurrent connections to be fixed with training limited to the linear output
layer. However, the literature has not explained how to properly choose
the recurrent parameters for system identification applications. Here, we
provide an alternate framework that interprets the echo states as a set
of functional bases formed by fixed nonlinear combinations of the input.
The linear readout at the output stage simply computes the projection of
the desired output space onto this representation space. We further in-
troduce an information-theoretic criterion, ASE, to better understand and
evaluate the capability of a given ESN to construct such a representation
layer. The average entropy of the distribution of the echo states quantifies
the volume spanned by the bases. As such, this volume should be the largest
to achieve the smallest correlation among the bases and be able to cope with
arbitrary mappings. However, not all function approximation problems re-
quire the same memory depth, which is coupled to the spectral radius. The
effective spectral radius of an ESN can be optimized for the given problem
with the help of an external bias signal that is adapted using the joint input-
output space information. The interesting property of this method when
applied to ESN built from sigmoidal nonlinearities is that it allows the fine
tuning of the system dynamics for a given problem with a single external
adaptive bias input and without changing internal system parameters. In
our opinion, the combination of the largest possible ASE and the adapta-
tion of the spectral radius by the bias produces the most parsimonious pole
location of the linearized ESN when no knowledge about the mapping is
available to optimally locate the basis functionals. Moreover, the bias can be
easily trained with either a line search method or a gradient-based method
since it is one-dimensional. We have illustrated experimentally that the de-
sign of the ESN using the maximization of ASE with the adaptation of the
spectral radius by the bias has provided consistently better performance
across tasks that require different memory depths. This means that this
two-parameter design methodology is preferred to the spectral radius
criterion proposed by Jaeger, and it is still easily incorporated in the ESN
design.
Experiments demonstrate that the ASE for ESN with uniform linearized
poles is maximized when the spectral radius of the recurrent weight matrix
approaches one (instability). It is interesting to relate this observation with
the computational properties found in dynamical systems “at the edge of
chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993;
Bertschinger & Natschläger, 2004). Langton stated that when cellular
automata rules are evolved to perform a complex computation, evolution will
tend to select rules with “critical” parameter values, which correlate with
a phase transition between ordered and chaotic regimes. Recently, similar
conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004).
Langton's interpretation of edge of chaos was questioned by Mitchell et al.
(1993). Here, we provide a system-theoretic view and explain the computa-
tional behavior with the diversity of dynamics achieved with linearizations
that have poles close to the unit circle. According to our results, the spectral
radius of the optimal ESN in function approximation is problem dependent,
and in general it is impossible to forecast the computational performance
as the system approaches instability (the spectral radius of the recurrent
weight matrix approaches one). However, allowing the system to modu-
late the spectral radius by either the output or internal biasing may allow
a system close to instability to solve various problems requiring different
spectral radii.
Our emphasis here is mostly on ESNs without output feedback connec-
tions. However, the proposed design methodology can also be applied to
ESNs with output feedback. Both feedforward and feedback connections
contribute to specify the bases to create the projection space. At the same
time, there are applications where the output feedback contributes to the
system dynamics in a different fashion. For example, it has been shown that
a fixed weight (fully trained) RNN with output feedback can implement a
family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992).
In meta-learning, the role of output feedback in the network is to bias the
system to different regions of dynamics, providing multiple input-output
mappings required (Santiago & Lendaris, 2004). However, results could not
be replicated with ESNs (Prokhorov, 2005). We believe that more work has
to be done on output feedback in the context of ESNs but also suspect that
the echo state condition may be a restriction on the system dynamics for
this type of problem.
There are many interesting issues to be researched in this exciting new
area. Besides serving as an evaluation tool, ASE may also be utilized to train
the ESN's representation layer in an unsupervised fashion. In fact, we can easily
adapt, with the SIG (stochastic information gradient) described in Erdogmus, Hild,
and Principe (2003), extra weights linking the outputs of recurrent states to
maximize output entropy. Output entropy maximization is a well-known
metric to create independent components (Bell & Sejnowski, 1995), and
here it means that the echo states will become as independent as possible.
This would circumvent the linearization of the dynamical system to set the
recurrent weights and would fine-tune continuously in an unsupervised
manner the parameters of the ESN among different inputs. However, it
goes against the idea of a fixed ESN reservoir.
The reservoir of recurrent PEs can be thought of as a new form of a time-
to-space mapping. Unlike the delay line that forms an embedding (Takens,
1981), this mapping may have the advantage of filtering noise and producing
representations with better SNRs to the peaks of the input, which is very
appealing for signal processing and seems to be used in biology. However,
further theoretical work is necessary in order to understand the embedding
capabilities of ESNs. One of the disadvantages of the ESN correlated basis
is in the design of the readout. Gradient-based algorithms will be very
slow to converge (due to the large eigenvalue spread of modes), and even
if recursive methods are used, their stability may be compromised by the
condition number of the matrix. However, our recent results incorporating
an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of
solving this problem.
Finally we would like to briefly comment on the implications of these
models to neurobiology and computational neuroscience. The work by
Pouget and Sejnowski (1997) has shown that the available physiological
data are consistent with the hypothesis that the response of a single neuron
in the parietal cortex serves as a basis function generated by the sensory
input in a nonlinear fashion. In other words, the neurons transform the
sensory input into a format (representation space) such that the subsequent
computation is simplified. Then, whenever a motor command (output of
the biological system) needs to be generated, this simple computation to
read out the neuronal activity is done. There is an intriguing similarity
between the interpretation of the neuronal activity by Pouget and Sejnowski
and our interpretation of echo states in ESN. We believe that similar ideas
can be applied to improve the design of microcircuit implementations of
LSMs. First, the framework of functional space interpretation (bases and
projections) is also applicable to microcircuits. Second, the ASE measure
may be directly utilized for LSM states because the states are normally low-
pass-filtered before the readout. However, the control of ASE by changing
the liquid dynamics is unclear. Perhaps global control of thresholds or bias
current will be able to accomplish bias control as in ESN with sigmoid
PEs.
Acknowledgments
This work was partially supported by NSF ECS-0422718, NSF CNS-0540304,
and ONR N00014-1-1-0405.
References
Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features,
categorical perception, and probability learning: Some applications of a neural model.
Psychological Review, 84, 413-451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach
to blind separation and blind deconvolution. Neural Computation, 7(6),
1129-1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos
in recurrent neural networks. Neural Computation, 16(7), 1413-1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal
of Physics, 14(1), 1-13.
de Vries, B. (1991).Temporal processing with neural networks—the development of the
gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural
network for system identification and control.IEEE Proceedings of Control Theory
and Applications, 142(4), 307314.
Elman, J. L. (1990). Finding structure in time.Cognitive Science, 14(2), 179211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation:
Stochastic information gradient.Signal Processing Letters, 10(8), 242245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for
adaptive system training.IEEE Transactions on Neural Networks, 13(5), 10351044.
Feldkamp,L.A.,Prokhorov,D.V.,Eagen,C.,&Yuan,F.(1998).Enhancedmultistream
Kalman filter training for recurrent networks. In J. Suykens, & J. Vandewalle
(Eds.),Nonlinear modeling: Advanced black-box techniques(pp. 2953). Dordrecht,
Netherlands: Kluwer. 136 M. Ozturk, D. Xu, and J. Pr´ıncipe
Haykin,S.(1998).Neuralnetworks:Acomprehensivefoundation(2nded.).UpperSaddle
River, NJ. Prentice Hall.
Haykin, S. (2001).Adaptive filter theory(4th ed.). Upper Saddle River, NJ: Prentice
Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8), 1735-1780.
Hopfield, J. (1984). Neurons with graded response have collective computational
properties like those of two-state neurons. Proceedings of the National Academy of
Sciences, 81, 3088-3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics,
5(1), 189-203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural
networks (Tech. Rep. No. 148). Bremen: German National Research Center for
Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152).
Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL,
EKF and the “echo state network” approach (Tech. Rep. No. 159). Bremen: German
National Research Center for Information Technology.
Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
and saving energy in wireless communication. Science, 304(5667), 78-80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems.
Proceedings of the Royal Society of London, A 196, 453-461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit
Theory, 1(3), 29-39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks
for adaptive communication channel equalization. IEEE Transactions on Neural
Networks, 5(2), 267-278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks.
IEEE Transactions on Neural Networks, 6(5), 1000-1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation
theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12-37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the
computational power and generalization capability of neural microcircuits. In
L. K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in neural information processing
systems, no. 17 (pp. 865-872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations.
Neural Computation, 14(11), 2531-2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos:
Evolving cellular automata to perform computations. Complex Systems, 7, 89-130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J.
Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293-301).
Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex
using basis functions. Journal of Cognitive Neuroscience, 9(2), 222-237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In
Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6-28).
Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter - a new
class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal
Processing, 41(2), 649-656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin
(Ed.), Unsupervised adaptive filtering (pp. 265-319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International
Joint Conference on Neural Networks (pp. 1463-1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed
weights in recurrent neural networks: An overview. In Proc. of International Joint
Conference on Neural Networks (pp. 2018-2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems
with Kalman filter trained recurrent networks. IEEE Transactions on Neural
Networks, 5(2), 279-297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied
to on-vehicle idle speed control. Proceedings of IEEE, 84(10), 1407-1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev,
M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with
echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and
Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis.
Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks:
Reformulating fixed weight neural networks. In Proc. of International Joint
Conference on Neural Networks (pp. 189-194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in
multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10-18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27, 623-656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral
dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77-80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing
systems, 1 (pp. 133-140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S.
Young (Eds.), Dynamical systems and turbulence (pp. 366-381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished
master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it.
Proceedings of IEEE, 78(10), 1550-1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation.
In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65-89). New
York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running
fully recurrent neural networks. Neural Computation, 1, 270-280.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Bayesian Compression for Deep Learning
Christos Louizos (University of Amsterdam, TNO Intelligent Imaging), c.louizos@uva.nl
Karen Ullrich (University of Amsterdam), k.ullrich@uva.nl
Max Welling (University of Amsterdam, CIFAR), m.welling@uva.nl
Abstract
Compression and computational efficiency in deep learning have become a problem
of great significance. In this work, we argue that the most principled and effective
way to attack this problem is by adopting a Bayesian point of view, where through
sparsity inducing priors we prune large parts of the network. We introduce two
novelties in this paper: 1) we use hierarchical priors to prune nodes instead of
individual weights, and 2) we use the posterior uncertainties to determine the
optimal fixed point precision to encode the weights. Both factors significantly
contribute to achieving the state of the art in terms of compression rates, while
still staying competitive with methods designed to optimize for speed or energy
efficiency.
1 Introduction
While deep neural networks have become extremely successful in a wide range of applications,
often exceeding human performance, they remain difficult to apply in many real-world scenarios. For
instance, making billions of predictions per day comes with substantial energy costs given the energy
consumption of common Graphics Processing Units (GPUs). Also, real-time predictions are often
about a factor of 100 away in terms of speed from what deep NNs can deliver, and sending NNs with
millions of parameters through band-limited channels is still impractical. As a result, running them on
hardware-limited devices such as smartphones, robots or cars requires substantial improvements on
all of these issues. For all those reasons, compression and efficiency have become a topic of interest
in the deep learning community.
While all of these issues are certainly related, compression and performance optimizing procedures
might not always be aligned. As an illustration, consider the convolutional layers of Alexnet, which
account for only 4% of the parameters but 91% of the computation [68]. Compressing these layers
will not contribute much to the overall memory footprint.
There is a variety of approaches to address these problem settings. However, most methods have
the common strategy of reducing both the neural network structure and the effective fixed point
precision for each weight. A justification for the former is the finding that NNs suffer from significant
parameter redundancy [14]. Methods in this line of thought are network pruning, where unnecessary
connections are being removed [40,24,21], or student-teacher learning where a large network is
used to train a significantly smaller network [5, 27].
From a Bayesian perspective, network pruning and reducing bit precision for the weights are aligned
with achieving high accuracy, because Bayesian methods search for the optimal model structure
(which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters
through the bits back argument [28] (which leads to removing insignificant bits). This relation is
made explicit in the MDL principle [20] which is known to be related to Bayesian inference.
In this paper we will use the variational Bayesian approximation for Bayesian inference which has
also been explicitly interpreted in terms of model compression [28]. By employing sparsity inducing
priors for hidden units (and not individual weights) we can prune neurons including all their ingoing
and outgoing weights. This avoids more complicated and inefficient coding schemes needed for
pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the
variational posterior uncertainty to assess which bits are significant and remove the ones which
fluctuate too much under approximate posterior sampling. From this we derive the optimal fixed
point precision per layer, which is still practical on chip.
2 Variational Bayes and Minimum Description Length
A fundamental theorem in information theory is the minimum description length (MDL) principle [20].
It relates to compression directly in that it defines the best hypothesis to be the one that communicates
the sum of the model (complexity cost <<L^C>>) and the data misfit (error cost <<L^E>>) with the minimum
number of bits [59,60]. It is well understood that variational inference can be reinterpreted from an
MDL point of view [56,72,28,30,19]. More specifically, assume that we are presented with a dataset
D that consists of N input-output pairs <<FORMULA>>. Let <<FORMULA>>
be a parametric model, e.g. a deep neural network, that maps inputs x to their corresponding outputs
y using parameters w governed by a prior distribution <<p(w)>>. In this scenario, we wish to approximate
the intractable posterior distribution <<p(w|D) = p(D|w)p(w)/p(D)>> with a fixed form approximate
posterior <<q(w)>> by optimizing the variational parameters according to:
<<FORMULA>> (1)
where <<H(\cdot)>> denotes the entropy and <<L(\phi)>> is known as the evidence lower bound (ELBO) or negative
variational free energy. As indicated in eq. 1, <<L(\phi)>> naturally decomposes into a minimum cost for
communicating the targets <<FORMULA>> under the assumption that the sender and receiver agreed on a
prior <<p(w)>> and that the receiver knows the inputs <<FORMULA>> and form of the parametric model.
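For orientation, the decomposition described above matches the standard form of the evidence lower bound. The following is a sketch in conventional notation rather than a recovery of the extracted equation; the symbol \phi for the variational parameters and the subscripted q_\phi are assumptions of this sketch:

```latex
\mathcal{L}(\phi)
  = \underbrace{\mathbb{E}_{q_\phi(w)}\big[\log p(\mathcal{D} \mid w)\big]}_{\mathcal{L}^{E}}
  + \underbrace{\mathbb{E}_{q_\phi(w)}\big[\log p(w)\big] + \mathcal{H}\big(q_\phi(w)\big)}_{\mathcal{L}^{C}}
```

Read this way, the error cost is the expected log-likelihood of the targets, while the complexity cost collects the prior term and the entropy of the approximate posterior.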
By using sparsity inducing priors for groups of weights that feed into a neuron the Bayesian mecha-
nism will start pruning hidden units that are not strictly necessary for prediction and thus achieving
compression. But there is also a second mechanism by which Bayes can help us compress. By
explicitly entertaining noisy weight encodings through <<q(w)>> we can benefit from the bits-back
argument [28,30] due to the entropy term; this is in contrast to infinitely precise weights that lead to
<<FORMULA>>. Nevertheless, in practice the data misfit term <<L^E>> is intractable for neural network
models under a noisy weight encoding, so as a solution Monte Carlo integration is usually employed.
A continuous q(w) allows for the reparametrization trick [36,58]. Here, we replace sampling from
q(w) by a deterministic function of the variational parameters <<\phi>> and random samples from some
noise variables <<\epsilon>>:
<<FORMULA>>; (2)
where <<w = f(\phi, \epsilon)>>. By applying this trick, we obtain unbiased stochastic gradients of the ELBO
with respect to the variational parameters, thus resulting in a standard optimization problem that is
fit for stochastic gradient ascent. The efficiency of the gradient estimator resulting from eq. 2 can be
further improved for neural networks by utilizing local reparametrizations [37] (which we will use in
our experiments); they provide variance reduction in an efficient way by locally marginalizing the
weights at each layer and instead sampling the distribution of the pre-activations.
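The reparametrization trick can be illustrated on a one-dimensional toy problem. The sketch below assumes a Gaussian q(w) with mean `mu` and standard deviation `sigma`, and uses the toy objective E[w^2] in place of the ELBO; the function name and the objective are illustrative choices, not part of the method described above.

```python
import random

def reparam_grad_mu(mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of d/d(mu) E_{w ~ N(mu, sigma^2)}[w^2].

    Sampling w directly would hide the dependence on mu; writing
    w = mu + sigma * eps makes w a deterministic function of (mu, sigma)
    and of parameter-free noise eps, so the gradient passes through w.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise variable, independent of mu and sigma
        w = mu + sigma * eps        # deterministic transform of parameters and noise
        total += 2.0 * w            # d(w^2)/dw times dw/dmu, where dw/dmu = 1
    return total / n_samples

estimate = reparam_grad_mu(mu=1.5, sigma=0.5)
exact = 2.0 * 1.5                   # E[w^2] = mu^2 + sigma^2, so the gradient is 2*mu
print(estimate, exact)
```

The estimate is unbiased for any number of samples; more samples only reduce its variance.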
3 Related Work
One of the earliest and most direct approaches to tackle efficiency is pruning. Originally
introduced by [40], pruning has recently been demonstrated to be applicable to modern architectures
[25,21]. It has been demonstrated that up to 99.5% of parameters
can be pruned in common architectures. There have been quite a few encouraging results obtained
by (empirical) Bayesian approaches that employ weight pruning [19,7,52,70,51]. Nevertheless,
weight pruning is in general inefficient for compression since the matrix format of the weights is not
taken into consideration, therefore the Compressed Sparse Column (CSC) format has to be employed.
Moreover, note that in conventional CNNs most flops are used by the convolution operation. Inspired
by this observation, several authors proposed pruning schemes that take these considerations into
account [73, 74] or even go as far as efficiency aware architectures to begin with [32, 15, 31]. From
the Bayesian viewpoint, similar pruning schemes have been explored in [47, 53, 39, 34].
2 In practice this term is a large constant determined by the weight precision.
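To make the storage overhead of weight pruning concrete, the following sketch converts a small weight-pruned matrix to the CSC format. The helper `to_csc` and the example matrix are hypothetical; the point is only that every surviving weight needs an accompanying row index plus column pointers, whereas removing a whole group (here, an all-zero row) leaves a smaller dense matrix with no index overhead.

```python
def to_csc(dense):
    """Convert a dense matrix (list of rows) into CSC arrays.

    Returns (values, row_idx, col_ptr): the non-zero values in column-major
    order, the row index of each value, and for each column the offset of
    its first value (with one final entry equal to the number of values).
    """
    n_rows, n_cols = len(dense), len(dense[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0.0:
                values.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr

# A weight-pruned 4x3 matrix: zeros are scattered and row 2 is entirely zero.
W = [[0.0, 0.7, 0.0],
     [0.2, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.5, 0.9]]

values, row_idx, col_ptr = to_csc(W)
csc_entries = len(values) + len(row_idx) + len(col_ptr)   # numbers that must be stored
print(values, row_idx, col_ptr, csc_entries)
```

In this toy case four surviving weights cost twelve stored numbers, which is the kind of overhead that group pruning avoids.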
Given an optimal architecture, NNs can further be compressed by quantization. More precisely, there
are two common techniques. First, the set of accessible weights can be reduced drastically. As an
extreme example, [13,48,57,76] and [11] trained NNs to use only binary or ternary weights with
floating point gradients. This approach, however, needs significantly more parameters than
its ordinary counterparts. Work by [18] explores various techniques beyond binary quantization:
k-means quantization, product quantization and residual quantization. Later studies extend this set to
optimal fixed point [44] and hashing quantization [10]. [25] apply k-means clustering and subsequent
center training. From a practical point of view, however, all these are fairly impractical during
test time. For the computation of each feature map in a net, the original weight matrix must be
reconstructed from the indexes in the matrix and a codebook that contains all the original weights.
This is an expensive operation, which is why some studies propose a different approach than set
quantization. Precision quantization simply reduces the bit size per weight. This has a great advantage
over set quantization at inference time since feature maps can simply be computed with lower-precision
weights. Several studies show that this has little to no effect on network accuracy when using 16-bit
weights [49,22,12,71,9]. Somewhat orthogonal to the above discussion but certainly relevant are
approaches that customize the implementation of CNNs for hardware-limited devices [31, 4, 62].
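The difference between the two quantization styles can be sketched in a few lines. Both helpers below are illustrative: `reconstruct_from_codebook` mimics set quantization, where a weight matrix must be rebuilt from indexes and a codebook before use, and `round_to_fixed_point` mimics precision quantization with a simple round-to-nearest rule on a fixed number of fractional bits (an assumed rule, chosen for brevity).

```python
def reconstruct_from_codebook(indexes, codebook):
    """Set quantization: look every stored index up in the shared codebook."""
    return [[codebook[k] for k in row] for row in indexes]

def round_to_fixed_point(weights, frac_bits):
    """Precision quantization: keep `frac_bits` fractional bits per weight."""
    scale = 2 ** frac_bits
    return [[round(w * scale) / scale for w in row] for row in weights]

# Set quantization: weights are stored as small integers plus a codebook.
codebook = [-0.5, 0.0, 0.25, 1.0]
indexes = [[3, 1], [0, 2]]
W_set = reconstruct_from_codebook(indexes, codebook)

# Precision quantization: weights stay directly usable, only coarser.
W_prec = round_to_fixed_point([[0.30, -0.12], [0.77, 0.05]], frac_bits=3)
print(W_set, W_prec)
```

The lookup step in the first helper is the reconstruction cost mentioned above; the second representation can be fed to the layer as is.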
4 Bayesian compression with scale mixtures of normals
Consider the following prior over a parameter w where its scale z is governed by a distribution <<p(z)>>:
<<FORMULA>>; (3)
with z^2 serving as the variance of the zero-mean normal distribution over w. By treating the scales
of w as random variables we can recover marginal prior distributions over the parameters that have
heavier tails and more mass at zero; this subsequently biases the posterior distribution over w to
be sparse. This family of distributions is known as scale-mixtures of normals [6,2] and it is quite
general, as a lot of well known sparsity inducing distributions are special cases.
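A quick simulation shows the two properties claimed for scale mixtures: more mass near zero and heavier tails than a plain normal. The sketch below assumes a half-Cauchy mixing density with unit scale (one of the choices discussed later in the paper); the helper names, sample size and the cut-offs 0.1 and 5 are arbitrary illustrative values.

```python
import math
import random

def half_cauchy(rng, scale=1.0):
    """Draw from a half-Cauchy via the absolute value of a Cauchy inverse CDF."""
    return scale * abs(math.tan(math.pi * (rng.random() - 0.5)))

def sample_scale_mixture(n, sample_scale, rng):
    """Draw w ~ N(0, z^2) with z ~ p(z), written as w = z * eps, eps ~ N(0, 1)."""
    return [sample_scale(rng) * rng.gauss(0.0, 1.0) for _ in range(n)]

def fraction(ws, predicate):
    return sum(1 for w in ws if predicate(w)) / len(ws)

rng = random.Random(0)
n = 20000
w_mix = sample_scale_mixture(n, half_cauchy, rng)
w_normal = [rng.gauss(0.0, 1.0) for _ in range(n)]

near_zero_mix = fraction(w_mix, lambda w: abs(w) < 0.1)
near_zero_normal = fraction(w_normal, lambda w: abs(w) < 0.1)
tail_mix = fraction(w_mix, lambda w: abs(w) > 5.0)
tail_normal = fraction(w_normal, lambda w: abs(w) > 5.0)
print(near_zero_mix, near_zero_normal, tail_mix, tail_normal)
```

Both fractions come out larger for the mixture, which is the behaviour that biases the posterior towards sparse solutions while still letting large weights survive.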
One example of the aforementioned framework is the spike-and-slab distribution [50], the gold
standard for sparse Bayesian inference. Under the spike-and-slab, the mixing density of the scales is a
Bernoulli distribution, thus the marginal <<p(w)>> has a delta “spike” at zero and a continuous “slab” over
the real line. Unfortunately, this prior leads to computationally expensive inference since we have
to explore a space of 2^M models, where M is the number of the model parameters. Dropout [29,67],
one of the most popular regularization techniques for neural networks, can be interpreted as positing a
spike and slab distribution over the weights where the variance of the “slab” is zero [17,45]. Another
example is the Laplace distribution which arises by considering <<FORMULA>>. The mode of
the posterior distribution under a Laplace prior is known as the Lasso [69] estimator and has been
previously used for sparsifying neural networks in [73,61]. While computationally simple, the
Lasso estimator is prone to “shrinking” large signals [8] and only provides point estimates about
the parameters. As a result it does not provide uncertainty estimates, it can potentially overfit and,
according to the bits-back argument, is inefficient for compression.
For these reasons, in this paper we will tackle the problem of compression and efficiency in neural
networks by adopting a Bayesian treatment and inferring an approximate posterior distribution over
the parameters under a scale mixture prior. We will consider two choices for the prior over the scales
p(z): the hyperparameter-free log-uniform prior [16,37] and the half-Cauchy prior, which results in
a horseshoe [8] distribution. Both of these distributions correspond to a continuous relaxation of the
spike-and-slab prior and we provide a brief discussion on their shrinkage properties in Appendix C.
4.1 Reparametrizing variational dropout for group sparsity
One potential choice for p(z) is the improper log-uniform prior [37] <<FORMULA>>. It turns out that
we can recover the log-uniform prior over the weights w if we marginalize over the scales z:
<<FORMULA>> (4)
This alternative parametrization of the log-uniform prior is known in the statistics literature as the
normal-Jeffreys prior and has been introduced by [16]. This formulation allows us to “couple” the
scales of weights that belong to the same group (e.g. neuron or feature map), by simply sharing the
corresponding scale variable z in the joint prior³:
<<FORMULA>>; (5)
where W is the weight matrix of a fully connected neural network layer with A being the dimen-
sionality of the input and B the dimensionality of the output. Now consider performing variational
inference with a joint approximate posterior parametrized as follows:
<<FORMULA>>; (6)
where <<\alpha_i>> is the dropout rate [67,37,51] of the given group. As explained in [37,51], the multiplicative
parametrization of the approximate posterior over z suffers from high variance gradients; therefore
we will follow [51] and re-parametrize it in terms of <<FORMULA>>, hence optimize w.r.t. <<\sigma^2_{z_i}>>.
The <<FORMULA>> lower bound under this prior and approximate posterior becomes:
<<FORMULA>> (7)
Under this particular variational posterior parametrization the negative KL-divergence from the
conditional prior <<p(W|z)>> to the approximate posterior <<q(W|z)>> is independent of z:
<<FORMULA>> (8)
This independence can be better understood if we consider a non-centered parametrization of the
prior [55]. More specifically, consider reparametrizing the weights as <<\tilde{w}_{ij} = w_{ij} / z_i>>; this will then result
in <<p(W|z)p(z) = p(\tilde{W})p(z)>>, where <<FORMULA>> and <<W = \mathrm{diag}(z)\tilde{W}>>. Now if
we perform variational inference under the <<p(\tilde{W})p(z)>> prior with an approximate posterior that has
the form of <<FORMULA>>, with <<FORMULA>>, then we see that we
arrive at the same expressions for the negative KL-divergence from the prior to the approximate
posterior. Finally, the negative KL-divergence from the normal-Jeffreys scale prior p(z) to the
Gaussian variational posterior q(z) depends only on the “implied” dropout rate, <<FORMULA>>, and
takes the following form [51]:
<<FORMULA>>; (9)
where <<FORMULA>> are the sigmoid and softplus functions respectively⁴ and k1 = 0.63576, k2 =
1.87320, k3 = 1.48695. We can now prune entire groups of parameters by simply specifying a threshold
for the variational dropout rate of the corresponding group, e.g. <<FORMULA>>. It should be
mentioned that this prior parametrization readily allows for a more flexible marginal posterior
over the weights as we now have a compound distribution, <<FORMULA>>; this
is in contrast to the original parametrization and the Gaussian approximations employed by [37,51].
Furthermore, this approach generalizes the low variance additive parametrization of variational
dropout proposed for weight sparsity in [51] to group sparsity (which was left as an open question
in [51]) in a principled way.
3 Strictly speaking the result of eq. 4 only holds when each weight has its own scale and not when that scale is
shared across multiple weights. Nevertheless, in practice we obtain a prior that behaves in a similar way, i.e. it
biases the variational posterior to be sparse.
4 <<FORMULA>>
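The pruning rule and the role of the constants k1, k2, k3 can be sketched as follows. Two assumptions are made because the corresponding equations did not survive extraction: the dropout rate is taken as alpha_i = sigma_z^2 / mu_z^2 (the usual variational dropout definition), and the negative KL term is taken in the form k1 * sigmoid(k2 + k3 * log alpha) - 0.5 * softplus(-log alpha) - k1, which is the approximation commonly paired with these constants. The threshold value 3.0 is purely illustrative.

```python
import math

K1, K2, K3 = 0.63576, 1.87320, 1.48695   # constants quoted in the text

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_dropout_rate(mu_z, sigma2_z):
    """log alpha_i under the assumed definition alpha_i = sigma_z^2 / mu_z^2."""
    return math.log(sigma2_z) - math.log(mu_z ** 2)

def neg_kl_scale(log_alpha):
    """Assumed approximation of the negative KL term for one scale variable."""
    return K1 * sigmoid(K2 + K3 * log_alpha) - 0.5 * softplus(-log_alpha) - K1

def keep_mask(mu_zs, sigma2_zs, threshold=3.0):
    """1 keeps a group, 0 prunes it once its log dropout rate reaches the threshold."""
    return [0 if log_dropout_rate(m, s2) >= threshold else 1
            for m, s2 in zip(mu_zs, sigma2_zs)]

# First group: confident, non-zero scale. Second group: scale drowned in noise.
mask = keep_mask(mu_zs=[1.0, 0.01], sigma2_zs=[0.01, 1.0])
print(mask, neg_kl_scale(-5.0), neg_kl_scale(0.0), neg_kl_scale(5.0))
```

Under this form the penalty vanishes as the dropout rate grows, so the prior rewards groups whose scale posterior is dominated by noise, and those are exactly the groups the mask removes.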
At test time, in order to have a single feedforward pass we replace the distribution over W at each
layer with a single weight matrix, the masked variational posterior mean:
<<FORMULA>>; (10)
where m is a binary mask determined according to the group variational dropout rate and <<M_W>> are
the means of <<q(\tilde{W})>>. We further use the variational posterior marginal variances⁵ for this particular
posterior approximation:
<<FORMULA>>; (11)
to assess the bit precision of each weight in the weight matrix. More specifically, we employed the
mean variance across the weight matrix <<\hat{W}>> to compute the unit round-off necessary to represent the
weights. This method will give us the number of significant bits, and by adding 3 exponent and 1 sign
bits we arrive at the final bit precision for the entire weight matrix <<\hat{W}>>⁶. We provide more details in
Appendix B.
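The test-time quantities can be sketched for a tiny layer. The masked mean and the marginal variance below follow from W = diag(z) W~ with independent Gaussian factors (the variance of a product of independent variables). The rule that turns the mean variance into a number of significant bits is an assumption of this sketch, since the actual rule is deferred to Appendix B: here we simply pick the smallest number of fractional bits whose unit round-off does not exceed the mean posterior standard deviation. All function names and numbers are illustrative.

```python
import math

def masked_posterior_mean(mask, mu_z, M_W):
    """Row i of the result is m_i * mu_z_i * (row i of M_W)."""
    return [[m * mz * w for w in row] for m, mz, row in zip(mask, mu_z, M_W)]

def marginal_variance(mu_z, sigma2_z, M_W, S2_W):
    """Var(z * w) = s2_z * (s2_w + m_w^2) + s2_w * m_z^2 for independent z, w."""
    return [[s2z * (s2w + mw ** 2) + s2w * mz ** 2
             for mw, s2w in zip(m_row, s_row)]
            for mz, s2z, m_row, s_row in zip(mu_z, sigma2_z, M_W, S2_W)]

def bit_precision(variances, exponent_bits=3, sign_bits=1):
    """Assumed rule: unit round-off 2^-b no larger than the mean posterior std."""
    flat = [v for row in variances for v in row]
    mean_std = math.sqrt(sum(flat) / len(flat))
    significant_bits = max(0, math.ceil(-math.log2(mean_std)))
    return significant_bits + exponent_bits + sign_bits

mu_z, sigma2_z = [1.0, 0.5], [0.01, 0.04]     # scale posterior per group (row)
M_W = [[0.5, -1.0], [2.0, 0.0]]               # means of the non-centered weights
S2_W = [[0.01, 0.01], [0.01, 0.01]]           # their variances
mask = [1, 0]                                 # second group pruned

W_hat = masked_posterior_mean(mask, mu_z, M_W)
V = marginal_variance(mu_z, sigma2_z, M_W, S2_W)
bits = bit_precision(V)
print(W_hat, V, bits)
```

The intuition matches the text: the noisier the posterior, the fewer bits are worth storing, because digits below the noise level carry no information.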
4.2 Group horseshoe with half-Cauchy scale priors
Another choice for p(z) is a proper half-Cauchy distribution: <<FORMULA>>; it
induces a horseshoe prior [8] distribution over the weights, which is a well known sparsity inducing
prior in the statistics literature. More formally, the prior hierarchy over the weights is expressed as
(in a non-centered parametrization):
<<FORMULA>>; (12)
where <<\tau_0>> is the free parameter that can be tuned for specific desiderata. The idea behind the horseshoe
is that of “global-local” shrinkage; the global scale variable s pulls all of the variables towards
zero whereas the heavy-tailed local variables <<z_i>> can compensate and allow for some weights to escape.
Instead of directly working with the half-Cauchy priors we will employ a decomposition of the
half-Cauchy that relies upon (inverse) gamma distributions [54] as this will allow us to compute
the negative KL-divergence from the scale prior p(z) to an approximate log-normal scale posterior
q(z) in closed form (the derivation is given in Appendix D). More specifically, we have that the
half-Cauchy prior can be expressed in a non-centered parametrization as:
<<FORMULA>>; (13)
where <<IG(\cdot,\cdot), G(\cdot,\cdot)>> correspond to the inverse Gamma and Gamma distributions in the scale
parametrization, and z follows a half-Cauchy distribution with scale k. Therefore we will re-express
the whole hierarchy as:
<<FORMULA>>; (14)
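The decomposition of the half-Cauchy into Gamma and inverse Gamma factors can be checked by simulation. Because the equation itself did not survive extraction, the shapes used below are an assumption: a ~ Gamma(shape 0.5, scale k^2), b ~ InverseGamma(shape 0.5, scale 1) and z = sqrt(a * b), which is the standard construction and gives z = k * |N1 / N2| for independent standard normals. The check relies on the fact that a half-Cauchy with scale k has median k.

```python
import math
import random
import statistics

def half_cauchy_via_gammas(k, rng):
    """Draw z = sqrt(a * b) with a ~ Gamma(0.5, scale=k^2), b ~ InvGamma(0.5, 1)."""
    a = rng.gammavariate(0.5, k ** 2)    # Gamma in the shape/scale parametrization
    g = rng.gammavariate(0.5, 1.0)
    b = 1.0 / max(g, 1e-300)             # inverse Gamma as the reciprocal of a Gamma draw
    return math.sqrt(a * b)

rng = random.Random(0)
k = 2.0
samples = [half_cauchy_via_gammas(k, rng) for _ in range(20000)]
median = statistics.median(samples)
print(median)    # should be close to k
```

The median is used instead of the mean because the half-Cauchy has no finite mean, so a sample average would not settle.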
It should be mentioned that the improper log-uniform prior is the limiting case of the horseshoe prior
when the shapes of the (inverse) Gamma hyperpriors on <<FORMULA>> go to zero [8]. In fact, several well
known shrinkage priors can be expressed in this form by altering the shapes of the (inverse) Gamma
hyperpriors [3]. For the variational posterior we will employ the following mean field approximation:
<<FORMULA>>.
Notice that the fact that we are using mean-field variational approximations (which we chose for simplicity)
can potentially underestimate the variance, thus lead to higher bit precisions for the weights. We leave the
exploration of more involved posteriors for future work.
Where <<LN(\cdot,\cdot)>> is a log-normal distribution. It should be mentioned that a similar form of non-centered
variational inference for the horseshoe has also been successfully employed for undirected
models in [33]. Notice that we can also apply local reparametrizations [37] when we are sampling
<<FORMULA>> and <<\sqrt{s_a s_b}>> by exploiting properties of the log-normal distribution⁷ and thus forming the
implied:
<<FORMULA>> (17)
As a threshold rule for group pruning we will use the negative log-mode⁸ of the local log-normal r.v.
<<FORMULA>>, i.e. prune when <<FORMULA>>, with <<FORMULA>>, <<FORMULA>> and <<FORMULA>>. This ignores
dependencies among the <<z_i>> elements induced by the common scale <<FORMULA>>, but nonetheless we found
that it works well in practice. Similarly to the group normal-Jeffreys prior, we will replace the
distribution over W at each layer with the masked variational posterior mean during test time:
<<FORMULA>>; (19)
where m is a binary mask determined according to the aforementioned threshold, <<M_W>> are the means
of <<q(\tilde{W})>> and <<\mu, \sigma^2>> are the means and variances of the local log-normals over <<FORMULA>>. Furthermore,
similarly to the group normal-Jeffreys approach, we will use the variational posterior marginal
variances:
<<FORMULA>>; (20)
to compute the final bit precision for the entire weight matrix W.
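The threshold rule for the horseshoe can be sketched using only properties of the log-normal distribution. If the local scale is the square root of a product of two independent log-normal variables, then its logarithm is normal with mean (mu_a + mu_b) / 2 and variance (s2_a + s2_b) / 4; the mode of LN(mu, s2) is exp(mu - s2), so the negative log-mode is s2 - mu. The names `mu_a`, `s2_a`, `mu_b`, `s2_b` for the two factors and the threshold value 3.0 are illustrative assumptions.

```python
def local_scale_lognormal(mu_a, s2_a, mu_b, s2_b):
    """Log-normal parameters of z = sqrt(a * b) for independent log-normal a, b."""
    return 0.5 * (mu_a + mu_b), 0.25 * (s2_a + s2_b)

def negative_log_mode(mu, s2):
    """-log(mode) of LN(mu, s2); the mode is exp(mu - s2)."""
    return s2 - mu

def negative_log_mean(mu, s2):
    """-log(mean) of LN(mu, s2); the mean is exp(mu + s2 / 2). Shown for comparison."""
    return -(mu + 0.5 * s2)

def keep_mask(groups, threshold=3.0):
    """1 keeps a group, 0 prunes it when its negative log-mode reaches the threshold."""
    mask = []
    for mu_a, s2_a, mu_b, s2_b in groups:
        mu, s2 = local_scale_lognormal(mu_a, s2_a, mu_b, s2_b)
        mask.append(0 if negative_log_mode(mu, s2) >= threshold else 1)
    return mask

groups = [(0.0, 0.1, 0.0, 0.1),      # scale concentrated around 1: keep
          (-6.0, 2.0, -6.0, 2.0)]    # scale concentrated near 0: prune
mask = keep_mask(groups)
print(mask)
```

A larger posterior variance raises the negative log-mode but lowers the negative log-mean, which is one way to see why the two criteria can separate the groups differently.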
5 Experiments
We validated the compression and speed-up capabilities of our models on the well-known architectures
of LeNet-300-100 [41], LeNet-5-Caffe⁹ on MNIST [42] and, similarly to [51], VGG [63]¹⁰ on
CIFAR 10 [38]. The groups of parameters were constructed by coupling the scale variables for each
filter for the convolutional layers and for each input neuron for the fully connected layers. We provide
the algorithms that describe the forward pass using local reparametrizations for fully connected
and convolutional layers with each of the employed approximate posteriors in Appendix F. For the
horseshoe prior we set the scale <<\tau_0>> of the global half-Cauchy prior to a reasonably small value, e.g.
<<\tau_0 = 1e-5>>. This further increases the prior mass at zero, which is essential for sparse estimation
and compression. We also found that constraining the standard deviations as described in [46] and
“warm-up” [65] helps in avoiding bad local optima of the variational objective. Further details about
the experimental setup can be found in Appendix A. Determining the threshold for pruning can be
easily done with manual inspection as usually there are two well separated clusters (signal and noise).
We provide a sample visualization in Appendix E.
5.1 Architecture learning & bit precisions
We will first demonstrate the group sparsity capabilities of our methods by illustrating the learned
architectures in Table 1, along with the inferred bit precision per layer. As we can observe, our
methods infer significantly smaller architectures for the LeNet-300-100 and LeNet-5-Caffe, compared
to Sparse Variational Dropout, Generalized Dropout and Group Lasso. Interestingly, we observe
that for the VGG network almost all of the big 512 feature map layers are drastically reduced to around
10 feature maps whereas the initial layers are mostly kept intact. Furthermore, all of the Bayesian
methods considered require far fewer than the standard 32 bits per layer to represent the weights,
sometimes even allowing for 5 bit precisions.
7 The product of log-normal r.v.s is another log-normal and a power of a log-normal r.v. is another log-normal.
8 Empirically, it slightly better separates the scales compared to the negative log-mean <<FORMULA>>.
9 https://github.com/BVLC/caffe/tree/master/examples/mnist
10 The adapted CIFAR 10 version described at http://torch.ch/blog/2015/07/30/cifar.html.
Table 1: Learned architectures with Sparse VD [51], Generalized Dropout (GD) [66] and Group
Lasso (GL) [73]. Bayesian Compression (BC) with group normal-Jeffreys (BC-GNJ) and group
horseshoe (BC-GHS) priors correspond to the proposed models. We show the number of neurons left
after pruning along with the average bit precisions for the weights at each layer.
<<TABLE>>
5.2 Compression Rates
For the actual compression task we compare our method to current work in three different scenarios:
(i) compression achieved only by pruning, where for non-group methods we use the CSC format
to store parameters; (ii) compression based on the former but with reduced bit precision per layer
(only for the weights); and (iii) the maximum compression rate as proposed by [25]. We believe
these to be relevant scenarios because (i) can be applied with already existing frameworks such as
Tensorflow [1], and (ii) is a practical scheme given that upcoming GPUs and frameworks will be designed to
work with low and mixed precision arithmetic [43,23]. For (iii), we perform k-means clustering on
the weights with k=32 and consequently store a weight index that points to a codebook of available
weights. Note that the latter achieves the highest compression rate but is fairly impractical at
test time since the original matrix needs to be restored for each layer. As we can observe in Table 2,
our methods are competitive with the state of the art for LeNet-300-100 while offering significantly
better compression rates on the LeNet-5-Caffe architecture, without any loss in accuracy. Do note
that group sparsity and weight sparsity can be combined so as to further prune some weights when a
particular group is not removed, thus we can potentially further boost compression performance on
e.g. LeNet-300-100. For the VGG network we observe that training from a random initialization
yielded consistently lower accuracy (around 1%-2% lower) compared to initializing the means of the
approximate posterior from a pretrained network, similarly to [51], thus we only report the latter
results¹¹. After initialization we trained the VGG network regularly for 200 epochs using Adam with
the default hyperparameters. We observe a small drop in accuracy for the final models when using
the deterministic version of the network for prediction, but nevertheless averaging across multiple
samples restores the original accuracy. Note that in general we can maintain the original accuracy on
VGG without sampling by simply finetuning with a small learning rate, as done in [51]. This will
still induce (less) sparsity but unfortunately it does not lead to good compression as the bit precision
remains very high due to not appropriately increasing the marginal variances of the weights.
Table 2: Compression results for our methods. “DC” corresponds to the Deep Compression method
introduced in [25], “DNS” to the method of [21] and “SWS” to the Soft-Weight Sharing of [70].
Numbers marked with * are best case guesses.
<<TABLE>>
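The arithmetic behind the three scenarios can be sketched with a back-of-the-envelope calculation. The pruned layer sizes and the 8-bit precision below are hypothetical values chosen for illustration and are not results from Table 2; biases and the (small) cost of storing a codebook are ignored. The only figures taken from the text are the LeNet-300-100 layout, the 32-bit baseline and k=32 for the codebook.

```python
import math

def dense_weights(layer_sizes):
    """Number of weights in a fully connected net with the given layer widths."""
    return sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

def compression_rate(orig_weights, kept_weights, bits_per_weight, orig_bits=32):
    """Ratio of original to compressed storage, counting weights only."""
    return (orig_weights * orig_bits) / (kept_weights * bits_per_weight)

original = dense_weights([784, 300, 100, 10])    # LeNet-300-100 on MNIST
pruned = dense_weights([300, 20, 15, 10])        # hypothetical pruned architecture

rate_pruning = compression_rate(original, pruned, bits_per_weight=32)   # scenario (i)
rate_low_bits = compression_rate(original, pruned, bits_per_weight=8)   # scenario (ii)
index_bits = math.ceil(math.log2(32))            # k=32 codebook entries need 5-bit indexes
rate_codebook = compression_rate(original, pruned, bits_per_weight=index_bits)  # scenario (iii)
print(original, pruned, rate_pruning, rate_low_bits, index_bits, rate_codebook)
```

Because group pruning leaves dense matrices, no per-weight index has to be added to these counts, unlike the CSC storage needed for weight-pruned baselines.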
5.3 Speed and energy consumption
We demonstrate that our method is competitive with [73], denoted as GL, a method that explicitly
prunes convolutional kernels to reduce compute time. We measure the time and energy consumption
of one forward pass of a mini-batch with batch size 8192 through LeNet-5-Caffe. We average over 10^4
forward passes and all experiments were run with Tensorflow 1.0.1, CUDA 8.0 and the respective cuDNN.
We use 16 CPUs run in parallel (CPU) or a Titan X (GPU). Note that we only use the pruned
architecture, as lower bit precision would further increase the speed-up but is not implementable in
any common framework. Further, all methods we compare to in the latter experiments would barely
show an improvement at all since they do not learn to prune groups but only parameters. We present
our results in Figure 1. As expected, the largest effect on the speed-up is caused by GPU usage.
However, both our models and the best competing models reach a speed-up factor of around 8x. We
can further save about 3x energy costs by applying our architecture instead of the original one on a
GPU. For larger networks the speed-up is even higher: for the VGG experiments with batch size 256
we have a speed-up factor of 51x.
<<FIGURE>>
Figure 1: Left: Average time a batch of 8192 samples takes to pass through LeNet-5-Caffe. Numbers on
top of the bars represent the speed-up factor relative to the CPU implementation of the original network.
Right: Energy consumption of the GPU for the same process (when run on GPU).
6 Conclusion
We introduced Bayesian compression, a way to tackle efficiency and compression in deep neural
networks in a unified and principled way. Our proposed methods allow for theoretically principled
compression of neural networks, improved energy efficiency with reduced computation while naturally
learning the bit precisions for each weight. This serves as a strong argument in favor of Bayesian
methods for neural networks, when we are concerned with compression and speed up.
11 We also tried to finetune the same network with Sparse VD, but unfortunately it increased the error
considerably (around 3% extra error), therefore we do not report those results.
8 Acknowledgments
We would like to thank Dmitry Molchanov, Dmitry Vetrov, Klamer Schutte and Dennis Koelma for
valuable discussions and feedback. This research was supported by TNO, NWO and Google.
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv
preprint arXiv:1603.04467, 2016.
[2] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical
Society. Series B (Methodological), pages 99–102, 1974.
[3] A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of gaussians. In Advances in neural
information processing systems, pages 523–531, 2011.
[4] E. Azarkhish, D. Rossi, I. Loi, and L. Benini. Neurostream: Scalable and energy efficient deep learning
with smart memory cubes. arXiv preprint arXiv:1701.06420, 2017.
[5] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural information processing
systems, pages 2654–2662, 2014.
[6] E. Beale, C. Mallows, et al. Scale mixing of symmetric distributions with zero means. The Annals of
Mathematical Statistics, 30(4):1145–1151, 1959.
[7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks.
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11
July 2015, 2015.
[8] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97
(2):465–480, 2010.
[9] S. Chai, A. Raghavan, D. Zhang, M. Amer, and T. Shields. Low precision neural networks using subband
decomposition. arXiv preprint arXiv:1703.08595, 2017.
[10] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural
networks. arXiv preprint arXiv:1506.04449, 2015.
[11] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations
constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
[12] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision
multiplications. arXiv preprint arXiv:1412.7024, 2014.
[13] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary
weights during propagations. In Advances in Neural Information Processing Systems, pages 3105–3113,
2015.
[14] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in
Neural Information Processing Systems, pages 2148–2156, 2013.
[15] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference
complexity. arXiv preprint arXiv:1703.08651, 2017.
[16] M. A. Figueiredo. Adaptive sparseness using jeffreys prior. Advances in neural information processing
systems, 1:697–704, 2002.
[17] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep
learning. ICML, 2016.
[18] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector
quantization. ICLR, 2015.
[19] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information
Processing Systems, pages 2348–2356, 2011.
[20] P. D. Grünwald. The minimum description length principle. MIT press, 2007.
[21] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances In Neural
Information Processing Systems, pages 1379–1387, 2016.
[22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical
precision. CoRR, abs/1502.02551, 392, 2015.
[23] P. Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks. Master's thesis,
University of California, 2016.
[24] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks.
In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[25] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. ICLR, 2016.
[26] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on
imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages
1026–1034, 2015.
[27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531, 2015.
[28] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length
of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages
5–13. ACM, 1993.
[29] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[30] A. Honkela and H. Valpola. Variational learning and bits-back coding: an information-theoretic view to
bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.
[31] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam.
Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[32] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level
accuracy with 50x fewer parameters and < 0.5 mb model size. ICLR, 2017.
[33] J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint
arXiv:1602.03807, 2016.
[34] T. Karaletsos and G. Rätsch. Automatic relevance determination for deep generative models. arXiv preprint
arXiv:1505.07765, 2015.
[35] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning
Representations (ICLR), San Diego, 2015.
[36] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning
Representations (ICLR), 2014.
[37] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick.
Advances in Neural Information Processing Systems, 2015.
[38] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
[39] N. D. Lawrence. Note relevance determination. In Neural Nets WIRN Vietri-01, pages 128–133. Springer,
2002.
[40] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS,
volume 2, pages 598–605, 1989.
[41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[42] Y. LeCun, C. Cortes, and C. J. Burges. The mnist database of handwritten digits, 1998.
[43] D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks.
Workshop ICML, 2016.
[44] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks.
arXiv preprint arXiv:1511.06393, 2015.
[45] C. Louizos. Smart regularization of deep architectures. Master's thesis, University of Amsterdam, 2015.
[46] C. Louizos and M. Welling. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.
ArXiv e-prints, Mar. 2017.
[47] D. J. MacKay. Probable networks and plausible predictions—a review of practical bayesian methods for
supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
[48] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with
fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
[49] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to
weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981, 2016.
[50] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the
American Statistical Association, 83(404):1023–1032, 1988.
[51] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv
preprint arXiv:1701.05369, 2017.
[52] E. Nalisnick, A. Anandkumar, and P. Smyth. A scale mixture perspective of multiplicative noise in neural
networks. arXiv preprint arXiv:1506.03208, 2015.
[53] R. M. Neal. Bayesian learning for neural networks. PhD thesis, Citeseer, 1995.
[54] S. E. Neville, J. T. Ormerod, M. Wand, et al. Mean field variational bayes for continuous sparse signal
shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113–1151, 2014.
[55] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of
hierarchical models. Statistical Science, pages 59–73, 2007.
[56] C. Peterson. A mean field theory learning algorithm for neural networks. Complex systems, 1:995–1019,
1987.
[57] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary
convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer,
2016.
[58] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in
deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML
2014, Beijing, China, 21-26 June 2014, pages 1278–1286, 2014.
[59] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[60] J. Rissanen. Stochastic complexity and modeling. The annals of statistics, pages 1080–1100, 1986.
[61] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural
networks. arXiv preprint arXiv:1607.00485, 2016.
[62] S. Shi and X. Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units.
arXiv preprint arXiv:1704.07724, 2017.
[63] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
ICLR, 2015.
[64] M. Sites. Ieee standard for floating-point arithmetic. 2008.
[65] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders.
arXiv preprint arXiv:1602.02282, 2016.
[66] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
[67] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to
prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958,
2014.
[68] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and
survey. arXiv preprint arXiv:1703.09039, 2017.
[69] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
Series B (Methodological), pages 267–288, 1996.
[70] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. ICLR, 2017.
[71] G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision
and sparsity. arXiv preprint arXiv:1610.00324, 2016.
[72] C. S. Wallace. Classification by minimum-message-length inference. In International Conference on
Computing and Information, pages 72–81. Springer, 1990.
[73] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In
Advances In Neural Information Processing Systems, pages 2074–2082, 2016.
[74] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using
energy-aware pruning. CVPR, 2017.
[75] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[76] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. ICLR, 2017.
Appendix
A. Detailed experimental setup
We implemented our methods in Tensorflow [1] and optimized the variational parameters using
Adam [35] with the default hyperparameters. The means of the conditional Gaussian <<q(W|z)>>
were initialized with the scheme proposed in [26], whereas the logs of the standard deviations were
initialized by sampling from N(−9, 1e−4). The parameters of q(z) were initialized such that the
overall mean of z is approximately 1 and the overall variance is very low (1e−8); this ensures that all of the
groups are active during the initial training iterations.
Table 3: Floating point formats
<<TABLE>>
As for the standard deviation constraints: for the LeNet-300-100 architecture we constrained the
standard deviation of the first layer to be ≤ 0.2, whereas for LeNet-5-Caffe we constrained
the standard deviation of the first layer to be ≤ 0.5. The remaining standard deviations were left
unconstrained. For the VGG network we constrained the standard deviations of the 64 and 128
feature map layers to be ≤ 0.1 and the standard deviations of the 256 feature map layers to be ≤ 0.2,
and left the rest of the standard deviations unconstrained. We also found it beneficial to incorporate
“warm-up” [65], i.e., we annealed the negative KL-divergence from the prior to the approximate
posterior with a linear schedule for the first 100 epochs. We initialized the means of the approximate
posterior with the weights and biases obtained from a VGG network trained with batch normalization
and dropout on CIFAR 10. For our method we disabled batch normalization during training.
As for preprocessing the data: for MNIST the only preprocessing we did was to rescale the digits to
lie in the [−1, 1] range, and for CIFAR 10 we used the preprocessed dataset provided by [75].
Furthermore, note that by pruning a given filter at a particular convolutional layer we can also
prune the parameters corresponding to that feature map for the next layer. This similarly holds for
fully connected layers; if we drop a given input neuron then the weights corresponding to that node
from the previous layer can also be pruned.
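The coupling between consecutive layers described above can be illustrated with a minimal NumPy sketch. The function name `prune_layer_pair` and the boolean `keep_mask` are hypothetical names introduced here, and the weight layout (output units on the last axis of the first layer, input units on the second-to-last axis of the next layer) is an assumption that matches both a dense `[in, out]` matrix and a `[height, width, in, out]` convolution kernel.

```python
import numpy as np

def prune_layer_pair(W1, W2, keep_mask):
    """Remove pruned units from a layer and from the layer that consumes them.

    W1: weights producing the units (units on the last axis).
    W2: weights of the next layer (same units on the second-to-last axis).
    keep_mask: boolean vector, True for units that survive pruning.
    """
    keep = np.flatnonzero(keep_mask)
    W1_pruned = W1[..., keep]               # drop the pruned filters / neurons
    W2_pruned = np.take(W2, keep, axis=-2)  # drop the matching inputs downstream
    return W1_pruned, W2_pruned

# Usage on a pair of convolution kernels with 8 intermediate feature maps.
mask = np.array([True, False, True, True, False, False, True, True])
K1, K2 = prune_layer_pair(np.zeros((3, 3, 2, 8)), np.zeros((3, 3, 8, 16)), mask)
print(K1.shape, K2.shape)
```

For dense layers with a ReLU in between, the pruned pair computes the same output as the original pair with the pruned units set to zero, which is why the downstream parameters can be discarded without further loss.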
B. Standards for Floating-Point Arithmetic
Floating-point values eventually need to be represented in a binary basis in a computer. The most
common standard today is the IEEE 754-2008 convention [64]. It defines x-bit base-2 formats,
officially referred to as binaryx, with x ∈ {16, 32, 64, 128}. The formats are also widely known as
half, single, double and quadruple precision floats, respectively, and are used in almost all programming
languages as a standard. The format considers 3 kinds of bits: one sign bit, w exponent bits and p
precision bits.
<<FIGURE>>
Figure 2: A symbolic representation of the binaryx format [64].
The sign bit determines the sign of the number to be represented. The exponent E is a w-bit signed
integer, e.g. for single precision w = 8 and thus E ∈ [−127, 128]. In practice, the range of exponents
is smaller since the first and the last number are reserved for special numbers. The true significand or
mantissa includes t bits on the right of the binary point. There is an implicit leading bit with value
one. A value is consequently decomposed as follows
<<FORMULA>> (21)
<<FORMULA>> (22)
In Table 3, we summarize common and less common floating point formats.
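As an illustration of the sign / exponent / significand decomposition, the following sketch splits a binary32 (single precision) value into its three bit fields using only the Python standard library. The helper names `decompose_float32` and `recompose_float32` are hypothetical, and the recomposition is written for normal numbers only (it ignores zeros, subnormals, infinities and NaNs).

```python
import struct

def decompose_float32(x):
    """Return (sign, biased exponent, stored fraction bits) of x as binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # w = 8 exponent bits, stored with bias 127
    fraction = bits & 0x7FFFFF       # t = 23 bits right of the binary point
    return sign, exponent, fraction

def recompose_float32(sign, exponent, fraction):
    """Rebuild a normal number; the leading bit with value one is implicit."""
    return (-1.0) ** sign * (1.0 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

print(decompose_float32(0.15625))   # 0.15625 = 1.25 * 2^-3
```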
There is, however, the possibility to design a self-defined format. There are 3 important quantities
when choosing the right specification: overflow, underflow and unit round-off, also known as machine
precision. Each one can be computed knowing the number of exponent and significand bits. In
our work, for example, we consider a format that uses significantly fewer exponent bits, since network
parameters usually vary between [−10, 10]. We set the unit round-off equal to the precision and thus
can compute the significand bits necessary to represent a specific weight.
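A small sketch of how these three quantities follow from the bit counts is given below. It assumes an IEEE-style layout (exponent bias 2^(w−1) − 1, the all-zeros and all-ones exponents reserved, an implicit leading bit, round-to-nearest so that the unit round-off is 2^−(t+1)); these conventions, and the function names, are assumptions of this sketch rather than the exact format used in the paper.

```python
import math

def format_limits(w, t):
    """Overflow, underflow and unit round-off for w exponent and t fraction bits.

    Uses Python floats, so it only works for formats inside the double range.
    """
    bias = 2 ** (w - 1) - 1
    e_max, e_min = bias, 1 - bias                   # exponents of normal numbers
    overflow = (2.0 - 2.0 ** (-t)) * 2.0 ** e_max   # largest finite value
    underflow = 2.0 ** e_min                        # smallest positive normal value
    unit_roundoff = 2.0 ** (-(t + 1))
    return overflow, underflow, unit_roundoff

def significand_bits_for(precision):
    """Smallest t whose unit round-off 2^-(t+1) does not exceed `precision`."""
    return max(0, math.ceil(-math.log2(precision) - 1))

print(format_limits(5, 10))        # half precision
print(significand_bits_for(0.01))  # bits needed for a tolerated error of 0.01
```

Under these assumptions, a weight whose tolerated error is large needs only a handful of significand bits, which is the effect the learned posterior variances are meant to exploit.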
Beyond designing a tailored floating point format for deep learning, recent work also explored the
possibility of deep learning with mixed formats [43,23]. For example, imagine the activations having
high precision while weights can be low precision.
C. Shrinkage properties of the normal-Jeffreys and horseshoe priors
<<FIGURE>>
Figure 3: Comparison of the behavior of the log-uniform / normal-Jeffreys (NJ) prior and the
horseshoe (HS) prior (where s = 1). Both priors behave similarly at zero but the normal-Jeffreys has
an extremely heavy tail (thus making it non-normalizable).
In this section we provide some insights into the behavior of each of the priors we employ by
following the excellent analysis of [8]; we can perform a change of variables and express the scale
mixture distribution of eq. 3 in the main paper in terms of a shrinkage coefficient κ,
<<FORMULA>> (23)
It is easy to observe that eq. 23 corresponds to a continuous relaxation of the spike-and-slab prior:
when κ = 0 we have that <<FORMULA>>, i.e. no shrinkage/regularization for w, when
κ = 1 we have that <<FORMULA>>, i.e. w is exactly zero, and when <<=1>> we have that <<FORMULA>>. Now by examining the implied prior on the shrinkage coefficient for both
the log-uniform and the horseshoe priors we can better study their behavior. As explained in [8],
the half-Cauchy prior on z corresponds to a beta prior on the shrinkage coefficient, <<FORMULA>>,
whereas the normal-Jeffreys / log-uniform prior on z corresponds <<top() =B(;)>> with <<FORMULA>>.
The densities of both of these distributions can be seen in Figure 3b. As we can observe, the
log-uniform prior posits a distribution that concentrates almost all of its mass at either 0 or 1,
essentially either pruning the parameter or keeping it close to the maximum likelihood estimate due
to <<FORMULA>>. In contrast, the horseshoe prior maintains enough probability mass for
the in-between values of κ and thus can, potentially, offer better regularization and generalization.
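The correspondence between a half-Cauchy scale and a beta-distributed shrinkage coefficient can be checked numerically. The sketch below assumes the standard horseshoe definition κ = 1 / (1 + z²) with a unit-scale half-Cauchy prior on z (eq. 23 itself is not reproduced here, so this definition is an assumption), under which κ follows a Beta(1/2, 1/2) distribution whose CDF is (2/π) arcsin(√x). The function names are hypothetical.

```python
import math
import random

def sample_shrinkage(n, seed=0):
    """Draw z ~ half-Cauchy(0, 1) and return kappa = 1 / (1 + z^2)."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(n):
        z = abs(math.tan(math.pi * (rng.random() - 0.5)))  # inverse-CDF sampling
        kappas.append(1.0 / (1.0 + z * z))
    return kappas

def beta_half_half_cdf(x):
    """CDF of Beta(1/2, 1/2), the arcsine distribution."""
    return (2.0 / math.pi) * math.asin(math.sqrt(x))

kappas = sample_shrinkage(100000)
near_zero = sum(k < 0.1 for k in kappas) / len(kappas)
print(near_zero, beta_half_half_cdf(0.1))  # empirical vs. analytic mass below 0.1
```

The empirical mass near κ = 0 and near κ = 1 matches the U-shaped beta density, while a noticeable fraction of samples remains at in-between values, in line with the discussion of the horseshoe prior above.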
D. Negative KL-divergences for log-normal approximating posteriors
Let <<FORMULA>> be a log-normal approximating posterior. Here we will derive the negative
KL-divergences to q(z) from inverse gamma, gamma and half-normal distributions.
Let p(z) be an inverse gamma distribution, i.e. <<p(z) =IG(;)>>. The negative KL-divergence can
be expressed as follows:
<<FORMULA>> (24)
The second term is the entropy of the log-normal distribution which has the following form:
<<FORMULA>> (25)
The first term is the negative cross-entropy of the log-normal approximate posterior from the inverse-
Gamma prior:
<<FORMULA>> (26)
<<FORMULA>> (27)
Since the natural logarithm of a log-normal distribution <<FORMULA>> follows a normal distribution
<<FORMULA>> we have that <<FORMULA>>. Furthermore we have that <<FORMULA>> then <<FORMULA>>, therefore
<<FORMULA>>. Putting everything together we have that:
<<FORMULA>> (28)
Therefore the negative KL-divergence is:
<<FORMULA>> (29)
Now let p(z) be a Gamma prior, i.e. <<p(z) =G(;)>>. We have that the negative cross-entropy
changes to:
<<FORMULA>> (30)
<<FORMULA>> (31)
<<FORMULA>> (32)
Therefore the negative KL-divergence is:
<<FORMULA>> (33)
Now, by employing the aforementioned we can express the negative KL-divergence from
<<FORMULA>> to <<FORMULA>> as follows:
<<FORMULA>>
with the KL-divergence for the weight distribution <<q (W~)>> given by eq. 8 in the main paper.
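The closed forms derived in this section can be sketched in code as the sum of a negative cross-entropy and the log-normal entropy. The sketch assumes q(z) = LN(μ, σ²), an inverse gamma density p(z) ∝ z^(−α−1) exp(−β/z) and a gamma density in the shape/rate form p(z) ∝ z^(α−1) exp(−βz); the paper's exact parametrization is not visible in this extract, so these choices and the function names are assumptions. It uses E[log z] = μ, E[1/z] = exp(−μ + σ²/2) and E[z] = exp(μ + σ²/2) under the log-normal.

```python
import math

def lognormal_entropy(mu, sigma):
    """Entropy of LN(mu, sigma^2): mu + 0.5 * log(2 * pi * e * sigma^2)."""
    return mu + 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def neg_kl_inverse_gamma(mu, sigma, alpha, beta):
    """-KL(q || p) for q = LN(mu, sigma^2) and p = IG(alpha, beta)."""
    neg_cross_entropy = (alpha * math.log(beta) - math.lgamma(alpha)
                         - (alpha + 1.0) * mu
                         - beta * math.exp(-mu + 0.5 * sigma ** 2))
    return neg_cross_entropy + lognormal_entropy(mu, sigma)

def neg_kl_gamma(mu, sigma, alpha, beta):
    """-KL(q || p) for q = LN(mu, sigma^2) and p = Gamma(alpha, rate=beta)."""
    neg_cross_entropy = (alpha * math.log(beta) - math.lgamma(alpha)
                         + (alpha - 1.0) * mu
                         - beta * math.exp(mu + 0.5 * sigma ** 2))
    return neg_cross_entropy + lognormal_entropy(mu, sigma)

print(neg_kl_inverse_gamma(0.0, 0.5, 2.0, 1.0))
print(neg_kl_gamma(0.0, 0.5, 2.0, 1.0))
```

Both quantities are non-positive, as expected for a negative KL-divergence, and can be compared against a Monte Carlo estimate of E_q[log p(z) − log q(z)] as a sanity check.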
E. Visualizations
<<FIGURE>>
Figure 4: Distribution of the thresholds for the Sparse Variational Dropout 4a, Bayesian Compression
with group normal-Jeffreys (BC-GNJ) 4b and group Horseshoe (BC-GHS) 4c priors for the three
layer LeNet-300-100 architecture. It is easily observed that there are usually two well separable
groups with BC-GNJ and BC-GHS, thus making the choice for the threshold easy. Smaller values
indicate signal whereas larger values indicate noise (i.e. useless groups).
<<FIGURE>>
Figure 5: Distribution of the bit precisions for the Sparse Variational Dropout 5a, Bayesian
Compression with group normal-Jeffreys (BC-GNJ) 5b and group Horseshoe (BC-GHS) 5c priors for the
three layer LeNet-300-100 architecture. All of the methods usually require far fewer than 32 bits for
the weights.
F. Algorithms for the feedforward pass
Algorithms 1, 2, 3, 4 describe the forward pass using local reparametrizations for fully connected and
convolutional layers with the approximate posteriors for the Bayesian Compression (BC) with group
normal-Jeffreys (BC-GNJ) and group Horseshoe (BC-GHS) priors employed in the experiments. For
the fully connected layers we couple the scales for each input neuron, whereas for the convolutional
layers we couple the scales for each output feature map. M_w, Σ_w are the means and variances of each layer,
H is a minibatch of activations of size K. For the first layer we have that H = X, where X is the
minibatch of inputs. For the convolutional layers N_f is the number of convolutional filters, ∗ is the
convolution operator and we assume the [batch, height, width, feature maps] convention.
Algorithm 1 Fully connected BC-GNJ layer h.
<<ALGORITHM>>
Algorithm 2 Convolutional BC-GNJ layer h.
<<ALGORITHM>>
Algorithm 3 Fully connected BC-GHS layer h.
<<ALGORITHM>>
Algorithm 4 Convolutional BC-GHS layer h.
<<ALGORITHM>>
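Since the algorithm listings themselves are not reproduced above, the following is only a hedged NumPy sketch of what a local-reparametrization forward pass for a fully connected layer with one scale per input neuron might look like. It assumes a Gaussian posterior over the scales z with means `mu_z` and standard deviations `sigma_z`; the function name, argument names and this choice of posterior are assumptions of the sketch, not a transcription of Algorithm 1.

```python
import numpy as np

def fc_local_reparam_forward(H, mu_z, sigma_z, M_w, S_w, rng):
    """Sample the output of a fully connected layer with group scales.

    H: (K, n_in) minibatch of activations.
    mu_z, sigma_z: (n_in,) posterior mean / std of the per-input-neuron scales.
    M_w, S_w: (n_in, n_out) means and variances of the weights.
    """
    K, n_in = H.shape
    # One scale sample per example and input neuron.
    Z = mu_z + sigma_z * rng.standard_normal((K, n_in))
    H_hat = H * Z
    # Local reparametrization: sample the pre-activations, not the weights.
    mean_h = H_hat @ M_w
    var_h = (H_hat ** 2) @ S_w
    return mean_h + np.sqrt(var_h) * rng.standard_normal(mean_h.shape)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
out = fc_local_reparam_forward(H, np.ones(3), 0.1 * np.ones(3),
                               rng.standard_normal((3, 2)), 0.01 * np.ones((3, 2)), rng)
print(out.shape)
```

Sampling the pre-activations keeps the noise independent across the minibatch at the cost of two matrix products per layer instead of one.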
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Channel Pruning for Accelerating Very Deep Neural Networks
Yihui He* Xiangyu Zhang Jian Sun
Xi'an Jiaotong University Megvii Inc. Megvii Inc.
Xi'an, 710049, China Beijing, 100190, China Beijing, 100190, China
heyihui@stu.xjtu.edu.cn zhangxiangyu@megvii.com sunjian@megvii.com
Abstract
In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5× speed-up along with only a 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet and Xception, and suffers only 1.4% and 1.0% accuracy loss under a 2× speed-up respectively, which is significant.
1. Introduction
Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one.
Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, feature map width (number of channels) cannot be reduced, which makes it difficult to decompose the 1 × 1 convolutional layers favored by modern networks (e.g., GoogLeNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an “irregular” shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.
<<FIGURE>>
Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).
Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited.
In this paper, we propose a new inference-time approach for channel pruning, utilizing inter-channel redundancy. Inspired by the improvement of tensor factorization through feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternating steps: channel selection and feature map reconstruction. In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternate between the two steps. Further, we approximate the network layer-by-layer, with accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]).
<<FIGURE>>
Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation where two channels are pruned for feature map B. Thus the corresponding channels of filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C, k_h × k_w: kernel size.
For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms the previous state of the art. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4% and 1.0% accuracy loss respectively.
2. Related Work
There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22].
Optimized implementation based methods [35, 47, 27, 4] accelerate convolution, with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity.
Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers up to 50×. However, in practice, the actual speed-up may depend heavily on the implementation.
Tensor factorization [22, 28, 13, 24] decomposes weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into a combination of 3 × 3 and 1 × 1 layers, driven by feature map redundancy.
Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and the effectiveness for very deep networks on large datasets is rarely exploited.
Inference-time channel pruning is challenging, as reported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, which mainly operates on the fully connected layers. Data-free approaches [31, 3] have not reported results for high speed-up ratios (e.g., 5×), and require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible for very deep models and large datasets. From our observation, [31] is sometimes even worse than the naive solution (Sec. 4.1.1).
3. Approach
In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.
3.1. Formulation
Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction. We need to reconstruct the following feature maps using the selected channels.
Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternate between the two steps.
Formally, to prune a feature map with c channels, we consider applying n × c × k_h × k_w convolutional filters W on <<FORMULA>> input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and k_h, k_w are the kernel size. For simple representation, the bias term is not included in our formulation. To prune the input channels from c to the desired <<FORMULA>>, while minimizing reconstruction error, we formulate our problem as follows:
<<FORMULA>> (1)
‖·‖_F is the Frobenius norm. <<FORMULA>> matrix sliced from the ith channel of input volumes X_i, i = 1, ..., c. W_i is the n × k_h k_w filter weights sliced from the ith channel of W. β is the coefficient vector of length c for channel selection, and β_i is the ith entry of β. Notice that if β_i = 0, X_i will no longer be useful and can be safely pruned from the feature map. W_i can also be removed. Optimization. Solving the minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the l_0 to l_1 regularization:
<<FORMULA>> (2)
λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add the constraint ∀i ‖W_i‖_F = 1 to this formulation, which avoids the trivial solution.
Now we solve this problem in two steps. First, we fix W and solve β for channel selection. Second, we fix β and solve W to minimize reconstruction error.
(i) The subproblem of β. In this case, W is fixed. We solve β for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection.
<<FORMULA>> (3)
Here Z_i = X_i W_i^T (size N × n). We will ignore the ith channel if β_i = 0.
(ii) The subproblem of W. In this case, β is fixed. We utilize the selected channels to minimize reconstruction error. We can find the optimal solution by least squares:
<<FORMULA>>. (4)
Here <<FORMULA>> (size N × c k_h k_w). W' is the n × c k_h k_w reshaped W, <<FORMULA>>. After obtaining the result W', it is reshaped back to W. Then we assign <<FORMULA>>. The constraint <<FORMULA>> is then satisfied.
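Subproblem (ii) can be sketched with NumPy's least-squares solver. The helper name `reconstruct_filters` and the (N, c, k) layout are assumptions for illustration; the rescaling at the end moves each channel's filter norm into β so that every kept W_i has unit Frobenius norm, as the constraint above requires.

```python
import numpy as np


def reconstruct_filters(X, Y, beta):
    """Fix beta and solve W by linear least squares (hypothetical helper).

    X: (N, c, k), Y: (N, n), beta: (c,). Returns the rescaled (W, beta).
    """
    N, c, k = X.shape
    # X' = [beta_1 X_1 ... beta_c X_c], shape (N, c * k)
    X_prime = (X * beta[None, :, None]).reshape(N, c * k)
    # lstsq returns W'^T with shape (c * k, n), minimizing ||Y - X' W'^T||_F
    W_prime_T, *_ = np.linalg.lstsq(X_prime, Y, rcond=None)
    W = W_prime_T.T.reshape(-1, c, k)  # back to (n, c, k)
    # Frobenius norm of each channel slice W_i
    norms = np.linalg.norm(W, axis=(0, 2))
    keep = (beta != 0) & (norms > 0)
    new_beta = np.where(keep, beta * norms, 0.0)
    scale = np.where(keep, norms, 1.0)
    W = W / scale[None, :, None]
    W[:, ~keep, :] = 0.0  # pruned channels carry no weights
    return W, new_beta
```

Because the norm is only moved between W_i and β_i, the product β_i W_i, and therefore the layer output, is unchanged by the rescaling.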
We alternately optimize (i) and (ii). In the beginning, W is initialized from the trained model, <<FORMULA>>, namely no penalty, and <<k = c>>. We gradually increase <<FORMULA>>. For each change of <<FORMULA>>, we iterate these two steps until k is stable.
Once <<FORMULA>> is satisfied, we obtain the final solution W from <<FORMULA>>. In practice, we found that the two-step iteration is time-consuming. So we apply (i) multiple times, until <<FORMULA>> is satisfied, and then apply (ii) just once to obtain the final result. From our observation, this result is comparable to that of the two-step iteration. Therefore, in the following experiments, we adopt this approach for efficiency.
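The simplified procedure (repeat step (i) with a growing penalty, then apply step (ii) once) might look like the sketch below. The name `prune_layer`, the doubling schedule for the penalty, and the starting value are assumptions; the text only says the penalty is increased gradually. The final least-squares fit is done directly on the surviving channels, which absorbs β into W.

```python
import numpy as np
from sklearn.linear_model import Lasso


def prune_layer(X, W, Y, c_target, lam=1e-3, growth=2.0, max_rounds=50):
    """Raise the penalty until at most c_target channels survive, then refit W once.

    X: (N, c, k), W: (n, c, k), Y: (N, n). Hypothetical helper.
    """
    N, c, k = X.shape
    Z = np.einsum("sck,ock->cso", X, W).reshape(c, -1).T  # (N * n, c)
    y = Y.reshape(-1)
    for _ in range(max_rounds):
        beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z, y).coef_
        if np.count_nonzero(beta) <= c_target:
            break
        lam *= growth  # assumed schedule, not taken from the source
    keep = np.flatnonzero(beta)
    if keep.size == 0:
        raise ValueError("penalty grew too fast: no channel survived")
    # Step (ii), applied once: least squares on the surviving channels only.
    X_kept = X[:, keep, :].reshape(N, -1)
    W_kept_T, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)
    W_new = W_kept_T.T.reshape(-1, keep.size, k)  # (n, c', k)
    return keep, W_new
```

A smaller `growth` factor gives finer control over how many channels remain, at the cost of more LASSO solves.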
Discussion: Some recent works [48, 1, 17] (though training-based) also introduce ℓ1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduce sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, but feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.
3.2. Whole Model Pruning
Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This could be formalized as:
<<FORMULA>> (5)
Different from Eqn. 1, Y is replaced by Y', which is taken from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.
3.3. Pruning Multi-Branch Networks
The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1 × 1, 3 × 3, 1 × 1, Fig. 3, left). All layers except the first and last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there are no parameters on the shortcut. To address these challenges, we propose several variants of our approach as follows.
<<FIGURE>>
Figure 3. Illustration of multi-branch enhancement for residual block. Left: original residual block. Right: pruned residual block with enhancement, cx denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width could be reduced. As for the last layer, rather than approximate Y2 , we try to approximate <<Y1+Y2>> directly (Sec. 3.3 Last layer of residual branch).
Last layer of residual branch: As shown in Fig. 3, the output layer of a residual block takes two inputs: feature maps Y1 and Y2 from the shortcut and residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1, Y2 are the original feature maps before pruning. Y2 could be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 cannot be recovered directly. To compensate for this error, the optimization goal of the last layer is changed from Y2 to Y1 − Y1' + Y2, which does not change our optimization. Here, Y1' is the current feature map after the previous layers are pruned. When pruning, volumes should be sampled correspondingly from these two branches.
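The changed regression target for the last layer can be written in a few lines. The function name and argument names are illustrative assumptions; the arrays stand for sampled output volumes of matching shape.

```python
import numpy as np


def last_layer_target(Y1, Y2, Y1_pruned):
    """Regression target for the last layer of the residual branch.

    Y1, Y2: shortcut and residual-branch outputs of the original model.
    Y1_pruned: shortcut output after the previous layers were pruned (Y1').
    """
    # The pruned block outputs Y1' + Y2_hat, so matching Y1 + Y2 requires
    # Y2_hat to approximate Y1 - Y1' + Y2.
    return Y1 - Y1_pruned + Y2
```

If the last layer hits this target exactly, the block output Y1' + Y2_hat equals the original Y1 + Y2, which is why the shortcut error is compensated without adding parameters to the shortcut.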
First layer of residual branch: As illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. The difference is that we sample the selected channels on the shared feature maps to construct a new input for the later convolution, shown in Fig. 3 (right). The computational cost of this operation is negligible. More importantly, after introducing feature map sampling, the convolution is still regular.
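Feature map sampling amounts to an index selection on the channel axis. The sketch below assumes a (batch, channels, height, width) layout and a hypothetical helper name; the shortcut keeps reading the full feature map while the residual branch reads only the selected channels.

```python
import numpy as np


def sample_feature_map(feature_map, kept_channels):
    """Select channels for the first convolution of the residual branch.

    feature_map: (batch, c, h, w) shared input of the residual block.
    kept_channels: indices chosen by channel selection (length c').
    """
    # Fancy indexing copies the selected channels; the shared map is untouched,
    # so the shortcut branch still sees all c channels.
    return feature_map[:, kept_channels, :, :]


shared = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(2, 4, 3, 3)
branch_input = sample_feature_map(shared, [0, 2])
```

The subsequent convolution then has c' input channels and remains an ordinary dense convolution.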
Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply our Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original one. In our experiments, it improves top-5 accuracy by 0.5% for 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it outputs irregular convolutional layers, which need special library implementation support. We do not adopt it in the following experiments.
4. Experiment
We evaluate our approach for the popular VGG Nets [43], ResNet [18], Xception [7] on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].
For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solver implementations. For channel pruning, we found that it is enough to extract 5000 images, and 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. Testing is on the center crop of 224 × 224 pixels. We could gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e-5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224 × 224 and mirror.
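Merging Batch Normalization into the preceding convolution is a standard inference-time transformation; a minimal sketch follows. The function name, the parameter names (`gamma`, `shift`, `mean`, `var`) and the `eps` default are assumptions, not taken from the source.

```python
import numpy as np


def fold_batchnorm(W, b, gamma, shift, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + shift into the conv.

    W: (n, c, kh, kw) filters, b: (n,) bias; BN parameters are (n,) each.
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel factor
    W_folded = W * scale[:, None, None, None]   # scale every filter
    b_folded = (b - mean) * scale + shift       # absorb mean and shift into bias
    return W_folded, b_folded
```

After folding, the convolution alone reproduces the output of convolution followed by Batch Normalization, so the network outputs are unchanged.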
4.1. Experiments with VGG-16
VGG-16 [43] is a 16-layer single-path convolutional neural network, with 13 convolutional layers. It is widely used in recognition, detection and segmentation. Single-view top-5 accuracy for VGG-16 is 89.9% (see footnote 1).
4.1.1 Single Layer Pruning
In this subsection, we evaluate single layer acceleration performance using our algorithm in Sec. 3.1. For better understanding, we compare our algorithm with two naive channel selection strategies. first k selects the first k channels. max response selects channels based on corresponding filters that have high absolute weights sum [31]. For fair comparison, we obtain the feature map indexes selected by each of them, then perform reconstruction (Sec. 3.1 (ii)). We hope that this demonstrates the importance of channel selection. Performance is measured by the increase of error after a certain layer is pruned without fine-tuning, shown in Fig. 4.
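The two naive baselines are simple enough to sketch. The helper names and the choice of scoring each channel by the summed absolute weights of the previous-layer filter that produces it are assumptions based on the description above.

```python
import numpy as np


def first_k(num_channels, k):
    """Keep the first k channels, ignoring the weights entirely."""
    return np.arange(min(k, num_channels))


def max_response(W_prev, k):
    """Keep the k channels whose producing filters have the largest |weight| sum.

    W_prev: (c, c_prev, kh, kw) filters of the previous layer, one per channel.
    """
    scores = np.abs(W_prev).sum(axis=(1, 2, 3))
    top = np.argsort(-scores)[:k]
    return np.sort(top)  # return indices in channel order
```

Neither baseline looks at the feature maps themselves, which is the information the LASSO-based selection exploits.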
As expected, error increases as the speed-up ratio increases. Our approach is consistently better than the other approaches in different convolutional layers under different speed-up ratios. Unexpectedly, sometimes max response is even worse than first k. We argue that max response ignores correlations between different filters. Filters with large absolute weights may be strongly correlated, so selection based on filter weights is less meaningful. Correlation on feature maps is worth exploiting. We find that channel selection affects reconstruction error a lot. Therefore, it is important for channel pruning.
Footnote 1: http://www.vlfeat.org/matconvnet/pretrained/
<<FIGURE>>
Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify the importance of channel selection referred to in Sec. 3.1, we considered two naive baselines. first k selects the first k feature maps. max response selects channels based on the absolute sum of the corresponding filter weights [31]. Our approach is consistently better (smaller is better).
<<TABLE>>
Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better).
Also notice that channel pruning gradually becomes harder from shallower to deeper layers. This indicates that shallower layers have much more redundancy, which is consistent with [52]. We could therefore prune more aggressively on shallower layers in whole model acceleration.
4.1.2 Whole Model Pruning
Table 1 shows whole model acceleration results under 2×, 4×, 5×. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively for shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1:1.5. conv5_x are not pruned, since they only contribute 9% of the computation in total and are not redundant.
After fine-tuning, we reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22], without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and the absolute speed-up ratio on GPU is much higher (Table 3).
Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52]. As demonstrated in Table 2, our 3 cardinalities acceleration (spatial factorization, channel factorization, and channel pruning, denoted by 3C) outperforms previous state-of-the-art methods. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: <<FORMULA>>.
<<TABLE>>
Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better).
We apply spatial factorization, channel factorization, and our channel pruning together sequentially layer-by-layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality.
4.1.3 Comparisons of Absolute Performance
We further evaluate the absolute performance of acceleration on GPU. Results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged over 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead, so they cannot gain much absolute speed-up. Though our approach also suffers some performance degradation, it generalizes better on GPU than other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because current libraries and hardware prefer a single large convolution over several small ones.
4.1.4 Comparisons with Training from Scratch
Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with from scratch counterparts. To be fair, we evaluated both the from scratch counterpart and a normal setting network that has the same computational complexity and the same architecture.
As shown in Table 4, we observed that it is difficult for from scratch counterparts to reach competitive accuracy. Our model outperforms the from scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] showing that a model could be easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers.
For from scratch (uniformed), the number of filters in each layer is reduced by half (e.g., conv1_1 is reduced from 64 to 32). We observe that normal setting networks of the same complexity cannot reach the same accuracy either. This consolidates our idea that there is much redundancy in networks during training. However, the redundancy can be removed at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches.
Notice that there is a 0.6% gap between the from scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even for a thinner one. Further research could leverage our approach for thin model exploration.
4.1.5 Acceleration for Detection
VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16 for Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainval images and 5k test images. The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN.
The actual running time of Faster R-CNN is 220ms / image. The convolutional layers contribute about 64%. We measured an actual time of 94ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop for our 2× model, which is not harmful in practice.
4.2. Experiments with Residual Architecture Nets
For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, 1 × 1 convolution is favored, which can hardly be factorized.
4.2.1 ResNet Pruning
ResNet complexity drops uniformly on each residual block. Guided by the single layer experiments (Sec. 4.1.1), we still prefer reducing shallower layers more heavily than deeper ones.
Following a similar setting to Filter pruning [31], we keep 70% of channels for sensitive residual blocks (res5 and blocks close to the position where the spatial size changes, e.g. res3a, res3d). For other blocks, we keep 30% of channels. With multi-branch enhancement, we prune branch2a more aggressively within each residual block. The ratio of remaining channels for branch2a, branch2b, branch2c is 2:4:3 (e.g., given 30%, we keep 40%, 80%, 60% respectively).
<<TABLE>>
Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better).
<<TABLE>>
Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms scratch trained counterparts (smaller is better).
<<TABLE>>
Table 5. Acceleration for Faster R-CNN detection.
<<TABLE>>
Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with multi-branch enhancement (Sec. 3.3, smaller is better).
We evaluate the performance of the multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve by 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which could otherwise propagate to every layer after it. In addition, the large input feature map width at the entry of each residual block is well reduced by our feature map sampling.
<<TABLE>>
Table 7. Comparisons for Xception-50, under 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches. Most structured simplification methods are not effective on the Xception architecture (smaller is better).
4.2.2 Xception Pruning
Since computational complexity has become important in model design, separable convolution has been paid much attention [49, 7]. Xception [7] is already spatially optimized, and tensor factorization on 1 × 1 convolutional layers is destructive. Thanks to our approach, it can still be accelerated with graceful degradation. For ease of comparison, we adopt Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and complexity of 4450 MFLOPs.
We apply the multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Maybe because the Xception block is unstable, Batch Normalization layers must be maintained during pruning. Otherwise it becomes nontrivial to fine-tune the pruned model.
As shown in Table 7, after fine-tuning, we only suffer a 1.0% increase of error under 2×. Filter pruning [31] could also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%. After training for 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as the results for VGG-16, since modern networks tend to have less redundancy by design.
<<TABLE>>
Table 8. 2× speed-up comparisons for ResNet-56 on CIFAR-10; the baseline accuracy is 92.8% (one view). We outperform previous approaches and the scratch trained counterpart (smaller is better).
4.2.3 Experiments on CIFAR-10
Even though our approach is designed for large datasets, it generalizes well to small datasets. We perform experiments on the CIFAR-10 dataset [25], which is favored by much acceleration research. It consists of 50k images for training and 10k for testing in 10 classes.
We reproduce ResNet-56, which has an accuracy of 92.8% (as a reference, the official ResNet-56 [18] has an accuracy of 93.0%). For 2× acceleration, we follow a similar setting to Sec. 4.2.1 (keeping the final stage unchanged, where the spatial size is 8 × 8). As shown in Table 8, our approach is competitive with the scratch trained one, without fine-tuning, under 2× speed-up. After fine-tuning, our result is significantly better than Filter pruning [31] and the scratch trained one.
5. Conclusion
To conclude, current deep CNNs are accurate but have high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep networks. The reduced CNNs are inference-efficient networks that maintain accuracy, and only require off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on ImageNet, CIFAR-10 and PASCAL VOC.
In the future, we plan to incorporate our approach into training time, instead of inference time only, which may also accelerate the training procedure.
References
[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262–2270, 2016.
[2] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.
[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[12] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. Eie: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016.
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015.
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[29] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015.
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013.
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53, 2008.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances In Neural Information Processing Systems, pages 2074–2082, 2016.
[49] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365–2369, 2013.
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Dept. IRO, Université de Montréal
P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
{bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca
Abstract
Convexity has recently received a lot of attention in the machine learning
community, and the lack of convexity has been seen as a major disadvantage
of many learning algorithms, such as multi-layer artificial neural
networks. We show that training multi-layer neural networks in which the
number of hidden units is learned can be viewed as a convex optimization
problem. This problem involves an infinite number of variables, but can be
solved by incrementally inserting a hidden unit at a time, each time finding
a linear classifier that minimizes a weighted sum of errors.
1 Introduction
The objective of this paper is not to present yet another learning algorithm, but rather to point
to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund
and Schapire, 1997) and convex optimization. Its main contributions concern the mathematical
analysis of an algorithm that is similar to previously proposed incremental NNs, with
L1 regularization on the output weights. This analysis helps to understand the underlying
convex optimization problem that one is trying to solve.
This paper was motivated by the unproven conjecture (based on anecdotal experience) that
when the number of hidden units is “large”, the resulting average error is rather insensitive to
the random initialization of the NN parameters. One way to justify this assertion is that to really
stay stuck in a local minimum, one must have second derivatives positive simultaneously
in all directions. When the number of hidden units is large, it seems implausible for none of
them to offer a descent direction. Although this paper does not prove or disprove the above
conjecture, in trying to do so we found an interesting characterization of the optimization
problem for NNs as a convex program if the output loss function is convex in the NN output
and if the output layer weights are regularized by a convex penalty. More specifically,
if the regularization is the L1 norm of the output layer weights, then we show that a “reasonable”
solution exists, involving a finite number of hidden units (no more than the number
of examples, and in practice typically much less). We present a theoretical algorithm that
is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted
one at a time. Each insertion requires solving a weighted classification problem, very much
like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason
et al., 2000; Friedman, 2001).
Neural Networks, Gradient Boosting, and Column Generation
Denote x̃ ∈ R^{d+1} the extension of vector x ∈ R^d with one element with value 1. What
we call “Neural Network” (NN) here is a predictor for supervised learning of the form
<<FORMULA>> where x is an input vector, <<h_i(x)>> is obtained from a linear discriminant
function h_i <<FORMULA>> with e.g. <<s(a) = sign(a)>>, or <<s(a) = tanh(a)>> or
<<s(a) = 1>>. A learning algorithm must specify how to select m, the
w_i's and the v_i's.
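As an illustration of this predictor, the sketch below assumes the additive form ŷ(x) = Σ_i w_i h_i(x) with h_i(x) = s(v_i · x̃). The formula itself is elided in the text above, so this form, the function name `nn_predict`, and the array shapes are assumptions consistent with the surrounding description rather than a transcription.

```python
import numpy as np


def nn_predict(x, V, w, s=np.tanh):
    """Evaluate y_hat(x) = sum_i w_i * s(v_i . x_tilde) (assumed form).

    x: (d,) input, V: (m, d + 1) hidden-unit weights v_i, w: (m,) output weights.
    """
    x_tilde = np.append(x, 1.0)  # extend x with a constant 1 (bias input)
    hidden = s(V @ x_tilde)      # h_i(x) for every hidden unit
    return float(w @ hidden)


# Two hidden units with s(a) = sign(a): pre-activations are 1 and 0.
V = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, -2.0]])
w = np.array([2.0, 5.0])
y_hat = nn_predict(np.array([1.0, 2.0]), V, w, s=np.sign)
```

Selecting m, the w_i's and the v_i's then corresponds to choosing the number of rows of `V`, the entries of `w`, and the rows of `V`.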
The classical solution (Rumelhart, Hinton and Williams, 1986) involves (a) selecting a loss
function Q(ŷ, y) that specifies how to penalize for mismatches between ŷ(x) and the ob-
served y's (target output or target class), (b) optionally selecting a regularization penalty that
favors “small” parameters, and (c) choosing a method to approximately minimize the sum of
the losses on the training data D = {(x_1, y_1), ..., (x_n, y_n)} plus the regularization penalty.
Note that in this formulation, an output non-linearity can still be used, by inserting it in the
loss function Q. Examples of such loss functions are the quadratic loss ||ŷ − y||^2, the hinge
loss <<FORMULA>> (used in SVMs), the cross-entropy loss <<FORMULA>>
(used in logistic regression), and the exponential loss <<FORMULA>> (used in Boosting).
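The four losses named above can be written down in a few lines. The sketch below uses their usual textbook forms for a scalar prediction and a target in {-1, +1}; the exact expressions used in the paper are not visible here, so these forms (in particular the logistic form of the cross-entropy) are assumptions, and the function names are hypothetical.

```python
import math


def quadratic_loss(y_hat, y):
    """Squared error between prediction and target."""
    return (y_hat - y) ** 2


def hinge_loss(y_hat, y):
    """Hinge loss as used in SVMs; y is assumed to be -1 or +1."""
    return max(0.0, 1.0 - y * y_hat)


def logistic_loss(y_hat, y):
    """Cross-entropy in its logistic form (assumed variant); y is -1 or +1."""
    return math.log1p(math.exp(-y * y_hat))


def exponential_loss(y_hat, y):
    """Exponential loss as used in Boosting; y is -1 or +1."""
    return math.exp(-y * y_hat)


def penalized_risk(loss, predictions, targets, penalty):
    """Sum of per-example losses plus a regularization penalty value."""
    return sum(loss(p, t) for p, t in zip(predictions, targets)) + penalty
```

All four are convex in their first argument, which is the property the rest of the analysis relies on.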
Gradient Boosting was introduced in (Friedman, 2001) and (Mason et al., 2000) as a
non-parametric greedy-stagewise supervised learning algorithm in which one adds a function
at a time to the current solution <<ŷ(x)>>, in a steepest-descent fashion, to form an additive model
as above but with the functions h_i typically taken in other kinds of sets of functions, such as
those obtained with decision trees. In a stagewise approach, when the (m+1)-th basis <<FORMULA>> is added,
only <<w_{m+1}>> is optimized (by a line search), like in matching pursuit algorithms. Such
a greedy-stagewise approach is also at the basis of Boosting algorithms (Freund and Schapire,
1997), which are usually applied using decision trees as bases and Q the exponential loss.
It may be difficult to minimize exactly for w_{m+1} and h_{m+1} when the previous bases and
weights are fixed, so (Friedman, 2001) proposes to “follow the gradient” in function space,
i.e., look for a base learner h_{m+1} that is best correlated with the gradient of the average
loss on the <<FORMULA>> (that would be the residue <<FORMULA>> in the case of the square loss). The
algorithm analyzed here also involves maximizing the correlation between Q' (the derivative
of Q with respect to its first argument, evaluated on the training predictions) and the next
basis h_{m+1}. However, we follow a “stepwise”, less greedy, approach, in which all the output
weights are optimized at each step, in order to obtain convergence guarantees.
Our approach adapts the Column Generation principle (Chvátal, 1983), a decomposition
technique initially proposed for solving linear programs with many variables and few con-
straints. In this framework, active variables, or “columns”, are only generated as they are
required to decrease the objective. In several implementations, the column-generation sub-
problem is frequently a combinatorial problem for which efficient algorithms are available.
In our case, the subproblem corresponds to determining an “optimal” linear classifier.
2 Core Ideas
Informally, consider the set H of all possible hidden unit functions (i.e., of all possible hidden
unit weight vectors v_i). Imagine a NN that has all the elements in this set as hidden units. We
might want to impose precision limitations on those weights to obtain either a countable or
even a finite set. For such a NN, we only need to learn the output weights. If we end up with
a finite number of non-zero output weights, we will have at the end an ordinary feedforward
NN. This can be achieved by using a regularization penalty on the output weights that yields
sparse solutions, such as the L1 penalty. If in addition the loss function is convex in the output
layer weights (which is the case of squared error, hinge loss, ε-tube regression loss, and
logistic or softmax cross-entropy), then it is easy to show that the overall training criterion
is convex in the parameters (which are now only the output weights). The only problem is
that there are as many variables in this convex program as there are elements in the set H,
which may be very large (possibly infinite). However, we find that with L1 regularization,
a finite solution is obtained, and that such a solution can be obtained by greedily inserting
one hidden unit at a time. Furthermore, it is theoretically possible to check that the global
optimum has been reached.
Definition 2.1. Let H be a set of functions from an input space X to R. Elements of H
can be understood as “hidden units” in a NN. Let W be the Hilbert space of functions from
H to R, with an inner product denoted by <<FORMULA>>. An element of W can be
understood as the output weights vector in a neural network. Let <<h(x): H -> R>> be the function
that maps any element <<h_i>> of H to <<h_i(x)>>. <<h(x)>> can be understood as the vector of activations
of hidden units when input x is observed. Let w ∈ W represent a parameter (the output
weights). The NN prediction is denoted <<FORMULA>>. Let <<Q: R x R -> R>> be a
cost function convex in its first argument that takes a scalar prediction ŷ(x) and a scalar
target value y and returns a scalar cost. This is the cost to be minimized on example pair
(x, y). Let <<FORMULA>> be the training set. Let <<FORMULA>> be a convex
regularization functional that penalizes for the choice of more “complex” parameters (e.g.,
<<FORMULA>> according to a 1-norm in W, if H is countable). We define the convex NN
criterion C(H, Q, Ω, D, w) with parameter w as follows:
<<FORMULA>> (1)
The following is a trivial lemma, but it is conceptually very important as it is the basis for the
rest of the analysis in this paper.
Lemma 2.2. The convex NN cost <<FORMULA>> is a convex function of w.
Proof. <<FORMULA>> is convex in w and the regularization functional is convex in w, by the above
construction. C is additive in <<FORMULA>> and additive in the regularization term. Hence C is convex in w.
Note that there are no constraints in this convex optimization program, so that at the global
minimum all the partial derivatives of C with respect to elements of w cancel.
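The convexity claim can be checked numerically on a toy instance. The sketch below assumes a finite set of hidden units, the hinge loss, and an L1 penalty with coefficient `lam`, so that the criterion is the sum of hinge losses plus `lam` times the 1-norm of the output weights; the names `criterion` and `convexity_holds` are hypothetical, and random testing only illustrates the lemma, it does not prove it.

```python
import random


def criterion(w, H, y, lam):
    """Hinge loss summed over examples plus lam * ||w||_1.

    H[t] is the list of hidden-unit activations h(x_t) for example t.
    """
    total = 0.0
    for h_t, y_t in zip(H, y):
        y_hat = sum(w_i * h_i for w_i, h_i in zip(w, h_t))
        total += max(0.0, 1.0 - y_t * y_hat)
    return total + lam * sum(abs(w_i) for w_i in w)


def convexity_holds(H, y, lam, trials=1000, seed=0):
    """Test C(e*w1 + (1-e)*w2) <= e*C(w1) + (1-e)*C(w2) on random pairs."""
    rng = random.Random(seed)
    m = len(H[0])
    for _ in range(trials):
        w1 = [rng.uniform(-3.0, 3.0) for _ in range(m)]
        w2 = [rng.uniform(-3.0, 3.0) for _ in range(m)]
        e = rng.random()
        mix = [e * a + (1.0 - e) * b for a, b in zip(w1, w2)]
        lhs = criterion(mix, H, y, lam)
        rhs = e * criterion(w1, H, y, lam) + (1.0 - e) * criterion(w2, H, y, lam)
        if lhs > rhs + 1e-9:  # small tolerance for rounding
            return False
    return True


# Toy data: four examples, three hard-threshold hidden units.
H_toy = [[1, -1, 1], [1, 1, -1], [-1, 1, 1], [-1, -1, -1]]
y_toy = [1, 1, -1, -1]
```

Calling `convexity_holds(H_toy, y_toy, 0.5)` should never find a violating pair, since each term of the criterion is convex in w.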
Let |H| be the cardinality of the set H. If it is not finite, it is not obvious that an optimal
solution can be achieved in finitely many iterations.
Lemma 2.2 says that training NNs from a very large class (with one or more hidden layers)
can be seen as convex optimization problems, usually in a very high dimensional space, as
long as we allow the number of hidden units to be selected by the learning algorithm.
By choosing a regularizer that promotes sparse solutions, we obtain a solution that has a
finite number of “active” hidden units (non-zero entries in the output weights vector w).
This assertion is proven below, in Theorem 3.1, for the case of the hinge loss.
However, even if the solution involves a finite number of active hidden units, the convex
optimization problem could still be computationally intractable because of the large number
of variables involved. One approach to this problem is to apply the principles already suc-
cessfully embedded in Gradient Boosting, but more specifically in Column Generation (an
optimization technique for very large scale linear programs), i.e., add one hidden unit at a
time in an incremental fashion. The important ingredient here is a way to know that we
have reached the global optimum, without actually having to visit all the possible
hidden units. We show that this can be achieved as long as we can solve the sub-problem
of finding a linear classifier that minimizes the weighted sum of classification errors. This
can be done exactly only on low dimensional data sets but can be well approached using
weighted linear SVMs, weighted logistic regression, or Perceptron-type algorithms.
Another idea (not followed up here) would be to consider first a smaller set H_1, for which
the convex problem can be solved in polynomial time, and whose solution can theoretically
be selected as initialization for minimizing the criterion <<FORMULA>>, with <<FORMULA>>,
and where H_2 may have infinite cardinality (countable or not). In this way we could show
that we can find a solution whose cost satisfies <<FORMULA>>,
i.e., is at least as good as the solution of a more restricted convex optimization problem. The
second minimization can be performed with a local descent algorithm, without the necessity
to guarantee that the global optimum will be found.
3 Finite Number of Hidden Neurons
In this section we consider the special case with <<FORMULA>> the hinge loss,
and <<L1>> regularization, and we show that the global optimum of the convex cost involves at
most n + 1 hidden neurons, using an approach already exploited in (Rätsch, Demiriz and
Bennett, 2002) for L1-loss regression Boosting with L1 regularization of output weights.
The training criterion is <<FORMULA>>. Let us rewrite this cost function as the
constrained optimization problem:
<<FORMULA>> (C1)
<<FORMULA>> (C2)
Using a standard technique, the above program can be recast as a linear program. Defin-
ing <<FORMULA>> the vector of Lagrangian multipliers for the constraints C1, its dual
problem (P) takes the form (in the case of a finite number J of base learners):
<<FORMULA>>
In the case of a finite number J of base learners, <<FORMULA>>. If
the number of hidden units is uncountable, then I is a closed bounded interval of R.
Such an optimization problem satisfies all the conditions needed for using Theorem 4.2
from (Hettich and Kortanek, 1993). Indeed:
I is compact (as a closed bounded interval of R);
<<FORMULA>> is a concave function (it is even a linear function);
<<FORMULA>> is convex in the Lagrange multipliers (it is actually linear in them);
<<FORMULA>> (therefore finite) (the value of (P) is the largest value of F satisfying the constraints);
for every set of n+1 points <<FORMULA>>, there exists a vector of multipliers such that <<FORMULA>> for
<<FORMULA>> (one can take <<FORMULA>> since K > 0).
Then, from Theorem 4.2 from (Hettich and Kortanek, 1993), the following theorem holds:
Theorem 3.1. The solution of (P) can be attained with constraints C'_2 and only n+1 constraints C'_1
(i.e., there exists a subset of n+1 constraints C'_1 giving rise to the same maximum
as when using the whole set of constraints). Therefore, the primal problem associated is the
minimization of the cost function of a NN with n+1 hidden neurons.
4 Incremental Convex NN Algorithm
In this section we present a stepwise algorithm to optimize a NN, and show that there is a cri-
terion that allows us to verify whether the global optimum has been reached. This is a specializa-
tion of minimizing <<FORMULA>>, with <<FORMULA>> and <<FORMULA>>
is the set of soft or hard linear classifiers (depending on choice of s(·)).
Algorithm ConvexNN(D, Q, λ, s)
<<ALGORITHM>>
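The stepwise idea (insert the hidden unit best correlated with the loss gradient, re-optimize all output weights, stop when no unit can decrease the cost) can be sketched on a toy problem. The code below is an illustration under several assumptions that are not taken from the paper: the candidate hidden units form a small finite list, the loss is the squared error, the penalty is `lam` times the 1-norm of the output weights, signed weights are used instead of a set closed under negation, there is no bias term, and the output weights are fitted by plain coordinate descent. All function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t (the closed-form step for an L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_output_weights(H, y, lam, w, sweeps=200):
    """Coordinate descent on sum_t (y_hat_t - y_t)^2 + lam * sum_j |w_j|.

    H[j][t] is the activation of selected unit j on example t; the hidden
    units themselves stay fixed, so this problem is convex in w.
    """
    n = len(y)
    for _ in range(sweeps):
        for j in range(len(H)):
            # Residual of the fit with unit j left out.
            r = [y[t] - sum(w[k] * H[k][t] for k in range(len(H)) if k != j)
                 for t in range(n)]
            rho = sum(H[j][t] * r[t] for t in range(n))
            norm = sum(h * h for h in H[j])
            w[j] = soft_threshold(rho, lam / 2.0) / norm if norm > 0 else 0.0
    return w


def convex_nn(candidates, X, y, lam):
    """Greedy insertion of hidden units with a global stopping test."""
    chosen, H, w = [], [], []
    n = len(y)
    while len(chosen) < len(candidates):
        y_hat = [sum(w[j] * H[j][t] for j in range(len(H))) for t in range(n)]
        q = [2.0 * (y_hat[t] - y[t]) for t in range(n)]  # derivative of squared loss
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        scores = {i: abs(sum(q[t] * candidates[i](X[t]) for t in range(n)))
                  for i in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] <= lam + 1e-9:
            break  # no unused unit can decrease the penalized cost
        chosen.append(best)
        H.append([candidates[best](x) for x in X])
        w.append(0.0)
        w = fit_output_weights(H, y, lam, w)
    return chosen, w


# Toy 1-D problem with three hard-threshold candidate units.
X_toy = [-2.0, -1.0, 1.0, 2.0]
y_toy = [-1.0, -1.0, 1.0, 1.0]
candidates_toy = [
    lambda x: 1.0 if x > 0.0 else -1.0,
    lambda x: 1.0 if x > 1.5 else -1.0,
    lambda x: 1.0,
]
chosen_toy, w_toy = convex_nn(candidates_toy, X_toy, y_toy, lam=1.0)
```

On this toy problem a single unit (the threshold at 0) is inserted and the search then stops, because the remaining candidates have a gradient correlation below `lam`.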
Theorem 4.1. Algorithm ConvexNN stops when it reaches the global optimum of
<<FORMULA>>.
Proof. Let w be the output weights vector when the algorithm stops. Because the set of
hidden units H we consider is such that when h is in H, −h is also in H, we can assume
all weights to be non-negative. By contradiction, if w' ≠ w is the global optimum, with
<<C(w') < C(w)>>, then, since C is convex in the output weights, for any ε ∈ (0,1), we have
<<FORMULA>>. Write w_ε for the convex combination εw' + (1 − ε)w. For
ε small enough, we can assume all weights in w that are strictly positive to be also strictly
positive in w_ε. Let us denote by I_p the set of strictly positive weights in w (and w_ε), by
I_z the set of weights set to zero in w but to a non-zero value in w_ε, and by δ_{ε,k} the difference
w_{ε,k} − w_k in the weight of hidden unit h_k between w and w_ε. We can assume δ_{ε,j} < 0 for
j ∈ I_z, because instead of setting a small positive weight to h_j, one can decrease the weight
of −h_j by the same amount, which will give either the same cost, or possibly a lower one
when the weight of <<FORMULA>> is positive. With o(ε) denoting a quantity such that o(ε)/ε → 0
when ε → 0, the difference Δ_ε(w) = C(w_ε) − C(w) can now be written:
<<FORMULA>>
since for i ∈ I_p, thanks to step (7) of the algorithm, we have ∂C/∂w_i (w) = 0. Thus the
inequality <<FORMULA>> rewrites into <<FORMULA>>
which, when ε → 0, yields (note that <<FORMULA>> does not depend on ε since δ_{ε,j} is linear in ε):
<<FORMULA>> (2)
h_i being the optimal classifier chosen in step (5a) or (5c), all hidden units <<FORMULA>> verify <<FORMULA>>
<<FORMULA>>
<<FORMULA>>, contradicting eq. 2.
(Mason et al., 2000) prove a related global convergence result for the AnyBoost algorithm,
a non-parametric Boosting algorithm that is also similar to Gradient Boosting (Friedman,
2001). Again, this requires solving as a sub-problem an exact minimization to find a function
h_i ∈ H that is maximally correlated with the gradient Q' on the output. We now show a
simple procedure to select a hyperplane with the best weighted classification error.
Exact Minimization
In step (5a) we are required to find a linear classifier that minimizes the weighted sum of
classification errors. Unfortunately, this is an NP-hard problem (w.r.t. d, see theorem 4
in (Marcotte and Savard, 1992)). However, an exact solution can be easily found in O(n^3)
computations for d = 2 inputs.
Proposition 4.2. Finding a linear classifier that minimizes the weighted sum of classification
errors can be achieved in O(n^3) steps when the input dimension is d = 2.
Proof. We want to <<FORMULA>> with respect to u and b, the c's being
in <<FORMULA>>. Consider u fixed and sort the x_i's according to their dot product with u and denote r
the function which maps i to r(i) such that x_{r(i)} is in i-th position in the sort. Depending on
the value of b, we will have n+1 possible sums, respectively <<FORMULA>>,
<<FORMULA>>. It is obvious that those sums only depend on the order of the products <<FORMULA>>,
<<FORMULA>>. When u varies smoothly on the unit circle, as the dot product is a continuous
function of its arguments, the changes in the order of the dot products will occur only when
there is a pair (i,j) such that <<FORMULA>>. Therefore, there are at most as many order
changes as there are pairs of different points, i.e., <<FORMULA>>. In the case of d = 2, we
can enumerate all the different angles for which there is a change, namely a_1, ..., a_z with
<<FORMULA>>. We then need to test at least one <<FORMULA>> for each interval a_i <
<<FORMULA>>, and also one u for <<FORMULA>>, which makes a total of <<FORMULA>> possibilities.
It is possible to generalize this result in higher dimensions, and as shown in (Marcotte and
Savard, 1992), one can achieve O(log(n) n^d) time.
Algorithm 1 Optimal linear classifier search
<<ALGORITHM>>
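The enumeration argument of Proposition 4.2 can be turned into a short brute-force search for d = 2. The sketch below is an illustration, not the paper's listing: it assumes the objective is to maximize the sum over examples of c_i * sign(u . x_i + b) for real weights c_i, it tests one direction u strictly inside every interval between consecutive critical angles, and for each direction it scans the n+1 thresholds. Because each direction is re-sorted, it runs in O(n^3 log n) rather than O(n^3). The function names are hypothetical, and at least one point is assumed.

```python
import math


def best_threshold(points, c, u):
    """For a fixed direction u, scan the n+1 thresholds.

    Returns (score, b) maximizing sum_i c_i * sign(u . x_i + b).
    """
    proj = sorted((u[0] * x[0] + u[1] * x[1], c_i) for x, c_i in zip(points, c))
    # Threshold below every projection: all points are classified +1.
    score = sum(c_i for _, c_i in proj)
    best = (score, -(proj[0][0] - 1.0))
    for k in range(len(proj)):
        # Moving the threshold past the k-th sorted point flips it to -1.
        score -= 2 * proj[k][1]
        upper = proj[k + 1][0] if k + 1 < len(proj) else proj[k][0] + 2.0
        # Skip cuts that would fall between two tied projections.
        if upper > proj[k][0] and score > best[0]:
            best = (score, -0.5 * (proj[k][0] + upper))
    return best


def optimal_linear_classifier_2d(points, c):
    """Exhaustive search over directions between critical angles (d = 2)."""
    two_pi = 2.0 * math.pi
    angles = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if dx == 0 and dy == 0:
                continue
            # Direction orthogonal to x_i - x_j: the order of the pair flips here.
            a = math.atan2(dx, -dy) % two_pi
            angles.add(a)
            angles.add((a + math.pi) % two_pi)
    angles = sorted(angles)
    if angles:
        candidates = [0.5 * (a + b) for a, b in zip(angles, angles[1:])]
        candidates.append(0.5 * (angles[-1] + angles[0] + two_pi))  # wrap-around
    else:
        candidates = [0.0]
    best = None
    for theta in candidates:
        u = (math.cos(theta), math.sin(theta))
        score, b = best_threshold(points, c, u)
        if best is None or score > best[0]:
            best = (score, u, b)
    return best


square = [(0, 0), (1, 0), (0, 1), (1, 1)]
separable_score = optimal_linear_classifier_2d(square, [-1, -1, 1, 1])[0]
xor_score = optimal_linear_classifier_2d(square, [1, -1, -1, 1])[0]
```

With the four corners of the unit square, a linearly separable weighting reaches the maximal score of 4, while an XOR-like weighting can only reach 2, since no line classifies all four corners correctly.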
Approximate Minimization
For data in higher dimensions, the exact minimization scheme to find the optimal linear
classifier is not practical. Therefore it is interesting to consider approximate schemes for
obtaining a linear classifier with weighted costs. Popular schemes for doing so are the linear
SVM (i.e., linear classifier with hinge loss), the logistic regression classifier, and variants of
the Perceptron algorithm. In that case, step (5c) of the algorithm is not an exact minimization,
and one cannot guarantee that the global optimum will be reached. However, it might be
reasonable to believe that finding a linear classifier by minimizing a weighted hinge loss
should yield solutions close to the exact minimization. Unfortunately, this is not generally
true, as we have found out on a simple toy data set described below. On the other hand,
if in step (7) one performs an optimization not only of the output weights w_j (j ≤ i) but
also of the corresponding weight vectors v_j, then the algorithm finds a solution close to the
global optimum (we could only verify this on 2-D data sets, where the exact solution can be
computed easily). It means that at the end of each stage, one first performs a few training
iterations of the whole NN (for the hidden units j ≤ i) with an ordinary gradient descent
mechanism (we used conjugate gradients but stochastic gradient descent would work too),
optimizing the w_j's and the v_j's, and then one fixes the v_j's and obtains the optimal w_j's for
these v_j's (using a convex optimization procedure). In our experiments we used a quadratic
Q, for which the optimization of the output weights can be done with a neural network, using
the outputs of the hidden layer as inputs.
Let us consider now a bit more carefully what it means to tune the v_j's in step (7). Indeed,
changing the weight vector v_j of a selected hidden neuron to decrease the cost is equivalent
to a change in the output weights w. More precisely, consider the step in which the
value of v_j becomes v'_j. This is equivalent to the following operation on the w's, when w_j
is the corresponding output weight value: the output weight associated with the value v_j of
a hidden neuron is set to 0, and the output weight associated with the value v'_j of a hidden
neuron is set to w_j. This corresponds to an exchange between two variables in the convex
program. We are justified to take any such step as long as it allows us to decrease the cost
C(w). The fact that we are simultaneously making such exchanges on all the hidden units
when we tune the v_j's allows us to move faster towards the global optimum.
Extension to multiple outputs
The multiple outputs case is more involved than the single-output case because it is not
enough to check the condition <<FORMULA>>. Consider a new hidden neuron whose output is
h_i when the input is x_i. Let us also denote <<FORMULA>> the vector of output weights
between the new hidden neuron and the <<FORMULA>> output neurons. The gradient with respect to w_j
is <<FORMULA>> with <<FORMULA>> the value of the j-th output neuron with input <<FORMULA>>.
This means that if, for a given j, we have <<FORMULA>>, moving w_j away from 0 can
only increase the cost. Therefore, the right quantity to consider is <<FORMULA>>.
We must therefore find <<FORMULA>>. As before, this sub-problem is not convex, but it is not
as obvious how to approximate it by a convex problem. The stopping criterion becomes: if there is no j
such that <<FORMULA>>, then all weights must remain equal to 0 and a global minimum is reached.
Experimental Results
We performed experiments on the 2-D double moon toy dataset (as used in (Delalleau, Ben-
gio and Le Roux, 2005)), to be able to compare with the exact version of the algorithm. In
these experiments, <<FORMULA>>. The set-up is the following:
Select a new linear classifier, either (a) the optimal one or (b) an approximate one using logistic
regression.
Optimize the output weights using a convex optimizer.
In case (b), tune both input and output weights by conjugate gradient descent on C and
finally re-optimize the output weights using LASSO regression.
Optionally, remove neurons whose output weight has been set to 0.
Using the approximate algorithm yielded for 100 training examples an average penalized
(λ = 1) squared error of 17.11 (over 10 runs), an average test classification error of 3.68%
and an average number of neurons of 5.5. The exact algorithm yielded a penalized squared
error of 8.09, an average test classification error of 5.3%, and required 3 hidden neurons. A
penalty of λ = 1 was nearly optimal for the exact algorithm whereas a smaller penalty further
improved the test classification error of the approximate algorithm. Besides, when running
the approximate algorithm for a long time, it converges to a solution whose quadratic error is
extremely close to the one of the exact algorithm.
5 Conclusion
We have shown that training a NN can be seen as a convex optimization problem, and have
analyzed an algorithm that can exactly or approximately solve this problem. We have shown
that the solution with the hinge loss involved a number of non-zero weights bounded by
the number of examples, and much smaller in practice. We have shown that there exists a
stopping criterion to verify if the global optimum has been reached, but it involves solving a
sub-learning problem involving a linear classifier with weighted errors, which can be computationally
hard if the exact solution is sought, but can be easily implemented for toy data
sets (in low dimension), for comparing exact and approximate solutions.
The above experimental results are in agreement with our initial conjecture: when there are
many hidden units we are much less likely to stall in the optimization procedure, because
there are many more ways to descend on the convex cost C(w). They also suggest, based
on experiments in which we can compare with the exact sub-problem minimization, that
applying Algorithm ConvexNN with an approximate minimization for adding each hidden
units while continuing to tune the previous hidden units tends to lead to fast convergence
to the global minimum. What can get us stuck in a “local minimum” (in the traditional sense,
i.e., of optimizing w's and v's together) is simply the inability to find a new hidden unit
weight vector that can improve the total cost (fit and regularization term) even if there
exists one.
Note that as a side-effect of the results presented here, we have a simple way to train neural
networks with hard-threshold hidden units, since increasing <<FORMULA>> can be either achieved
exactly (at great price) or approximately (e.g. by using a cross-entropy
or hinge loss on the corresponding linear classifier).
Acknowledgments
The authors thank the following for support: NSERC, MITACS, and the Canada Research
Chairs. They are also grateful for the feedback and stimulating exchanges with Sam Roweis,
Nathan Srebro, and Aaron Courville.
References
Chvátal, V. (1983). Linear Programming. W.H. Freeman.
Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction
in semi-supervised learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of
AISTATS 2005, pages 96–103.
Freund, Y. and Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of
Statistics, 29:1180.
Hettich, R. and Kortanek, K. (1993). Semi-infinite programming: theory, methods, and applications.
SIAM Review, 35(3):380–429.
Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für
Operations Research (Theory), 36:517–545.
Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Boosting algorithms as gradient descent.
In Advances in Neural Information Processing Systems 12, pages 512–518.
Rätsch, G., Demiriz, A., and Bennett, K. P. (2002). Sparse regression ensembles in infinite and finite
hypothesis spaces. Machine Learning.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating
errors. Nature, 323:533–536.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
DEEP COMPRESSION: COMPRESSING DEEP NEURAL
NETWORKS WITH PRUNING, TRAINED QUANTIZATION
AND HUFFMAN CODING
Song Han
Stanford University, Stanford, CA 94305, USA
songhan@stanford.edu
Huizi Mao
Tsinghua University, Beijing, 100084, China
mhz12@mails.tsinghua.edu.cn
William J. Dally
Stanford University, Stanford, CA 94305, USA
NVIDIA, Santa Clara, CA 95050, USA
dally@stanford.edu
ABSTRACT
Neural networks are both computationally intensive and memory intensive, making
them difficult to deploy on embedded systems with limited hardware resources. To
address this limitation, we introduce “deep compression”, a three stage pipeline:
pruning, trained quantization and Huffman coding, that work together to reduce
the storage requirement of neural networks by 35× to 49× without affecting their
accuracy. Our method first prunes the network by learning only the important
connections. Next, we quantize the weights to enforce weight sharing; finally, we
apply Huffman coding. After the first two steps we retrain the network to fine
tune the remaining connections and the quantized centroids. Pruning reduces the
number of connections by 9× to 13×; quantization then reduces the number of
bits that represent each connection from 32 to 5. On the ImageNet dataset, our
method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB,
without loss of accuracy. Our method reduced the size of VGG-16 by 49×, from
552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model
into on-chip SRAM cache rather than off-chip DRAM memory. Our compression
method also facilitates the use of complex neural networks in mobile applications
where application size and download bandwidth are constrained. Benchmarked on
CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup
and 3× to 7× better energy efficiency.
1 INTRODUCTION
Deep neural networks have become the state-of-the-art technique for computer vision tasks
(Krizhevsky et al., 2012; Simonyan & Zisserman, 2014). Though these neural networks are very
powerful, the large number of weights consumes considerable storage and memory bandwidth. For
example, the AlexNet Caffemodel is over 200MB, and the VGG-16 Caffemodel is over 500MB
(BVLC). This makes it difficult to deploy deep neural networks on mobile systems.
First, for many mobile-first companies such as Baidu and Facebook, various apps are updated via
different app stores, and they are very sensitive to the size of the binary files. For example, App
Store has the restriction “apps above 100 MB will not download until you connect to Wi-Fi”. As a
result, a feature that increases the binary size by 100MB will receive much more scrutiny than one
that increases it by 10MB. Although having deep neural networks running on mobile has many great
features such as better privacy, less network bandwidth and real time processing, the large storage
overhead prevents deep neural networks from being incorporated into mobile apps.
<<FIGURE>>
Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning
reduces the number of weights by 10×, while quantization further improves the compression rate:
between 27× and 31×. Huffman coding gives more compression: between 35× and 49×. The
compression rate already includes the meta-data for sparse representation. The compression scheme
doesn't incur any accuracy loss.
The second issue is energy consumption. Running large neural networks requires a lot of memory
bandwidth to fetch the weights and a lot of computation to do dot products, which in turn consumes
considerable energy. Mobile devices are battery constrained, making power hungry applications such
as deep neural networks hard to deploy.
Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32-bit
floating point add consumes 0.9pJ, a 32-bit SRAM cache access takes 5pJ, while a 32-bit DRAM
memory access takes 640pJ, which is 3 orders of magnitude more than an add operation. Large networks
do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion
connection neural network, for example, at 20fps would require (20Hz)(1G)(640pJ) = 12.8W just
for DRAM access - well beyond the power envelope of a typical mobile device.
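The power estimate above is a one-line product, reproduced here as a small check. The per-operation energies are the figures quoted in the text; the function and constant names are hypothetical, and one DRAM access per connection per frame is the assumption behind the estimate.

```python
PICOJOULE = 1e-12  # joules

ADD_ENERGY = 0.9 * PICOJOULE    # 32-bit floating point add
SRAM_ENERGY = 5.0 * PICOJOULE   # 32-bit SRAM cache access
DRAM_ENERGY = 640.0 * PICOJOULE  # 32-bit DRAM access


def dram_power_watts(connections, frames_per_second, energy_per_access=DRAM_ENERGY):
    """Power spent fetching one weight per connection per frame from DRAM."""
    return frames_per_second * connections * energy_per_access


power = dram_power_watts(connections=1e9, frames_per_second=20)
dram_vs_add = DRAM_ENERGY / ADD_ENERGY  # roughly 711, i.e. about three orders of magnitude
```

Swapping `energy_per_access` for `SRAM_ENERGY` gives 0.1W for the same workload, which is the motivation for making the model small enough to stay on chip.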
Our goal is to reduce the storage and energy required to run inference on such large networks so they
can be deployed on mobile devices. To achieve this goal, we present “deep compression”: a three-
stage pipeline (Figure 1) to reduce the storage required by neural networks in a manner that preserves
the original accuracy. First, we prune the network by removing the redundant connections, keeping
only the most informative connections. Next, the weights are quantized so that multiple connections
share the same weight, thus only the codebook (effective weights) and the indices need to be stored.
Finally, we apply Huffman coding to take advantage of the biased distribution of effective weights.
Our main insight is that pruning and trained quantization are able to compress the network without
interfering with each other, thus leading to a surprisingly high compression rate. It makes the required storage
so small (a few megabytes) that all weights can be cached on chip instead of going to off-chip DRAM,
which is energy consuming. Based on “deep compression”, the EIE hardware accelerator (Han et al.,
2016) was later proposed that works on the compressed model, achieving significant speedup and
energy efficiency improvement.
2 NETWORK PRUNING
Network pruning has been widely studied to compress CNN models. In early work, network pruning
proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989;
Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently Han et al. (2015) pruned state-
of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on
the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we
prune the small-weight connections: all connections with weights below a threshold are removed
from the network. Finally, we retrain the network to learn the final weights for the remaining sparse
connections. Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16
models, respectively.
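The prune-then-retrain step can be sketched with plain lists. This is a minimal illustration, assuming that "below a threshold" refers to the absolute value of a weight and that retraining keeps pruned connections at zero by masking their updates; the function names and the plain SGD step are hypothetical stand-ins for a real training loop.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out weights whose magnitude is below the threshold.

    Returns the pruned matrix and a 0/1 mask of surviving connections.
    """
    mask = [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mask_row)]
              for row, mask_row in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights, grads, mask, lr):
    """One retraining step that only updates the surviving connections."""
    return [[w - lr * g * m for w, g, m in zip(w_row, g_row, m_row)]
            for w_row, g_row, m_row in zip(weights, grads, mask)]


dense = [[0.5, -0.05], [0.01, -0.9]]
pruned, mask = prune_by_magnitude(dense, threshold=0.1)
retrained = masked_sgd_step(pruned, [[1.0, 1.0], [1.0, 1.0]], mask, lr=0.1)
```

Here two of the four connections survive, and the masked update leaves the pruned entries at exactly zero.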
<<FIGURE>>
Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.
<<FIGURE>>
Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).
We store the sparse structure that results from pruning using compressed sparse row (CSR) or
compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number
of non-zero elements and n is the number of rows or columns.
To compress further, we store the index difference instead of the absolute position, and encode this
difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger
than the bound, we use the zero padding solution shown in Figure 2: in case when the difference exceeds
8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
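The relative-index scheme with filler zeros can be sketched as an encode/decode pair. The conventions below are assumptions made for the illustration and may differ in detail from Figure 2: gaps are stored as values from 1 to 2^bits (8 for 3 bits), the first gap is measured from position -1, and a filler entry is a stored weight equal to zero placed at the maximum gap. The function names are hypothetical.

```python
def encode_relative(dense, bits=3):
    """Encode a sparse vector as (gap, value) pairs with bounded gaps."""
    max_gap = 2 ** bits
    entries = []
    prev = -1  # position of the previously stored entry
    for idx, value in enumerate(dense):
        if value == 0:
            continue
        while idx - prev > max_gap:
            # Gap too large for the index width: insert a filler zero.
            prev += max_gap
            entries.append((max_gap, 0.0))
        entries.append((idx - prev, value))
        prev = idx
    return entries


def decode_relative(entries, length):
    """Rebuild the dense vector from (gap, value) pairs."""
    dense = [0.0] * length
    pos = -1
    for gap, value in entries:
        pos += gap
        dense[pos] = value
    return dense


example = [0.0] * 16
example[1], example[4], example[15] = 3.4, 0.9, 1.7
encoded = encode_relative(example, bits=3)
```

In this example the jump from position 4 to position 15 exceeds 8, so one filler zero is stored at position 12 and the last non-zero is reached with a gap of 3.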
3 TRAINED QUANTIZATION AND WEIGHT SHARING
Network quantization and weight sharing further compress the pruned network by reducing the
number of bits required to represent each weight. We limit the number of effective weights we need to
store by having multiple connections share the same weight, and then fine-tune those shared weights.
Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4
output neurons, so that the weight is a 4x4 matrix. On the top left is the 4x4 weight matrix, and on the
bottom left is the 4x4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors);
all the weights in the same bin share the same value, thus for each weight, we then need to store only
a small index into a table of shared weights. During update, all the gradients are grouped by the color
and summed together, multiplied by the learning rate and subtracted from the shared centroids from
the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each
CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
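The centroid update described above (group gradients by bin, sum, scale by the learning rate, subtract) can be sketched as follows. The function name and the toy numbers are hypothetical; `indices` holds, for each weight, the bin it was assigned to.

```python
def update_centroids(centroids, indices, grads, lr):
    """One fine-tuning step on the shared weights.

    indices[r][c] is the bin of weight (r, c); grads[r][c] is its gradient.
    """
    summed = [0.0] * len(centroids)
    for index_row, grad_row in zip(indices, grads):
        for k, g in zip(index_row, grad_row):
            summed[k] += g  # gradients of the same bin are accumulated
    return [c - lr * s for c, s in zip(centroids, summed)]


new_centroids = update_centroids(
    centroids=[1.0, 2.0],
    indices=[[0, 1], [1, 0]],
    grads=[[0.1, 0.2], [0.3, -0.1]],
    lr=1.0,
)
```

In this toy case the two gradients of bin 0 cancel, so only the second centroid moves.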
To calculate the compression rate, given k clusters, we only need log_2(k) bits to encode the index. In
general, for a network with n connections and each connection is represented with b bits, constraining
the connections to have only k shared weights will result in a compression rate of:
r = nb / (n log_2(k) + kb) (1)
<<FIGURE>>
Figure 4: Left: Three different methods for centroids initialization. Right: Distribution of weights
(blue) and distribution of codebook before (green cross) and after fine-tuning (red dot).
For example, Figure 3 shows the weights of a single layer neural network with four input units and
four output units. There are 4×4 = 16 weights originally but there are only 4 shared weights: similar
weights are grouped together to share the same value. Originally we need to store 16 weights, each
with 32 bits; now we need to store only 4 effective weights (blue, green, red and orange), each with 32
bits, together with 16 2-bit indices, giving a compression rate of 16×32/(4×32 + 2×16) = 3.2.
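The worked example can be checked with a small helper. It assumes the compression rate is the ratio of the original storage (n weights of b bits) to the shared-weight storage (n indices of log_2(k) bits plus a codebook of k weights of b bits), which is what the counts in the text describe; the function name is hypothetical.

```python
import math


def compression_rate(n, b, k):
    """Original bits divided by bits for indices plus codebook."""
    index_bits = n * math.log2(k)
    codebook_bits = k * b
    return (n * b) / (index_bits + codebook_bits)


rate = compression_rate(n=16, b=32, k=4)  # 512 / (32 + 128)
```

For large n the codebook term becomes negligible and the rate approaches b / log_2(k), for instance 32 / 5 = 6.4 for 32 shared weights.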
3.1 WEIGHT SHARING
We use k-means clustering to identify the shared weights for each layer of a trained network, so that
all the weights that fall into the same cluster share the same weight. Weights are not shared across
layers. We partition n original weights <<FORMULA>> into k clusters <<FORMULA>>,
n >> k, so as to minimize the within-cluster sum of squares (WCSS):
<<FORMULA>> (2)
Unlike HashedNets (Chen et al., 2015), where weight sharing is determined by a hash function
before the network sees any training data, our method determines weight sharing after a network is
fully trained, so that the shared weights approximate the original network.
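As a rough illustration of the clustering step, the sketch below runs plain Lloyd-style k-means on scalar weights and reports the within-cluster sum of squares. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name kmeans_1d, the fixed iteration count, and the toy weights are all hypothetical.

```python
def kmeans_1d(weights, centroids, iters=20):
    """Cluster scalar weights; returns (centroids, assignments, WCSS)."""
    c = list(centroids)

    def nearest(w):
        # Index of the centroid closest to weight w (first one wins ties).
        return min(range(len(c)), key=lambda j: abs(w - c[j]))

    for _ in range(iters):
        idx = [nearest(w) for w in weights]
        for j in range(len(c)):
            members = [w for w, i in zip(weights, idx) if i == j]
            if members:  # an empty cluster keeps its previous centroid
                c[j] = sum(members) / len(members)

    idx = [nearest(w) for w in weights]
    wcss = sum((w - c[i]) ** 2 for w, i in zip(weights, idx))
    return c, idx, wcss


# Toy layer with three obvious groups of weights (illustrative values only).
weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1]
centroids, idx, wcss = kmeans_1d(weights, [-1.0, 0.05, 1.1])
print(idx)  # [0, 0, 1, 1, 2, 2]
```

After clustering, each weight is represented by its cluster index, and the centroids form the table of shared weights.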
3.2 INITIALIZATION OF SHARED WEIGHTS
Centroid initialization impacts the quality of clustering and thus affects the network's prediction
accuracy. We examine three initialization methods: Forgy (random), density-based, and linear
initialization. In Figure 4 we plot the original weight distribution of the conv3 layer in AlexNet
(CDF in blue, PDF in red). The weights form a bimodal distribution after network pruning. The
bottom of the figure plots the effective weights (centroids) with 3 different initialization methods (shown in blue,
red and yellow). In this example, there are 13 clusters.
Forgy (random) initialization randomly chooses k observations from the data set and uses these as
the initial centroids. The initialized centroids are shown in yellow. Since there are two peaks in the
bimodal distribution, the Forgy centroids tend to concentrate around those two peaks.
Density-based initialization linearly spaces the CDF of the weights in the y-axis, then finds the
horizontal intersection with the CDF, and finally finds the vertical intersection on the x-axis, which
becomes a centroid, as shown by the blue dots. This method makes the centroids denser around the two
peaks, but more scattered than the Forgy method.
Linear initialization linearly spaces the centroids between the [min, max] of the original weights.
This initialization method is invariant to the distribution of the weights and is the most scattered
compared with the former two methods.
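The three initialization schemes described above can be sketched as follows. This is a hedged illustration rather than the paper's code: the function names are hypothetical, the density-based version approximates the CDF construction with empirical quantiles at equally spaced mid-point levels (the exact spacing is an assumption), and the toy weights are made up to show one large outlier among many small weights.

```python
import random


def forgy_init(weights, k, seed=0):
    """Forgy (random): pick k observations as initial centroids."""
    return sorted(random.Random(seed).sample(list(weights), k))


def density_init(weights, k):
    """Density-based: equally spaced CDF levels mapped back to weight values."""
    w = sorted(weights)
    n = len(w)
    # Assumed levels (j + 0.5) / k; each level is read off the empirical CDF.
    return [w[min(n - 1, int((j + 0.5) / k * n))] for j in range(k)]


def linear_init(weights, k):
    """Linear: equally spaced centroids between min and max weight."""
    lo, hi = min(weights), max(weights)
    if k == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (k - 1)
    return [lo + j * step for j in range(k)]


# 15 small weights and a single large one (illustrative values only).
weights = [(i - 7) * 0.01 for i in range(15)] + [1.0]
forgy = forgy_init(weights, 4)
density = density_init(weights, 4)
linear = linear_init(weights, 4)
print(max(density), max(linear))
```

On this toy input the density-based centroids all stay among the small weights, while linear initialization places a centroid at the largest weight.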
Larger weights play a more important role than smaller weights (Han et al., 2015), but there are fewer
of these large weights. Thus for both Forgy initialization and density-based initialization, very few
centroids have a large absolute value, which results in poor representation of these few large weights.
Linear initialization does not suffer from this problem. The experiment section compares the accuracy
of different initialization methods after clustering and fine-tuning, showing that linear initialization
works best.
<<FIGURE>>
Figure 5: Distribution for weight (Left) and index (Right). The distribution is biased.
3.3 FEED-FORWARD AND BACK-PROPAGATION
The centroids of the one-dimensional k-means clustering are the shared weights. There is one level
of indirection during the feed-forward and back-propagation phases: looking up the weight table.
An index into the shared weight table is stored for each connection. During back-propagation, the
gradient for each shared weight is calculated and used to update the shared weight. This procedure is
shown in Figure 3.
We denote the loss by L, the weight in the ith column and jth row by W_ij, the centroid index of
element W_ij by I_ij, and the kth centroid of the layer by C_k. Using the indicator function 1(.), the
gradient of the centroids is calculated as:
<<FORMULA>> (3)
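The centroid gradient can be read as a group-by-index sum: every weight gradient is added to the bucket of the centroid that the weight points to. The sketch below is a minimal illustration under that reading, assuming the elided formula sums dL/dW_ij over all (i, j) with I_ij = k; the function names and the plain gradient-descent update are hypothetical choices for illustration.

```python
def centroid_gradients(grad_w, index, k):
    """Sum weight gradients per centroid: grad_c[k] = sum of grad_w where index == k."""
    grad_c = [0.0] * k
    for grad_row, index_row in zip(grad_w, index):
        for g, i in zip(grad_row, index_row):
            grad_c[i] += g  # indicator 1(I_ij = k) selects the bucket
    return grad_c


def update_centroids(centroids, grad_c, lr):
    """One gradient-descent step on the shared weights."""
    return [c - lr * g for c, g in zip(centroids, grad_c)]


# 2x2 layer with two shared weights (illustrative values only).
grad_w = [[1.0, -2.0],
          [3.0, 4.0]]
index = [[0, 1],
         [1, 0]]
grad_c = centroid_gradients(grad_w, index, k=2)
new_centroids = update_centroids([0.5, -0.5], grad_c, lr=0.1)
print(grad_c)  # [5.0, 1.0]
```

During the forward pass the same index matrix is used in the other direction: each weight is looked up as centroids[index[i][j]].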
4 HUFFMAN CODING
A Huffman code is an optimal prefix code commonly used for lossless data compression (Van Leeuwen,
1976). It uses variable-length codewords to encode source symbols. The code table is derived from the
occurrence probability of each symbol. More common symbols are represented with fewer bits.
Figure 5 shows the probability distribution of quantized weights and the sparse matrix index of the
last fully connected layer in AlexNet. Both distributions are biased: most of the quantized weights are
distributed around the two peaks; the sparse matrix index differences are rarely above 20. Experiments
show that Huffman coding these non-uniformly distributed values saves 20% to 30% of network
storage.
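To make the saving from a biased distribution concrete, the sketch below builds Huffman code lengths for a small stream of quantization indices and compares the total with fixed-length coding. It is an illustrative sketch only: the helper name huffman_code_lengths and the toy index stream are hypothetical and do not come from the paper's experiments.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Biased stream of 2-bit cluster indices (illustrative values only).
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
print(huffman_bits, fixed_bits)  # 28 32
```

The more skewed the index distribution, the larger the gap between the two totals; a uniform distribution over four symbols would give no saving.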
5 EXPERIMENTS
We pruned, quantized, and Huffman encoded four networks: two on MNIST and two on ImageNet
datasets. The network parameters and accuracy [1] before and after pruning are shown in Table 1. The
compression pipeline reduces network storage by 35x to 49x across different networks without loss
of accuracy. The total size of AlexNet decreased from 240MB to 6.9MB, which is small enough to
be put into on-chip SRAM, eliminating the need to store the model in energy-consuming DRAM
memory.
Training is performed with the Caffe framework (Jia et al., 2014). Pruning is implemented by adding
a mask to the blobs to mask out the update of the pruned connections. Quantization and weight
sharing are implemented by maintaining a codebook structure that stores the shared weights, and by
grouping gradients by index after calculating the gradient of each layer. Each shared weight is updated with all
the gradients that fall into that bucket. Huffman coding doesn't require training and is implemented
offline after all the fine-tuning is finished.
5.1 LENET-300-100 AND LENET-5 ON MNIST
We first experimented on the MNIST dataset with the LeNet-300-100 and LeNet-5 networks (LeCun et al.,
1998). LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100
neurons each, which achieves a 1.6% error rate on MNIST. LeNet-5 is a convolutional network that
has two convolutional layers and two fully connected layers, which achieves a 0.8% error rate on
MNIST. Table 2 and Table 3 show the statistics of the compression pipeline. The compression rate
includes the overhead of the codebook and sparse indexes. Most of the saving comes from pruning
and quantization (compressed 32x), while Huffman coding gives a marginal gain (compressed 40x).
[1] Reference model is from the Caffe model zoo; accuracy is measured without data augmentation.
Table 1: The compression pipeline can save 35x to 49x parameter storage with no loss of accuracy.
<<TABLE>>
Table 2: Compression statistics for LeNet-300-100. P: pruning, Q: quantization, H: Huffman coding.
<<TABLE>>
Table 3: Compression statistics for LeNet-5. P: pruning, Q: quantization, H: Huffman coding.
<<TABLE>>
5.2 ALEXNET ON IMAGENET
We further examine the performance of Deep Compression on the ImageNet ILSVRC-2012 dataset,
which has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as
the reference model, which has 61 million parameters and achieved a top-1 accuracy of 57.2% and a
top-5 accuracy of 80.3%. Table 4 shows that AlexNet can be compressed to 2.88% of its original size
without impacting accuracy. There are 256 shared weights in each CONV layer, which are encoded
with 8 bits, and 32 shared weights in each FC layer, which are encoded with only 5 bits. The relative
sparse index is encoded with 4 bits. Huffman coding compressed an additional 22%, resulting in 35x
compression in total.
5.3 VGG-16 ON IMAGENET
With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16
(Simonyan & Zisserman, 2014), on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional
layers but still only three fully-connected layers. Following a similar methodology, we aggressively
compressed both convolutional and fully-connected layers to realize a significant reduction in the
number of effective weights, shown in Table 5.
The VGG-16 network as a whole has been compressed by 49x. Weights in the CONV layers are
represented with 8 bits, and FC layers use 5 bits, which does not impact the accuracy. The two largest
fully-connected layers can each be pruned to less than 1.6% of their original size. This reduction
is critical for real-time image processing, where there is little reuse of these layers across images
(unlike batch processing). This is also critical for fast object detection algorithms where one CONV
pass is used by many FC passes. The reduced layers will fit in an on-chip SRAM and have modest
bandwidth requirements. Without the reduction, the bandwidth requirements are prohibitive.
Table 4: Compression statistics for AlexNet. P: pruning, Q: quantization, H: Huffman coding.
<<TABLE>>
Table 5: Compression statistics for VGG-16. P: pruning, Q: quantization, H: Huffman coding.
<<TABLE>>
6 DISCUSSIONS
6.1 PRUNING AND QUANTIZATION WORKING TOGETHER
Figure 6 shows the accuracy at different compression rates for pruning and quantization together
or individually. When working individually, as shown in the purple and yellow lines, the accuracy of the
pruned network begins to drop significantly when compressed below 8% of its original size; the accuracy
of the quantized network also begins to drop significantly when compressed below 8% of its original
size. But when combined, as shown in the red line, the network can be compressed to 3% of its original
size with no loss of accuracy. The far right side shows the result of SVD for comparison, which is inexpensive
but has a poor compression rate.
The three plots in Figure 7 show how accuracy drops with fewer bits per connection for CONV layers
(left), FC layers (middle) and all layers (right). Each plot reports both top-1 and top-5 accuracy.
Dashed lines applied quantization only, without pruning; solid lines applied both quantization and
pruning. There is very little difference between the two. This shows that pruning works well with
quantization.
Quantization works well on the pruned network because unpruned AlexNet has 60 million weights to
quantize, while pruned AlexNet has only 6.7 million weights to quantize. Given the same number of
centroids, the latter has less error.
<<FIGURE>>
Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and
quantization work best when combined.
<<FIGURE>>
Figure 7: Pruning doesn't hurt quantization. Dashed: quantization on unpruned network. Solid:
quantization on pruned network. Accuracy begins to drop at the same number of quantization bits
whether or not the network has been pruned. Although pruning reduced the number of parameters,
quantization still works as well as in the unpruned network, or even better (3-bit case in the left figure).
<<FIGURE>>
Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy.
Linear initialization gives the best result.
The first two plots in Figure 7 show that CONV layers require more bits of precision than FC layers.
For CONV layers, accuracy drops significantly below 4 bits, while the FC layers are more robust: not until
2 bits does the accuracy drop significantly.
6.2 CENTROID INITIALIZATION
Figure 8 compares the accuracy of the three different initialization methods with respect to top-1
accuracy (Left) and top-5 accuracy (Right). The network is quantized to 2~8 bits as shown on the
x-axis. Linear initialization outperforms density-based initialization and random initialization in all
cases except at 3 bits.
The initial centroids of linear initialization spread equally across the x-axis, from the min value to the
max value. That helps to maintain the large weights, as the large weights play a more important role
than smaller ones, which is also shown in network pruning (Han et al., 2015). Neither random nor
density-based initialization retains large centroids. With these initialization methods, large weights are
clustered to the small centroids because there are few large weights. In contrast, linear initialization
allows large weights a better chance to form a large centroid.
<<FIGURE>>
Figure 9: Compared with the original network, the pruned network layer achieved a 3x speedup on CPU,
3.5x on GPU and 4.2x on mobile GPU on average. Batch size = 1, targeting real-time processing.
Performance numbers are normalized to CPU.
<<FIGURE>>
Figure 10: Compared with the original network, the pruned network layer takes 7x less energy on CPU,
3.3x less on GPU and 4.2x less on mobile GPU on average. Batch size = 1, targeting real-time
processing. Energy numbers are normalized to CPU.
6.3 SPEEDUP AND ENERGY EFFICIENCY
Deep Compression targets extremely latency-focused applications running on mobile, which
require real-time inference, such as pedestrian detection on an embedded processor inside an
autonomous vehicle. Waiting for a batch to assemble significantly adds latency. So when benchmarking
the performance and energy efficiency, we consider the case when batch size = 1. The cases
of batching are given in Appendix A.
Fully connected layers dominate the model size (more than 90%) and are compressed the most by
Deep Compression (96% of weights pruned in VGG-16). In state-of-the-art object detection algorithms
such as fast R-CNN (Girshick, 2015), up to 38% of computation time is consumed by FC layers on the
uncompressed model. So it's interesting to benchmark on FC layers, to see the effect of Deep
Compression on performance and energy. Thus we set up our benchmark on the FC6, FC7, FC8 layers of
AlexNet and VGG-16. In the non-batched case, the activation matrix is a vector with just one column,
so the computation boils down to dense / sparse matrix-vector multiplication for the original / pruned
model, respectively. Since current BLAS libraries on CPU and GPU don't support indirect look-up
and relative indexing, we didn't benchmark the quantized model.
We compare three different off-the-shelf hardware platforms: the NVIDIA GeForce GTX Titan X and the Intel
Core i7 5930K as desktop processors (same package as NVIDIA Digits Dev Box) and the NVIDIA Tegra
K1 as the mobile processor. To run the benchmark on GPU, we used cuBLAS GEMV for the original
dense layer. For the pruned sparse layer, we stored the sparse matrix in CSR format and used the
cuSPARSE CSRMV kernel, which is optimized for sparse matrix-vector multiplication on GPU. To
run the benchmark on CPU, we used MKL CBLAS GEMV for the original dense model and MKL
SPBLAS CSRMV for the pruned sparse model.
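For readers unfamiliar with the CSR format, the sketch below shows in plain Python what a CSR matrix-vector product computes. It is only an illustration of the data layout: it does not call or mimic the cuSPARSE or MKL APIs, the function names are hypothetical, and it uses absolute column indices rather than the relative indexing discussed elsewhere in the paper.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix to (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)    # non-zero value
                col_idx.append(j)   # its column
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x touching only the stored non-zeros of A."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


# Small pruned layer (illustrative values only).
dense = [[0, 2, 0],
         [1, 0, 3],
         [0, 0, 0]]
values, col_idx, row_ptr = dense_to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)  # [4.0, 10.0, 0.0]
```

Only the three stored non-zeros are read, which is why a heavily pruned layer moves far less data than its dense counterpart.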
To compare power consumption between different systems, it is important to measure power in a
consistent manner (NVIDIA, b). For our analysis, we compare the pre-regulation power of the
entire application processor (AP) / SOC and DRAM combined. On CPU, the benchmark runs on a
single socket with a single Haswell-E class Core i7-5930K processor. CPU socket and DRAM power
are as reported by the pcm-power utility provided by Intel. For GPU, we used the nvidia-smi
utility to report the power of the Titan X. For mobile GPU, we used a Jetson TK1 development board and
measured the total power consumption with a power meter. We assume 15% AC to DC conversion
loss, 85% regulator efficiency and 15% power consumed by peripheral components (NVIDIA, a) to
report the AP+DRAM power for Tegra K1.
Table 6: Accuracy of AlexNet with different aggressiveness of weight sharing and quantization. 8/5
bit quantization has no loss of accuracy; 8/4 bit quantization, which is more hardware friendly, has a
negligible accuracy loss of 0.01%; the most aggressive setting, 4/2 bit quantization, resulted in 1.99%
and 2.60% loss of accuracy.
<<TABLE>>
The ratio of memory access to computation differs with and without batching.
When the input activations are batched into a matrix, the computation becomes matrix-matrix
multiplication, where locality can be improved by blocking. The matrix can be blocked to fit in caches and
reused efficiently. In this case, the amount of memory access is O(n^2) and that of computation is
O(n^3), so the ratio between memory access and computation is on the order of 1/n.
In real-time processing, when batching is not allowed, the input activation is a single vector and the
computation is matrix-vector multiplication. In this case, the amount of memory access is O(n^2) and
the computation is O(n^2), so memory access and computation are of the same magnitude (as opposed
to 1/n). This indicates that MV is more memory-bound than MM, so reducing the memory footprint
is critical for the non-batching case.
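The two ratios can be reproduced with a simple counting model. The sketch below is a back-of-the-envelope illustration, not a measurement: the function name is hypothetical, and it assumes an n x n weight matrix whose entries are each fetched once (ideal blocking) and ignores the traffic for activations and outputs.

```python
def access_to_compute_ratio(n, batch):
    """Weight fetches per multiply-add for an n x n layer on `batch` inputs."""
    memory_access = n * n          # each weight read once (assumed ideal reuse)
    computation = n * n * batch    # one multiply-add per weight per input
    return memory_access / computation


n = 4096
mv = access_to_compute_ratio(n, batch=1)  # matrix-vector, no batching
mm = access_to_compute_ratio(n, batch=n)  # matrix-matrix with n inputs
print(mv, mm)
```

Under these assumptions the matrix-vector case performs one weight fetch per multiply-add, while the batched case amortizes each fetch over n multiply-adds.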
Figure 9 illustrates the speedup of pruning on different hardware. There are 6 columns for each
benchmark, showing the computation time of CPU / GPU / TK1 on the dense / pruned network. Time is
normalized to CPU. When batch size = 1, the pruned network layer obtained a 3x to 4x speedup over the
dense network on average because it has a smaller memory footprint and alleviates the data transfer
overhead, especially for large matrices that are unable to fit into the caches. For example, VGG-16's
FC6 layer, the largest layer in our experiment, contains 400MB of data, which is far beyond the capacity of the L3 cache.
In latency-tolerant applications, batching improves memory locality, since weights can
be blocked and reused in matrix-matrix multiplication. In this scenario, the pruned network no longer
shows its advantage. We give detailed timing results in Appendix A.
Figure 10 illustrates the energy efficiency of pruning on different hardware. We multiply power
consumption by computation time to get energy consumption, then normalize to CPU to get
energy efficiency. When batch size = 1, the pruned network layer consumes 3x to 7x less energy than
the dense network on average. As reported by nvidia-smi, GPU utilization is 99% for both dense
and sparse cases.
6.4 RATIO OF WEIGHTS, INDEX AND CODEBOOK
Pruning makes the weight matrix sparse, so extra space is needed to store the indexes of non-zero
elements. Quantization adds storage for a codebook. The experiment section already accounts for
these two factors. Figure 11 shows the breakdown of the three components when quantizing
four networks. Since on average both the weights and the sparse indexes are encoded with 5 bits,
their storage is roughly half and half. The overhead of the codebook is very small and often negligible.
<<FIGURE>>
Figure 11: Storage ratio of weight, index and codebook.
Table 7: Comparison with other compression methods on AlexNet. Collins & Kohli (2014) reduced
the parameters by 4x with inferior accuracy. Deep Fried Convnets (Yang et al., 2014) worked
on fully connected layers and reduced the parameters by less than 4x. SVD saves parameters but
suffers from a large accuracy loss of as much as 2%. Network pruning (Han et al., 2015) reduced the
parameters by 9x, not including index overhead. On other networks similar to AlexNet, Denton
et al. (2014) exploited the linear structure of convnets and compressed the network by 2.4x to 13.4x
layer-wise, with 0.9% accuracy loss on compressing a single layer. Gong et al. (2014) experimented
with vector quantization and compressed the network by 16x to 24x, incurring 1% accuracy loss.
<<TABLE>>
7 RELATED WORK
Neural networks are typically over-parametrized, and there is significant redundancy for deep learning
models (Denil et al., 2013). This results in a waste of both computation and memory usage. There
have been various proposals to remove the redundancy: Vanhoucke et al. (2011) explored a fixed-point
implementation with 8-bit integer (vs 32-bit floating point) activations. Hwang & Sung
(2014) proposed an optimization method for the fixed-point network with ternary weights and 3-bit
activations. Anwar et al. (2015) quantized the neural network using L2 error minimization and
achieved better accuracy on MNIST and CIFAR-10 datasets. Denton et al. (2014) exploited the linear
structure of the neural network by finding an appropriate low-rank approximation of the parameters
and keeping the accuracy within 1% of the original model.
The empirical success in this paper is consistent with the theoretical study of random-like sparse
networks with +1/0/-1 weights (Arora et al., 2014), which have been proved to enjoy nice properties
(e.g. reversibility), and to allow a provably polynomial time algorithm for training.
Much work has been focused on binning the network parameters into buckets, so that only the values in
the buckets need to be stored. HashedNets (Chen et al., 2015) reduce model sizes by using a hash
function to randomly group connection weights, so that all connections within the same hash bucket
share a single parameter value. In their method, the weight binning is pre-determined by the hash
function, instead of being learned through training, which doesn't capture the nature of images. Gong
et al. (2014) compressed deep convnets using vector quantization, which resulted in 1% accuracy
loss. Both methods studied only the fully connected layer, ignoring the convolutional layers.
There have been other attempts to reduce the number of parameters of neural networks by replacing
the fully connected layer with global average pooling. The Network in Network architecture (Lin et al.,
2013) and GoogLeNet (Szegedy et al., 2014) achieve state-of-the-art results on several benchmarks by
adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset
and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with
this approach. This problem is noted by Szegedy et al. (2014) and motivates them to add a linear
layer on the top of their networks to enable transfer learning.
Network pruning has been used both to reduce network complexity and to reduce over-fitting. An
early approach to pruning was biased weight decay (Hanson & Pratt, 1989). Optimal Brain Damage
(LeCun et al., 1989) and Optimal Brain Surgeon (Hassibi et al., 1993) prune networks to reduce
the number of connections based on the Hessian of the loss function and suggest that such pruning
is more accurate than magnitude-based pruning such as weight decay. A recent work (Han et al.,
2015) successfully pruned several state-of-the-art large-scale networks and showed that the number of
parameters could be reduced by an order of magnitude. There are also attempts to reduce the number
of activations for both compression and acceleration (Van Nguyen et al., 2015).
8 FUTURE WORK
While the pruned network has been benchmarked on various hardware, the quantized network with
weight sharing has not, because off-the-shelf cuSPARSE or MKL SPBLAS libraries do not support
indirect matrix entry lookup, nor is the relative index in CSC or CSR format supported. So the full
advantage of Deep Compression, fitting the model in cache, has not been fully demonstrated. A software solution
is to write customized GPU kernels that support this. A hardware solution is to build a custom ASIC
architecture specialized to traverse the sparse and quantized network structure, which also supports
customized quantization bit width. We expect this architecture to have energy dominated by on-chip
SRAM access instead of off-chip DRAM access.
9 CONCLUSION
We have presented “Deep Compression”, which compresses neural networks without affecting accuracy.
Our method operates by pruning the unimportant connections, quantizing the network using weight
sharing, and then applying Huffman coding. We highlight our experiments on AlexNet, which
reduced the weight storage by 35x without loss of accuracy. We show similar results for VGG-16
and LeNet networks, compressed by 49x and 39x without loss of accuracy. This leads to a smaller
storage requirement for putting convnets into mobile apps. After Deep Compression the size of these
networks fits into on-chip SRAM cache (5pJ/access) rather than requiring off-chip DRAM memory
(640pJ/access). This potentially makes deep neural networks more energy efficient to run on mobile.
Our compression method also facilitates the use of complex neural networks in mobile applications
where application size and download bandwidth are constrained.
REFERENCES
Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convolutional
neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP),
2015 IEEE International Conference on, pp. 1131-1135. IEEE, 2015.
Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. Provable bounds for learning some
deep representations. In Proceedings of the 31st International Conference on Machine Learning,
ICML 2014, pp. 584-592, 2014.
BVLC. Caffe model zoo. URL http://caffe.berkeleyvision.org/model_zoo.
Chen, Wenlin, Wilson, James T., Tyree, Stephen, Weinberger, Kilian Q., and Chen, Yixin. Compressing
neural networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.
Collins, Maxwell D and Kohli, Pushmeet. Memory bounded deep convolutional networks. arXiv
preprint arXiv:1412.1442, 2014.
Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep
learning. In Advances in Neural Information Processing Systems, pp. 2148-2156, 2013.
Denton, Emily L, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear
structure within convolutional networks for efficient evaluation. In Advances in Neural Information
Processing Systems, pp. 1269-1277, 2014.
Girshick, Ross. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional
networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
Han, Song, Pool, Jeff, Tran, John, and Dally, William J. Learning both weights and connections for
efficient neural networks. In Advances in Neural Information Processing Systems, 2015.
Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A, and Dally,
William J. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint
arXiv:1602.01528, 2016.
Hanson, Stephen José and Pratt, Lorien Y. Comparing biases for minimal network construction with
back-propagation. In Advances in Neural Information Processing Systems, pp. 177-185, 1989.
Hassibi, Babak, Stork, David G, et al. Second order derivatives for network pruning: Optimal brain
surgeon. Advances in Neural Information Processing Systems, pp. 164-164, 1993.
Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using
weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1-6.
IEEE, 2014.
Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross,
Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature
embedding. arXiv preprint arXiv:1408.5093, 2014.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep
convolutional neural networks. In NIPS, pp. 1097-1105, 2012.
LeCun, Yann, Denker, John S, Solla, Sara A, Howard, Richard E, and Jackel, Lawrence D. Optimal
brain damage. In NIPS, volume 89, 1989.
LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv:1312.4400, 2013.
NVIDIA. Technical brief: NVIDIA Jetson TK1 development kit bringing GPU-accelerated computing
to embedded systems, a. URL http://www.nvidia.com.
NVIDIA. Whitepaper: GPU-based deep learning inference: A performance and power analysis, b.
URL http://www.nvidia.com/object/white-papers.html.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Ström, Nikko. Phoneme probability estimation with dynamic sparsely connected artificial neural
networks. The Free Speech Journal, 1(5):1-41, 1997.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir,
Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions.
arXiv preprint arXiv:1409.4842, 2014.
Van Leeuwen, Jan. On the construction of Huffman trees. In ICALP, pp. 382-410, 1976.
Van Nguyen, Hien, Zhou, Kevin, and Vemulapalli, Raviteja. Cross-domain synthesis of medical
images using efficient location-sensitive deep network. In Medical Image Computing and
Computer-Assisted Intervention, MICCAI 2015, pp. 677-684. Springer, 2015.
Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on
CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and
Wang, Ziyu. Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.
A APPENDIX: DETAILED TIMING / POWER REPORTS OF DENSE & SPARSE
NETWORK LAYERS
Table 8: Average time on different layers. To avoid variance, we measured the time spent on each
layer for 4096 input samples, and averaged the time over the input samples. For GPU, the time
consumed by cudaMalloc and cudaMemcpy is not counted. For batch size = 1, gemv is used;
for batch size = 64, gemm is used. For the sparse case, csrmv and csrmm are used, respectively.
<<TABLE>>
Table 9: Power consumption of different layers. We measured the Titan X GPU power with
nvidia-smi, Core i7-5930K CPU power with pcm-power and Tegra K1 mobile GPU power with
an external power meter (scaled to AP+DRAM, see paper discussion). During power measurement,
we repeated each computation multiple times in order to get stable numbers. On CPU, dense matrix
multiplications consume 2x the energy of sparse ones because they are accelerated with multi-threading.
<<TABLE>>
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
DEEP DOUBLE DESCENT: WHERE BIGGER MODELS AND MORE DATA HURT
Preetum Nakkiran Gal Kaplun† Yamini Bansal† Tristan Yang
Harvard University Harvard University Harvard University Harvard University
Boaz Barak Ilya Sutskever
Harvard University OpenAI
ABSTRACT
We show that a variety of modern deep learning tasks exhibit a “double-descent”
phenomenon where, as we increase model size, performance first gets worse and
then gets better. Moreover, we show that double descent occurs not just as a
function of model size, but also as a function of the number of training epochs.
We unify the above phenomena by defining a new complexity measure we call
the effective model complexity and conjecture a generalized double descent with
respect to this measure. Furthermore, our notion of model complexity allows us to
identify certain regimes where increasing (even quadrupling) the number of train
samples actually hurts test performance.
1 INTRODUCTION
<<FIGURE>>
Figure 1: Left: Train and test error as a function of model size, for ResNet18s of varying width
on CIFAR-10 with 15% label noise. Right: Test error, shown for varying train epochs. All models
trained using Adam for 4K epochs. The largest model (width 64) corresponds to standard ResNet18.
The bias-variance trade-off is a fundamental concept in classical statistical learning theory (e.g.,
Hastie et al. (2005)). The idea is that models of higher complexity have lower bias but higher vari-
ance. According to this theory, once model complexity passes a certain threshold, models “overfit”
with the variance term dominating the test error, and hence from this point onward, increasing model
complexity will only decrease performance (i.e., increase test error). Hence conventional wisdom
in classical statistics is that, once we pass a certain threshold, “larger models are worse.”
However, modern neural networks exhibit no such phenomenon. Such networks have millions of
parameters, more than enough to fit even random labels (Zhang et al. (2016)), and yet they perform
much better on many tasks than smaller models. Indeed, conventional wisdom among practitioners
is that “larger models are better” (Krizhevsky et al. (2012), Huang et al. (2018), Szegedy et al.
(2015), Radford et al. (2019)). The effect of training time on test performance is also up for debate.
In some settings, “early stopping” improves test performance, while in other settings training neural
networks to zero training error only improves performance. Finally, if there is one thing both
classical statisticians and deep learning practitioners agree on, it is that “more data is always better”.
<<FIGURE>>
Figure 2: Left: Test error as a function of model size and train epochs. The horizontal line corresponds
to model-wise double descent: varying model size while training for as long as possible. The
vertical line corresponds to epoch-wise double descent, with test error undergoing double descent
as train time increases. Right: Train error of the corresponding models. All models are ResNet18s
trained on CIFAR-10 with 15% label noise, data augmentation, and Adam for up to 4K epochs.
In this paper, we present empirical evidence that both reconcile and challenge some of the above
“conventional wisdoms.” We show that many deep learning settings have two different regimes.
In the under-parameterized regime, where the model complexity is small compared to the number
of samples, the test error as a function of model complexity follows the U-like behavior predicted
by the classical bias/variance tradeoff. However, once model complexity is sufficiently large to
interpolate, i.e., achieve (close to) zero training error, then increasing complexity only decreases test
error, following the modern intuition of “bigger models are better”. Similar behavior was previously
observed in Opper (1995; 2001), Advani & Saxe (2017), Spigler et al. (2018), and Geiger et al.
(2019b). This phenomenon was first postulated in generality by Belkin et al. (2018) who named
it “double descent”, and demonstrated it for decision trees, random features, and 2-layer neural
networks with ℓ2 loss, on a variety of learning tasks including MNIST and CIFAR-10.
Main contributions. We show that double descent is a robust phenomenon that occurs in a variety
of tasks, architectures, and optimization methods (see Figure 1 and Section 5; our experiments are
summarized in Table A). Moreover, we propose a much more general notion of “double descent”
that goes beyond varying the number of parameters. We define the effective model complexity (EMC)
of a training procedure as the maximum number of samples on which it can achieve close to zero
training error. The EMC depends not just on the data distribution and the architecture of the classifier
but also on the training procedure—and in particular increasing training time will increase the EMC.
We hypothesize that for many natural models and learning algorithms, double descent occurs as a
function of the EMC. Indeed we observe “epoch-wise double descent” when we keep the model fixed
and increase the training time, with performance following a classical U-like curve in the underfitting
stage (when the EMC is smaller than the number of samples) and then improving with training time
once the EMC is sufficiently larger than the number of samples (see Figure 2). As a corollary, early
stopping only helps in the relatively narrow parameter regime of critically parameterized models.
Sample non-monotonicity. Finally, our results shed light on test performance as a function of
the number of train samples. Since the test error peaks around the point where EMC matches the
number of samples (the transition from the under- to over-parameterization), increasing the number
of samples has the effect of shifting this peak to the right. While in most settings increasing the
number of samples decreases error, this shifting effect can sometimes result in a setting where more
data is worse! For example, Figure 3 demonstrates cases in which increasing the number of samples
by a factor of 4.5 results in worse test performance.
<<FIGURE>>
Figure 3: Test loss (per-token perplexity) as a function of Transformer model size (embedding
dimension d_model) on language translation (IWSLT14 German-to-English). The curve for 18k samples
is generally lower than the one for 4k samples, but also shifted to the right, since fitting 18k
samples requires a larger model. Thus, for some models, the performance for 18k samples is worse
than for 4k samples.
2 OUR RESULTS
To state our hypothesis more precisely, we define the notion of effective model complexity. We define
a training procedure T to be any procedure that takes as input a set <<FORMULA>>
of labeled training samples and outputs a classifier <<T(S)>> mapping data to labels. We define the
effective model complexity of T (w.r.t. distribution D) to be the maximum number of samples n on
which T achieves on average <<FORMULA>> training error.
Definition 1 (Effective Model Complexity) The Effective Model Complexity (EMC) of a training
procedure T, with respect to distribution D and parameter <<FORMULA>>, is defined as:
<<FORMULA>>
where Error <<S(M)>> is the mean error of model M on train samples S.
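As a rough illustration of Definition 1, the sketch below estimates the EMC of a toy training procedure by scanning a grid of sample sizes and keeping the largest n whose average train error stays within the tolerance. Everything in it is an assumption made for illustration and not the experimental code of this paper: the helper names (estimate_emc, make_perceptron_trainer, draw_random_labels), the argument eps standing in for the tolerance parameter of the definition, the perceptron used as the training procedure T, and the randomly labeled Gaussian data used as the distribution D.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], int]            # (features, label in {-1, +1})
Classifier = Callable[[Tuple[float, ...]], int]
Trainer = Callable[[Sequence[Sample]], Classifier]


def train_error(model: Classifier, samples: Sequence[Sample]) -> float:
    """Fraction of the train samples that the model misclassifies."""
    return sum(1 for x, y in samples if model(x) != y) / len(samples)


def estimate_emc(train: Trainer,
                 draw: Callable[[int, random.Random], List[Sample]],
                 eps: float,
                 n_grid: Sequence[int],
                 trials: int = 5,
                 seed: int = 0) -> int:
    """Largest n in n_grid whose train error, averaged over `trials`
    independent draws of n samples, is at most eps (0 if none qualifies)."""
    best = 0
    for n in n_grid:
        total = 0.0
        for t in range(trials):
            # Seed depends only on (seed, n, t), so different trainers
            # are compared on exactly the same datasets.
            rng = random.Random(seed * 1_000_000 + n * 1_000 + t)
            samples = draw(n, rng)
            total += train_error(train(samples), samples)
        if total / trials <= eps:
            best = max(best, n)
    return best


def make_perceptron_trainer(max_epochs: int) -> Trainer:
    """Toy training procedure (an assumption, not from the paper):
    a perceptron run for at most max_epochs passes over the data."""
    def train(samples: Sequence[Sample]) -> Classifier:
        w = [0.0] * len(samples[0][0])
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in samples:
                score = sum(wi * xi for wi, xi in zip(w, x))
                if y * score <= 0:                # misclassified: update
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    mistakes += 1
            if mistakes == 0:                     # train set fitted exactly
                break
        weights = tuple(w)
        return lambda x: 1 if sum(wi * xi for wi, xi in zip(weights, x)) > 0 else -1
    return train


def draw_random_labels(dim: int) -> Callable[[int, random.Random], List[Sample]]:
    """Toy distribution: Gaussian features with uniformly random labels."""
    def draw(n: int, rng: random.Random) -> List[Sample]:
        return [(tuple(rng.gauss(0.0, 1.0) for _ in range(dim)), rng.choice((-1, 1)))
                for _ in range(n)]
    return draw


if __name__ == "__main__":
    grid = [1, 2, 4, 8, 16, 200]
    for epochs in (1, 50):
        emc = estimate_emc(make_perceptron_trainer(epochs), draw_random_labels(5),
                           eps=0.0, n_grid=grid)
        print(f"max_epochs={epochs}: estimated EMC on this grid = {emc}")
```

Because the estimate depends on the training procedure and not only on the model family, running the same perceptron for more epochs on the same draws cannot lower the estimate at eps = 0 in this toy setting.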
Our main hypothesis can be informally stated as follows:
Hypothesis 1 (Generalized Double Descent hypothesis, informal) For any natural data distribution
D, neural-network-based training procedure T, and small <<FORMULA>>, if we consider the task of
predicting labels based on n samples from D then:
Under-parameterized regime. If <<FORMULA>> is sufficiently smaller than n, any perturbation of T
that increases its effective complexity will decrease the test error.
Over-parameterized regime. If <<FORMULA>> is sufficiently larger than n, any perturbation of T
that increases its effective complexity will decrease the test error.
Critically parameterized regime. If <<FORMULA>>, then a perturbation of T that increases its
effective complexity might decrease or increase the test error.
Hypothesis 1 is informal in several ways. We do not have a principled way to choose the parameter
<<FORMULA>> (and currently heuristically use <<FORMULA>>). We also do not yet have a formal specification for
“sufficiently smaller” and “sufficiently larger”. Our experiments suggest that there is a critical
interval around the interpolation threshold when <<FORMULA>>: below and above this interval
increasing complexity helps performance, while within this interval it may hurt performance. The
width of the critical interval depends on both the distribution and the training procedure in ways we
do not yet completely understand.
We believe Hypothesis 1 sheds light on the interaction between optimization algorithms, model size,
and test performance and helps reconcile some of the competing intuitions about them. The main
result of this paper is an experimental validation of Hypothesis 1 under a variety of settings, where
we considered several natural choices of datasets, architectures, and optimization algorithms, and
we changed the “interpolation threshold” by varying the number of model parameters, the length of
training, the amount of label noise in the distribution, and the number of train samples.
Model-wise Double Descent. In Section 5, we study the test error of models of increasing size,
for a fixed large number of optimization steps. We show that “model-wise double-descent” occurs
for various modern datasets (CIFAR-10, CIFAR-100, IWSLT14 de-en, with varying amounts of
label noise), model architectures (CNNs, ResNets, Transformers), optimizers (SGD, Adam), number
of train samples, and training procedures (data-augmentation, and regularization). Moreover, the
peak in test error systematically occurs at the interpolation threshold. In particular, we demonstrate
realistic settings in which bigger models are worse.
Epoch-wise Double Descent. In Section 6, we study the test error of a fixed, large architecture over
the course of training. We demonstrate, in similar settings as above, a corresponding peak in test
error when models are trained just long enough to reach <<FORMULA>> train error. The test error of a
large model first decreases (at the beginning of training), then increases (around the critical regime),
then decreases once more (at the end of training); that is, training longer can correct overfitting.
Sample-wise Non-monotonicity. In Section 7, we study the test error of a fixed model and training
procedure, for a varying number of train samples. Consistent with our generalized double-descent
hypothesis, we observe distinct test behavior in the “critical regime”, when the number of samples
is near the maximum that the model can fit. This often manifests as a long plateau region, in which
taking significantly more data might not help when training to completion (as is the case for CNNs on
CIFAR-10). Moreover, we show settings (Transformers on IWSLT14 de-en) where this manifests
as a peak, so that for a fixed architecture and training procedure, more data actually hurts.
Remarks on Label Noise. We observe all forms of double descent most strongly in settings with
label noise in the train set (as is often the case when collecting train data in the real world).
However, we also show several realistic settings with a test-error peak even without label noise: ResNets
(Figure 4a) and CNNs (Figure 20) on CIFAR-100; Transformers on IWSLT14 (Figure 8). Moreover,
all our experiments demonstrate distinctly different test behavior in the critical regime, often
manifesting as a “plateau” in the test error in the noiseless case which develops into a peak with
added label noise. See Section 8 for further discussion.
3 RELATED WORK
Model-wise double descent was first proposed as a general phenomenon by Belkin et al. (2018).
Similar behavior had been observed in Opper (1995; 2001), Advani & Saxe (2017), Spigler et al.
(2018), and Geiger et al. (2019b). Subsequently, there has been a large body of work studying the
double descent phenomenon. A growing list of papers that theoretically analyze it in the tractable
setting of linear least squares regression includes Belkin et al. (2019); Hastie et al. (2019); Bartlett
et al. (2019); Muthukumar et al. (2019); Bibas et al. (2019); Mitra (2019); Mei & Montanari (2019).
Moreover, Geiger et al. (2019a) provide preliminary results for model-wise double descent in con-
volutional networks trained on CIFAR-10. Our work differs from the above papers in two crucial
aspects: First, we extend the idea of double-descent beyond the number of parameters to incorpo-
rate the training procedure under a unified notion of “Effective Model Complexity”, leading to novel
insights like epoch-wise double descent and sample non-monotonicity. The notion that increasing
train time corresponds to increasing complexity was also presented in Nakkiran et al. (2019). Second,
we provide an extensive and rigorous demonstration of double descent for modern practices
spanning a variety of architectures, datasets, and optimization procedures. An extended discussion of the
related work is provided in Appendix C.
4 EXPERIMENTAL SETUP
We briefly describe the experimental setup here; full details are in Appendix B. We consider three
families of architectures: ResNets, standard CNNs, and Transformers. ResNets: We parameterize
a family of ResNet18s (He et al. (2016)) by scaling the width (number of filters) of convolutional
layers. Specifically, we use layer widths [k, 2k, 4k, 8k] for varying k. The standard ResNet18
corresponds to k = 64. Standard CNNs: We consider a simple family of 5-layer CNNs, with
4 convolutional layers of widths [k, 2k, 4k, 8k] for varying k, and a fully-connected layer. For
context, the CNN with width k = 64 can reach over 90% test accuracy on CIFAR-10 with data
augmentation. Transformers: We consider the 6-layer encoder-decoder from Vaswani et al. (2017),
as implemented by Ott et al. (2019). We scale the size of the network by modifying the embedding
dimension d_model, and setting the width of the fully-connected layers proportionally (<<FORMULA>>).
The raw data from our experiments are available at: https://gitlab.com/harvard-machine-learning/double-descent/tree/master
For ResNets and CNNs, we train with cross-entropy loss, and the following optimizers: (1) Adam
with learning rate 0.0001 for 4K epochs; (2) SGD with learning rate ∝ 1/√T for 500K gradient steps.
We train Transformers for 80K gradient steps, with 10% label smoothing and no drop-out.
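For the SGD runs, Appendix B.2 describes an inverse-square-root learning rate schedule that is updated every L = 512 gradient steps. A minimal sketch of one such schedule follows. The exact functional form (dividing by the square root of one plus the number of completed update intervals) is an assumption, the function name inverse_sqrt_lr is hypothetical, and the initial learning rate is left as an argument because its value is not recoverable from the text here.

```python
import math


def inverse_sqrt_lr(step: int, initial_lr: float, update_every: int = 512) -> float:
    """Piecewise-constant inverse-square-root decay (assumed form).

    The rate stays at initial_lr for the first `update_every` gradient steps,
    then drops to initial_lr / sqrt(2), initial_lr / sqrt(3), and so on.
    """
    completed_intervals = step // update_every
    return initial_lr / math.sqrt(1 + completed_intervals)


if __name__ == "__main__":
    # initial_lr=0.1 is a placeholder value chosen only for this demo.
    for step in (0, 511, 512, 1536, 500_000):
        print(step, inverse_sqrt_lr(step, initial_lr=0.1))
```

Keeping the rate constant within each interval of L steps is what makes this an "update every L steps" schedule; a per-step decay would drop the integer division.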
Label Noise. In our experiments, label noise of probability p refers to training on samples which
have the correct label with probability (<<FORMULA>>), and a uniformly random incorrect label otherwise
(label noise is sampled only once and not per epoch). Figure 1 plots test error on the noisy
distribution, while the remaining figures plot test error with respect to the clean distribution (the two curves
are just linear rescalings of one another).
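The label-noise procedure described above, and the linear relation between clean and noisy test error, can be sketched as follows. The function names are hypothetical and this is not the authors' code. The rescaling formula additionally assumes that test labels are corrupted by the same process, independently of the classifier's prediction: a correct prediction then disagrees with the noisy label with probability p, and an incorrect one agrees with it with probability p / (K - 1) for K classes.

```python
import random
from typing import List, Sequence


def add_label_noise(labels: Sequence[int], num_classes: int, p: float,
                    seed: int = 0) -> List[int]:
    """Return a noisy copy of `labels` (integers in [0, num_classes)).

    Each label is kept with probability 1 - p and otherwise replaced by a
    uniformly random *incorrect* class. Call this once before training so
    that the noise is fixed across epochs.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            other = rng.randrange(num_classes - 1)       # one of the K-1 wrong classes
            noisy.append(other if other < y else other + 1)  # skip the true class y
        else:
            noisy.append(y)
    return noisy


def noisy_test_error(clean_error: float, num_classes: int, p: float) -> float:
    """Expected error measured against noisy labels, as a linear function
    of the error measured against clean labels (under the stated assumptions)."""
    k = num_classes
    return p + clean_error * (1.0 - p * k / (k - 1))


if __name__ == "__main__":
    clean = [i % 10 for i in range(20)]
    print(add_label_noise(clean, num_classes=10, p=0.2, seed=0))
    print(noisy_test_error(clean_error=0.05, num_classes=10, p=0.2))
```

Under these assumptions the noisy error is an affine function of the clean error, so plotting one or the other changes only the offset and scale of the curve, not the location of its peak.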
5 MODEL-WISE DOUBLE DESCENT
<<FIGURE>>
Figure 4: Model-wise double descent for ResNet18s. Trained on CIFAR-100 and CIFAR-10, with
varying label noise. Optimized using Adam with LR 0.0001 for 4K epochs, and data augmentation.
In this section, we study the test error of models of increasing size, when training to completion
(for a fixed large number of optimization steps). We demonstrate model-wise double descent across
different architectures, datasets, optimizers, and training procedures. The critical region exhibits
distinctly different test behavior around the interpolation point and there is often a peak in test error
that becomes more prominent in settings with label noise.
For the experiments in this section (Figures 4, 5, 6, 7, 8), notice that all modifications which increase
the interpolation threshold (such as adding label noise, using data augmentation, and increasing the
number of train samples) also correspondingly shift the peak in test error towards larger models.
Additional plots showing the early-stopping behavior of these models, and additional experiments
showing double descent in settings with no label noise (e.g. Figure 19) are in Appendix E.2. We
also observed model-wise double descent for adversarial training, with a prominent robust test error
peak even in settings without label noise. See Figure 26 in Appendix E.2.
Discussion. Fully understanding the mechanisms behind model-wise double descent in deep neu-
ral networks remains an important open question. However, an analog of model-wise double descent
occurs even for linear models. A recent stream of theoretical works analyzes this setting (Bartlett
et al. (2019); Muthukumar et al. (2019); Belkin et al. (2019); Mei & Montanari (2019); Hastie et al.
(2019)). We believe similar mechanisms may be at work in deep neural networks.
Informally, our intuition is that for model sizes at the interpolation threshold, there is effectively
only one model that fits the train data and this interpolating model is very sensitive to noise in the
train set and/or model mis-specification. That is, since the model is just barely able to fit the train
data, forcing it to fit even slightly noisy or mis-specified labels will destroy its global structure, and
result in high test error. (See Figure 28 in the Appendix for an experiment demonstrating this noise
sensitivity, by showing that ensembling helps significantly in the critically parameterized regime.)
However, for over-parameterized models, there are many interpolating models that fit the train set,
and SGD is able to find one that “memorizes” (or “absorbs”) the noise while still performing well
on the distribution.
The above intuition is theoretically justified for linear models. In general, this situation manifests
even without label noise for linear models (Mei & Montanari (2019)), and occurs whenever there
is model mis-specification between the structure of the true distribution and the model family. We
believe this intuition extends to deep learning as well, and it is consistent with our experiments.
<<FIGURE>>
Figure 5: Effect of Data Augmentation. 5-layer CNNs on CIFAR-10, with and without data
augmentation. Data augmentation shifts the interpolation threshold to the right, shifting the test
error peak accordingly. Optimized using SGD for 500K steps. See Figure 27 for larger models.
<<FIGURE>>
Figure 6: SGD vs. Adam. 5-layer CNNs on CIFAR-10 with no label noise, and no data
augmentation. Optimized using SGD for 500K gradient steps, and Adam for 4K epochs.
<<FIGURE>>
Figure 7: Noiseless settings. 5-layer CNNs on CIFAR-100 with no label noise; note the peak in
test error. Trained with SGD and no data augmentation. See Figure 20 for the early-stopping
behavior of these models.
<<FIGURE>>
Figure 8: Transformers on language translation tasks: Multi-head-attention encoder-decoder
Transformer model trained for 80K gradient steps with label-smoothed cross-entropy loss on
IWSLT14 German-to-English (160K sentences) and WMT14 English-to-French (subsampled to
200K sentences) datasets. Test loss is measured as per-token perplexity.
6 EPOCH-WISE DOUBLE DESCENT
In this section, we demonstrate a novel form of double-descent with respect to training epochs,
which is consistent with our unified view of effective model complexity (EMC) and the generalized
double descent hypothesis. Increasing the train time increases the EMC—and thus a sufficiently
large model transitions from under- to over-parameterized over the course of training.
<<FIGURE>>
Figure 9: Left: Training dynamics for models in three regimes. Models are ResNet18s on CIFAR-10
with 20% label noise, trained using Adam with learning rate 0.0001, and data augmentation. Right:
Test error over (Model size × Epochs). Three slices of this plot are shown on the left.
As illustrated in Figure 9, sufficiently large models can undergo a “double descent” behavior where
test error first decreases, then increases near the interpolation threshold, and then decreases again. In
contrast, for “medium sized” models, for which training to completion will only barely reach ≈ 0
error, the test error as a function of training time will follow a classical U-like curve where it is
better to stop early. Models that are too small to reach the interpolation threshold will remain in
the “under-parameterized” regime where increasing train time monotonically decreases test error.
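The three qualitative behaviors described above (monotone decrease, a U-like curve, and double descent) can be told apart mechanically on a recorded test-error curve. The helper below is a hypothetical sketch for that purpose and is not part of the paper's analysis. It assumes the curve has already been smoothed, since on raw, noisy per-epoch measurements any small fluctuation would register as a rise or a fall.

```python
from typing import Sequence


def curve_shape(test_error: Sequence[float]) -> str:
    """Classify a (smoothed) test-error-vs-training-time curve.

    Returns:
      'monotone'       if the error never rises after the start,
      'u-shaped'       if it falls, then rises until the end of training,
      'double-descent' if it falls, rises, and then falls again.
    """
    if not test_error:
        raise ValueError("test_error must contain at least one value")
    last = len(test_error) - 1

    # Walk down to the first local minimum.
    i = 0
    while i < last and test_error[i + 1] <= test_error[i]:
        i += 1
    if i == last:
        return "monotone"

    # Walk up from that minimum to the following peak.
    j = i
    while j < last and test_error[j + 1] >= test_error[j]:
        j += 1
    if j == last:
        return "u-shaped"

    # The error drops again after the peak.
    return "double-descent"


if __name__ == "__main__":
    # Made-up curves, for illustration only.
    print(curve_shape([0.5, 0.4, 0.3, 0.2]))
    print(curve_shape([0.5, 0.3, 0.2, 0.3, 0.4]))
    print(curve_shape([0.5, 0.3, 0.35, 0.4, 0.3, 0.2]))
```

The labels mirror the small, medium-sized, and large model cases in the paragraph above; the sketch does not check whether the final error falls below the first minimum.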
Our experiments (Figure 10) show that many settings of dataset and architecture exhibit epoch-wise
double descent, in the presence of label noise. Further, this phenomenon is robust across optimizer
variations and learning rate schedules (see additional experiments in Appendix E.1). As in model-
wise double descent, the test error peak is accentuated with label noise.
Conventional wisdom suggests that training is split into two phases: (1) in the first phase, the network
learns a function with a small generalization gap; (2) in the second phase, the network starts
to over-fit the data, leading to an increase in test error. Our experiments suggest that this is not the
complete picture: in some regimes, the test error decreases again and may achieve a lower value at
the end of training as compared to the first minimum (see Figure 10 for 10% label noise).
<<FIGURE>>
Figure 10: Epoch-wise double descent for ResNet18 and CNN (width=128). ResNets trained using
Adam with learning rate 0.0001, and CNNs trained with SGD with inverse-square root learning rate.
7 SAMPLE-WISE NON-MONOTONICITY
In this section, we investigate the effect of varying the number of train samples, for a fixed model and
training procedure. Previously, in model-wise and epoch-wise double descent, we explored behavior
in the critical regime, where <<FORMULA>>, by varying the EMC. Here, we explore the critical
regime by varying the number of train samples n. By increasing n, the same training procedure T
can switch from being effectively over-parameterized to effectively under-parameterized.
We show that increasing the number of samples has two different effects on the test error vs. model
complexity graph. On the one hand, (as expected) increasing the number of samples shrinks the area
under the curve. On the other hand, increasing the number of samples also has the effect of “shifting
the curve to the right” and increasing the model complexity at which test error peaks.
<<FIGURE>>
Figure 11: Sample-wise non-monotonicity.
These twin effects are shown in Figure 11a. Note that there is a range of model sizes where the
effects “cancel out”, and having 4× more train samples does not help test performance when
training to completion. Outside the critically parameterized regime, for sufficiently under- or
over-parameterized models, having more samples helps. This phenomenon is corroborated in Figure 12,
which shows test error as a function of both model and sample size, in the same setting as Figure 11a.
<<FIGURE>>
Figure 12: Left: Test error as a function of model size and number of train samples, for 5-layer
CNNs on CIFAR-10 + 20% noise. Note the ridge of high test error again lies along the interpolation
threshold. Right: Three slices of the left plot, showing the effect of more data for models of
different sizes. Note that, when training to completion, more data helps for small and large models,
but does not help for near-critically-parameterized models (green).
In some settings, these two effects combine to yield a regime of model sizes where more data actually
hurts test performance as in Figure 3 (see also Figure 11b). Note that this phenomenon is not unique
to DNNs: more data can hurt even for linear models (see Appendix D).
8 CONCLUSION AND DISCUSSION
We introduce a generalized double descent hypothesis: models and training procedures exhibit atyp-
ical behavior when their Effective Model Complexity is comparable to the number of train samples.
We provide extensive evidence for our hypothesis in modern deep learning settings, and show that
it is robust to choices of dataset, architecture, and training procedures. In particular, we demon-
strate “model-wise double descent” for modern deep networks and characterize the regime where
bigger models can perform worse. We also demonstrate “epoch-wise double descent,” which, to the
best of our knowledge, has not been previously proposed. Finally, we show that the double descent
phenomenon can lead to a regime where training on more data leads to worse test performance.
Preliminary results suggest that double descent also holds as we vary the amount of regularization
for a fixed model (see Figure 22).
We also believe our characterization of the critical regime provides a useful way of thinking for
practitioners—if a model and training procedure are just barely able to fit the train set, then small
changes to the model or training procedure may yield unexpected behavior (e.g. making the model
slightly larger or smaller, changing regularization, etc. may hurt test performance).
Early stopping. We note that many of the phenomena that we highlight often do not occur with
optimal early-stopping. However, this is consistent with our generalized double descent hypothesis:
if early stopping prevents models from reaching 0 train error then we would not expect to see double
descent, since the EMC does not reach the number of train samples. Further, we show at least one
setting where model-wise double descent can still occur even with optimal early stopping (ResNets
on CIFAR-100 with no label noise, see Figure 19). We have not observed settings where more data
hurts when optimal early-stopping is used. However, we are not aware of reasons which preclude
this from occurring. We leave fully understanding the optimal early stopping behavior of double
descent as an important open question for future work.
Label Noise. In our experiments, we observe double descent most strongly in settings with label
noise. However, we believe this effect is not fundamentally about label noise, but rather about
model mis-specification. For example, consider a setting where the label noise is not truly random,
but rather pseudorandom (with respect to the family of classifiers being trained). In this setting,
the performance of the Bayes optimal classifier would not change (since the pseudorandom noise
is deterministic, and invertible), but we would observe an identical double descent as with truly
random label noise. Thus, we view adding label noise as merely a proxy for making distributions
“harder”, i.e. increasing the amount of model mis-specification.
Other Notions of Model Complexity. Our notion of Effective Model Complexity is related to
classical complexity notions such as Rademacher complexity, but differs in several crucial ways:
(1) EMC depends on the true labels of the data distribution, and (2) EMC depends on the training
procedure, not just the model architecture.
Other notions of model complexity which do not incorporate features (1) and (2) would not suffice
to characterize the location of the double-descent peak. Rademacher complexity, for example, is
determined by the ability of a model architecture to fit a randomly-labeled train set. But Rademacher
complexity and VC dimension are both insufficient to determine the model-wise double descent
peak location, since they do not depend on the distribution of labels, and our experiments show
that adding label noise shifts the location of the peak.
Moreover, both Rademacher complexity and VC dimension depend only on the model family and
data distribution, and not on the training procedure used to find models. Thus, they are not capable
of capturing train-time double-descent effects, such as “epoch-wise” double descent, and the effect
of data-augmentation on the peak location.
ACKNOWLEDGMENTS
We thank Mikhail Belkin for extremely useful discussions in the early stages of this work. We
thank Christopher Olah for suggesting the Model Size × Epoch visualization, which led to the
investigation of epoch-wise double descent, as well as for useful discussion and feedback. We also
thank Alec Radford, Jacob Steinhardt, and Vaishaal Shankar for helpful discussion and suggestions.
P.N. thanks OpenAI, the Simons Institute, and the Harvard Theory Group for a research environment
that enabled this kind of work.
We thank Dimitris Kalimeris, Benjamin L. Edelman, Sharon Qian, and Aditya Ramesh for
comments on an early draft of this work.
This work was supported in part by NSF grant CAREER CCF 1452961, BSF grant 2014389, NSF USICCS
proposal 1540428, a Google Research award, a Facebook research award, a Simons Investigator
Award, a Simons Investigator Fellowship, and NSF Awards CCF 1715187, CCF 1565264, CCF
1301976, IIS 1409097, and CNS 1618026. Y.B. would like to thank the MIT-IBM Watson AI Lab
for contributing computational resources for experiments.
REFERENCES
Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural
networks. arXiv preprint arXiv:1710.03667, 2017.
Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear
regression. arXiv preprint arXiv:1906.11300, 2019.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning
and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv
preprint arXiv:1903.07571, 2019.
Koby Bibas, Yaniv Fogel, and Meir Feder. A new look at an old problem: A universal learning
approach to linear regression. arXiv preprint arXiv:1905.04708, 2019.
Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web inventory of transcribed and
translated talks. In Proceedings of the 16th Conference of the European Association for Machine
Translation (EAMT), pp. 261–268, Trento, Italy, May 2012.
Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli,
Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with
number of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019a.
Mario Geiger, Stefano Spigler, Stéphane d'Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli,
and Matthieu Wyart. Jamming transition as a paradigm to understand the loss landscape of deep
neural networks. Physical Review E, 100(1):012115, 2019b.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
examples. arXiv preprint arXiv:1412.6572, 2014.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical
learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85,
2005.
Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in
high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European conference on computer vision, pp. 630–645. Springer, 2016.
Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and
Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism.
CoRR, abs/1811.06965, 2018. URL http://arxiv.org/abs/1811.06965.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105,
2012.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083,
2017.
Song Mei and Andrea Montanari. The generalization error of random features regression: Precise
asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
Partha P. Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for
l2 and l1 penalized interpolation. ArXiv, abs/1906.03667, 2019.
Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in
regression. arXiv preprint arXiv:1903.09139, 2019.
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L Edelman, Fred
Zhang, and Boaz Barak. Sgd on neural networks learns functions of increasing complexity. arXiv
preprint arXiv:1905.11604, 2019.
Manfred Opper. Statistical mechanics of learning: Generalization. The Handbook of Brain Theory
and Neural Networks, pp. 922–925, 1995.
Manfred Opper. Learning to generalize. Frontiers of Life, 3(part 2), pp. 763–775, 2001.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier,
and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of
NAACL-HLT 2019: Demonstrations, 2019.
David Page. How to train your resnet. https://myrtle.ai/how-to-train-your-resnet-4-architecture/, 2018.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
PyTorch. In NeurIPS Autodiff Workshop, 2017.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in
neural information processing systems, pp. 1177–1184, 2008.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with
subword units. ArXiv, abs/1508.07909, 2015.
Stefano Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart.
A jamming transition from under- to over-parametrization affects loss landscape and
generalization. arXiv preprint arXiv:1810.09665, 2018.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In
Computer Vision and Pattern Recognition (CVPR), 2015. URL http://arxiv.org/abs/1409.4842.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding
deep learning requires rethinking generalization. ICLR, abs/1611.03530, 2016.
A SUMMARY TABLE OF EXPERIMENTAL RESULTS
<<TABLE>>
B APPENDIX: EXPERIMENTAL DETAILS
B.1 MODELS
We use the following families of architectures. The PyTorch (Paszke et al. (2017))
specifications of our ResNets and CNNs are available at https://gitlab.com/harvard-machine-learning/double-descent/tree/master.
ResNets. We define a family of ResNet18s of increasing size as follows. We follow the Preactivation
ResNet18 architecture of He et al. (2016), using 4 ResNet blocks, each consisting of two
BatchNorm-ReLU-Convolution layers. The layer widths for the 4 blocks are [k, 2k, 4k, 8k] for
varying k ∈ ℕ and the strides are [1, 2, 2, 2]. The standard ResNet18 corresponds to k = 64
convolutional channels in the first layer. The scaling of model size with k is shown in Figure 13b. Our
implementation is adapted from https://github.com/kuangliu/pytorch-cifar.
Standard CNNs. We consider a simple family of 5-layer CNNs, with four
Conv-BatchNorm-ReLU-MaxPool layers and a fully-connected output layer. We scale the four convolutional layer
widths as [k, 2k, 4k, 8k]. The MaxPool sizes are [1, 2, 2, 8]. For all the convolution layers, the kernel
size is 3, the stride is 1 and the padding is 1. This architecture is based on the “backbone” architecture from
Page (2018). For k=64, this CNN has 1558026 parameters and can reach >90% test accuracy on
CIFAR-10 (Krizhevsky, 2009) with data-augmentation. The scaling of model size with k is shown
in Figure 13a.
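The parameter count quoted for k=64 can be reproduced with a short calculation. The sketch below assumes details the text does not state: each convolution carries a bias, each BatchNorm layer has a learned scale and shift, and the output layer maps 8k features to 10 CIFAR-10 classes. The function name `cnn5_param_count` is hypothetical and is not taken from the authors' code.

```python
def cnn5_param_count(k, in_channels=3, num_classes=10, kernel=3):
    """Count trainable parameters of the 5-layer CNN with widths [k, 2k, 4k, 8k].

    Assumptions (not stated in the text): convolutions have biases and
    BatchNorm layers have a learned scale and shift per channel.
    """
    widths = [k, 2 * k, 4 * k, 8 * k]
    total = 0
    c_in = in_channels
    for c_out in widths:
        total += c_in * c_out * kernel * kernel + c_out  # conv weights + bias
        total += 2 * c_out                               # BatchNorm scale + shift
        c_in = c_out
    # MaxPool sizes [1, 2, 2, 8] reduce a 32x32 input to 1x1, so the
    # fully-connected layer sees 8k features.
    total += c_in * num_classes + num_classes
    return total


print(cnn5_param_count(64))  # 1558026
```

Under these assumptions the count matches the figure quoted above, and it grows roughly quadratically in k because the convolution weights dominate.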
Transformers. We consider the encoder-decoder Transformer model from Vaswani et al. (2017)
with 6 layers and 8 attention heads per layer, as implemented by fairseq (Ott et al., 2019). We scale
the size of the network by modifying the embedding dimension (d_model), and scale the width of the
fully-connected layers proportionally (d_ff = 4 · d_model). We train with 10% label smoothing and no
drop-out, for 80 gradient steps.
<<FIGURE>>
Figure 13: Scaling of model size with our parameterization of width & embedding dimension.
B.2 IMAGE CLASSIFICATION: EXPERIMENTAL SETUP
We describe the details of training for CNNs and ResNets below.
Loss function: Unless stated otherwise, we use the cross-entropy loss for all the experiments.
Data-augmentation: In experiments where data-augmentation was used, we apply
RandomCrop(32, padding=4) and RandomHorizontalFlip. In experiments with
added label noise, all augmentations of a given training sample are given the same
label.
Regularization: No explicit regularization like weight decay or dropout was applied unless
explicitly stated.
Initialization: We use the default initialization provided by PyTorch for all the layers.
Optimization:
Adam: Unless specified otherwise, the learning rate was held constant at 1e-4 and all other
parameters were set to their default PyTorch values.
SGD: Unless specified otherwise, the inverse-square root learning rate schedule (defined below)
was used with initial learning rate <<FORMULA>> and updates every L=512 gradient steps.
No momentum was used.
We found our results are robust to various other natural choices of optimizers and learning rate
schedule. We used the above settings because (1) they optimize well, and (2) they do not require
experiment-specific hyperparameter tuning, and allow us to use the same optimization across many
experiments.
Batch size: All experiments use a batch size of 128.
Learning rate schedule descriptions:
Inverse-square root (<<FORMULA>>): At gradient step t, the learning rate is set to <<FORMULA>>. We set the learning rate with respect to the number of gradient steps, and not epochs, <<FORMULA>>
in order to allow comparison between experiments with varying train-set sizes.
Dynamic drop (<<FORMULA>>, drop, patience): Starts with an initial learning rate of γ0 and drops by
a factor of drop if the training loss has remained constant or become worse for patience
number of gradient steps.
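The two schedules can be sketched in a few lines. The exact formulas are elided in this text, so the code below is an assumption based on the prose: the inverse-square root schedule is taken to be lr0 / sqrt(1 + floor(t / L)), which changes only every L gradient steps, and "drops by a factor of drop" is read as dividing the learning rate by `drop`. All names (`inverse_sqrt_lr`, `DynamicDrop`) are hypothetical.

```python
import math


def inverse_sqrt_lr(t, lr0, L=512):
    """Assumed form: lr0 / sqrt(1 + floor(t / L)), piecewise constant over L steps."""
    return lr0 / math.sqrt(1 + t // L)


class DynamicDrop:
    """Divide the learning rate by `drop` after `patience` steps without improvement."""

    def __init__(self, lr0, drop, patience):
        self.lr = lr0
        self.drop = drop
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, train_loss):
        if train_loss < self.best:
            # Training loss improved: remember it and reset the counter.
            self.best = train_loss
            self.bad_steps = 0
        else:
            # Loss stayed constant or got worse.
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr /= self.drop
                self.bad_steps = 0
        return self.lr
```

Indexing both schedules by gradient step rather than by epoch keeps them comparable across training sets of different sizes, which is the motivation given above.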
B.3 NEURAL MACHINE TRANSLATION: EXPERIMENTAL SETUP
Here we describe the experimental setup for the neural machine translation experiments.
Training procedure.
In this setting, the distribution D consists of triples
<<FORMULA>>
where V_src and V_tgt are the source and target vocabularies, the string x is a sentence in the source
language, y is its translation in the target language, and i is the index of the token to be predicted by
the model. We assume that <<FORMULA>> is distributed uniformly on <<FORMULA>>.
A standard probabilistic model defines an autoregressive factorization of the likelihood:
<<FORMULA>>
Given a set of training samples S, we define
<<FORMULA>>
In practice, S is not constructed from independent samples from D, but rather by first sampling
<<(x,y)>> and then including all <<FORMULA>> in S.
For training transformers, we replicate the optimization procedure specified in Vaswani et al. (2017)
section 5.3, where the learning rate schedule consists of a “warmup” phase with linearly increasing
learning rate followed by a phase with inverse square-root decay. We preprocess the data using byte
pair encoding (BPE) as described in Sennrich et al. (2015). We use the implementation provided by
fairseq (https://github.com/pytorch/fairseq).
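For reference, the schedule in Section 5.3 of Vaswani et al. (2017) can be written as a single function of the step count. This is a standalone sketch, not fairseq's scheduler API; the defaults `d_model=512` and `warmup_steps=4000` are the values from that paper and are assumptions here, since this appendix does not list the values used.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse square-root decay (Vaswani et al., 2017)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises linearly until `warmup_steps`, peaks there, then decays.
for s in (1000, 4000, 16000):
    print(s, transformer_lr(s))
```

The two arguments of `min` are equal exactly at `step == warmup_steps`, which is where the schedule switches from the warmup phase to the decay phase.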
Datasets. The IWSLT14 German to English dataset contains TED Talks as described in Cettolo
et al. (2012). The WMT14 English to French dataset is taken from http://www.statmt.org/wmt14/translation-task.html.
B.4 PER-SECTION EXPERIMENTAL DETAILS
Here we provide full details for experiments in the body, when not otherwise provided.
Introduction: Experimental Details. Figure 1: All models were trained using Adam with learning
rate 0.0001 for 4K epochs. We plot means and standard deviations for 5 trials, with random network
initialization.
Model-wise Double Descent: Experimental Details. Figure 7: We plot means and standard
deviations for 5 trials, with random network initialization.
Sample-wise Nonmonotonicity: Experimental Details. Figure 11a: All models are trained with
SGD for 500K epochs, and data-augmentation. Bottom: Means and standard deviations from 5
trials with random initialization, and random subsampling of the train set.
C EXTENDED DISCUSSION OF RELATED WORK
Belkin et al. (2018): This paper proposed, in very general terms, that the apparent contradiction
between traditional notions of the bias-variance trade-off and empirically successful practices in
deep learning can be reconciled under a double-descent curve: as model complexity increases, the
test error follows the traditional “U-shaped curve”, but beyond the point of interpolation, the error
starts to decrease. This work provides empirical evidence for the double-descent curve with fully
connected networks trained on subsets of the MNIST, CIFAR-10, SVHN and TIMIT datasets. They use
the l2 loss for their experiments. They demonstrate that neural networks are not an aberration in this
regard: double-descent is a general phenomenon observed also in linear regression with random
features and random forests.
Theoretical works on linear least squares regression: A variety of papers have attempted to
theoretically analyze this behavior in restricted settings, particularly the case of least squares regression
under various assumptions on the training data, feature spaces and regularization method.
1. Advani & Saxe (2017) and Hastie et al. (2019) both consider the linear regression problem
stated above and analyze the generalization behavior in the asymptotic limit <<FORMULA>>
using random matrix theory. Hastie et al. (2019) highlight that when the model is
mis-specified, the minimum of training error can occur for over-parameterized models.
2. Belkin et al. (2019) study linear least squares regression for two data models, where the input
data is sampled from a Gaussian and a Fourier series model for functions on a circle. They
provide a finite-sample analysis for these two cases.
3. Bartlett et al. (2019) provide generalization bounds for the minimum l2-norm interpolant
for Gaussian features.
4. Muthukumar et al. (2019) characterize the fundamental limit of any interpolating solution
in the presence of noise and provide some interesting Fourier-theoretic interpretations.
5. Mei & Montanari (2019) provide an asymptotic analysis for ridge regression over
random features.
Similar double descent behavior was investigated in Opper (1995; 2001).
Geiger et al. (2019b) showed that deep fully connected networks trained on the MNIST dataset with
hinge loss exhibit a “jamming transition” when the number of parameters exceeds a threshold that
allows training to near-zero train loss. Geiger et al. (2019a) provide further experiments on CIFAR-
10 with a convolutional network. They also highlight interesting behavior with ensembling around
the critical regime, which is consistent with our informal intuitions in Section 5 and our experiments
in Figures 28, 29.
Advani & Saxe (2017); Geiger et al. (2019b;a) also point out that double-descent is not observed
when optimal early-stopping is used.
D RANDOM FEATURES: A CASE STUDY
<<FIGURE>>
Figure 14: Random Fourier Features on the Fashion MNIST dataset. The setting is equivalent
to a two-layer neural network with e^{ix} activation, with a randomly-initialized first layer that is fixed
throughout training. The second layer is trained using gradient flow.
In this section, for completeness, we show that both the model-wise and sample-wise double
descent phenomena are not unique to deep neural networks: they exist even in the setting of Random
Fourier Features of Rahimi & Recht (2008). This setting is equivalent to a two-layer neural network
with <<FORMULA>> activation. The first layer is initialized with a N(0, 1/d) Gaussian distribution and then
fixed throughout training. The width (or embedding dimension) d of the first layer parameterizes
the model size. The second layer is initialized with 0s and trained with MSE loss.
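A minimal sketch of the fixed first layer may help make the setting concrete. It assumes the activation named in the Figure 14 caption, e^{ix}, applied to a random Gaussian projection of the input. The function names are hypothetical, and the standard deviation of the Gaussian is left as an explicit argument rather than fixed to any particular value.

```python
import cmath
import random


def init_first_layer(d, input_dim, std, seed=0):
    """Draw a d x input_dim Gaussian weight matrix once; it is never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(input_dim)] for _ in range(d)]


def random_fourier_features(x, weights):
    """Map x to d complex features exp(i * <w_j, x>), one per first-layer row."""
    return [
        cmath.exp(1j * sum(w_i * x_i for w_i, x_i in zip(w, x)))
        for w in weights
    ]


W = init_first_layer(d=8, input_dim=4, std=1.0)
features = random_fourier_features([0.5, -1.0, 2.0, 0.0], W)
print(len(features))  # 8
```

Only the second (linear) layer on top of these features is trained, so the number of random features d is the single knob that controls model size in this setting.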
Figure 14 shows the grid of Test Error as a function of both the number of samples n and the model size d.
Note that in this setting EMC = d (the embedding dimension). As a result, as demonstrated in the
figure, the peak follows the path of n = d. Both model-wise and sample-wise (see Figure 15) double
descent phenomena are captured, by horizontally and vertically crossing the grid, respectively.
<<FIGURE>>
Figure 15: Sample-wise double-descent slice for Random Fourier Features on the Fashion MNIST
dataset. In this figure the embedding dimension (number of random features) is 1000.
E APPENDIX: ADDITIONAL EXPERIMENTS
E.1 EPOCH-WISE DOUBLE DESCENT: ADDITIONAL RESULTS
Here, we provide a rigorous evaluation of epoch-wise double descent for a variety of optimizers and
learning rate schedules. We train ResNet18 on CIFAR-10 with data-augmentation and 20% label
noise with three different optimizers (Adam, SGD, and SGD + Momentum with momentum set to 0.9) and
three different learning rate schedules (constant, inverse-square root, and dynamic drop) for different
values of initial learning rate. We observe that double-descent occurs reliably for all optimizers and
learning rate schedules and the peak of the double descent curve shifts with the interpolation point.
<<FIGURE>>
Figure 16: Epoch-wise double descent for ResNet18 trained with Adam and multiple learning rate
schedules.
A practical recommendation resulting from epoch-wise double descent is that stopping the training
when the test error starts to increase may not always be the best strategy. In some cases, the test error
may decrease again after reaching a maximum, and the final value may be lower than the minimum
earlier in training.
<<FIGURE>>
Figure 17: Epoch-wise double descent for ResNet18 trained with SGD and multiple learning rate
schedules.
<<FIGURE>>
Figure 18: Epoch-wise double descent for ResNet18 trained with SGD+Momentum and multiple
learning rate schedules.
E.2 MODEL-WISE DOUBLE DESCENT: ADDITIONAL RESULTS
E.2.1 CLEAN SETTINGS WITH MODEL-WISE DOUBLE DESCENT
<<FIGURE>>
Figure 19: Top: Train and test performance as a function of both model size and train epochs.
Bottom: Test error dynamics of the same model (ResNet18, on CIFAR-100 with no label noise,
data-augmentation and Adam optimizer trained for 4k epochs with learning rate 0.0001). Note that
even with optimal early stopping this setting exhibits double descent.
<<FIGURE>>
Figure 20: Top: Train and test performance as a function of both model size and train epochs.
Bottom: Test error dynamics of the same models. 5-Layer CNNs, CIFAR-100 with no label noise,
no data-augmentation. Trained with SGD for 1e6 steps. Same experiment as Figure 7.
E.2.2 WEIGHT DECAY
<<FIGURE>>
Figure 21: Left: Test error dynamics with weight decay of 5e-4 (bottom left) and without weight
decay (top left). Right: Test and train error and test loss for models with varying amounts of
weight decay. All models are 5-Layer CNNs on CIFAR-10 with 10% label noise, trained with
data-augmentation and SGD for 500K steps.
Here we study the effect of varying the level of regularization on test error. We train ResNet18 on
CIFAR-10 with data-augmentation and 20% label noise for weight decay coefficients <<FORMULA>> ranging
from 0 to 0.1. We train the networks using SGD + inverse-square root learning rate. The figure
below shows a picture qualitatively very similar to that observed for model-wise double descent,
wherein “model complexity” is now controlled by the regularization parameter. This confirms our
generalized double descent hypothesis along yet another axis of Effective Model Complexity.
<<FIGURE>>
Figure 22: Generalized double descent for weight decay. We found that using the same initial
learning rate for all weight decay values led to training instabilities. This resulted in some noise in
the Test Error (Weight Decay × Epochs) plot shown above.
E.2.3 EARLY STOPPING DOES NOT EXHIBIT DOUBLE DESCENT
<<FIGURE>>
Figure 23: Model-wise test error dynamics for a subsampled IWSLT14 dataset. Left: 4k samples,
Right: 18k samples. Note that with optimal early-stopping, more samples is always better.
<<FIGURE>>
Figure 24: Model-wise test error dynamics for the IWSLT14 de-en and subsampled WMT14 en-fr
datasets. Left: IWSLT14, Right: subsampled (200k samples) WMT14. Note that with optimal
early-stopping, the test error is much lower for this task.
<<FIGURE>>
Figure 25: Top: Train and test performance as a function of both model size and train epochs.
Bottom: Test error dynamics of the same model (CNN, on CIFAR-10 with 10% label noise, data-augmentation and SGD optimizer with learning rate ∝ 1/√T).
E.2.4 TRAINING PROCEDURE
<<FIGURE>>
Figure 26: Model-wise double descent for adversarial training. ResNet18s on CIFAR-10
(subsampled to 25k train samples) with no label noise. We train for L2 robustness of radius <<FORMULA>> and
<<FORMULA>>, using 10-step PGD (Goodfellow et al. (2014); Madry et al. (2017)). Trained using SGD
(batch size 128) with learning rate 0.1 for 400 epochs, then 0.01 for 400 epochs.
<<FIGURE>>
Figure 27
E.3 ENSEMBLING
<<FIGURE>>
Figure 28: Effect of Ensembling (ResNets, 15% label noise). Test error of an ensemble of 5
models, compared to the base models. The ensembled classifier is determined by plurality vote over
the 5 base models. Note that ensembling helps most around the critical regime. All models are
ResNet18s trained on CIFAR-10 with 15% label noise, using Adam for 4K epochs (same setting
as Figure 1). Test error is measured against the original (not noisy) test set, and each model in the
ensemble is trained using a train set with independently-sampled 15% label noise.
<<FIGURE>>
Figure 29: Effect of Ensembling (CNNs, no label noise). Test error of an ensemble of 5 models,
compared to the base models. All models are 5-layer CNNs trained on CIFAR-10 with no label
noise, using SGD and no data augmentation (same setting as Figure 7).
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Deep Residual Learning for Image Recognition
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun Microsoft Research {kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
1. Introduction
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the levels of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit very deep [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.
<<FIGURE>>
Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer plain networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.
Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers?
An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation [22].
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as <<H(x)>>, we let the stacked nonlinear layers fit another mapping of <<FORMULA>>. The original mapping is recast into <<F(x)+x>>. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
The formulation of <<FORMULA>> can be realized by feedforward neural networks with shortcut connections (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart plain nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.
Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.
On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.
2. Related Work
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.
In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an inception layer is composed of a shortcut branch and a few deeper branches.
Concurrent with our work, highway networks [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is closed (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
3. Deep Residual Learning
3.1. Residual Learning
Let us consider <<H(x)>> as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., <<H(x) - x>> (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate <<H(x)>>, we explicitly let these layers approximate a residual function <<F(x) := H(x) - x>>. The original function thus becomes <<F(x)+x>>. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
3.2. Identity Mapping by Shortcuts
We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:
<<y = F(x, {Wi})+ x>>. (1)
Here x and y are the input and output vectors of the layers considered. The function <<F(x, {Wi})>> represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, <<F = W_2 σ(W_1 x)>> in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation <<FORMULA>> is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., <<FORMULA>>, see Fig. 2).
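The two-layer building block of Eqn.(1) can be sketched with plain lists standing in for vectors and matrices. This is an illustrative fully-connected version with biases omitted, as in the text; the helper names are hypothetical and this is not the authors' implementation.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """y = relu(F(x) + x) with F(x) = W2 relu(W1 x); biases omitted."""
    f = matvec(W2, relu(matvec(W1, x)))   # residual mapping F(x, {W_i})
    summed = [fi + xi for fi, xi in zip(f, x)]  # identity shortcut, element-wise add
    return relu(summed)                   # second nonlinearity after the addition


zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, zeros))  # [1.0, 2.0]
```

With all weights at zero the residual branch vanishes and the block passes a non-negative input through unchanged, which illustrates why driving the weights toward zero approaches an identity mapping.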
The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions:
<<FORMULA>>. (2)
We can also use a square matrix <<W_s>> in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus Ws is only used when matching dimensions.
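When the input and output dimensions differ, the shortcut applies the linear projection W_s described above, so the block computes F(x, {W_i}) + W_s x. The sketch below is a hypothetical fully-connected illustration; it also applies the ReLU after the addition, carrying over the convention of the identity block, which is an assumption for this case.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def projection_block(x, W1, W2, Ws):
    """y = relu(F(x) + Ws x), where Ws maps x to the output dimension of F."""
    f = matvec(W2, relu(matvec(W1, x)))  # residual branch, output dimension m
    shortcut = matvec(Ws, x)             # projected shortcut, also dimension m
    if len(f) != len(shortcut):
        raise ValueError("residual branch and shortcut dimensions must match")
    return relu([fi + si for fi, si in zip(f, shortcut)])


# Map a 2-dimensional input to a 3-dimensional output.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W2 = [[0.0] * 3 for _ in range(3)]
Ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(projection_block([1.0, 2.0], W1, W2, Ws))  # [1.0, 2.0, 3.0]
```

Unlike the identity shortcut, W_s adds parameters, which is consistent with the text reserving it for the cases where dimensions must be matched.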
The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: <<y = W1x + x>>, for which we have not observed advantages.
We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function <<F(x, {Wi})>> can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.
3.3. Network Architectures
We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.
Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).
It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).
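Design rule (ii) and the 18% figure can both be checked with simple arithmetic. The sketch below counts multiply-adds for a single convolution as output height × width × input channels × output channels × kernel area; the layer sizes (56×56 with 64 channels, 28×28 with 128 channels) are illustrative assumptions, and `conv_multiply_adds` is a hypothetical helper.

```python
def conv_multiply_adds(h, w, c_in, c_out, k=3):
    """Multiply-adds of one k x k convolution producing an h x w x c_out map."""
    return h * w * c_in * c_out * k * k


# Rule (ii): halving the feature map side (1/4 of the positions) while
# doubling the filters (4x the channel products) keeps the cost per layer fixed.
before = conv_multiply_adds(56, 56, 64, 64)
after = conv_multiply_adds(28, 28, 128, 128)
print(before == after)  # True

# Ratio of the 34-layer baseline (3.6 GFLOPs) to VGG-19 (19.6 GFLOPs).
print(round(100 * 3.6 / 19.6))  # 18
```

The equality holds for layers whose input and output widths are both doubled; the first layer after each downsampling step still reads the narrower input, so it is somewhat cheaper.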
Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
3.4. Implementation
Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].
In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
4. Experiments
4.1. ImageNet Classification
We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.
Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.
The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we com.pare their training/validation errors during the training procedure. We have observed the degradation problem
<<TABLE>>
Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Down-sampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.
<<FIGURE>>
Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures.
the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.
We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error (footnote 3). The reason for such optimization difficulties will be studied in the future.
Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameters compared to the plain counterparts.
We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.
Second, compared to its plain counterpart, the 34-layer
Footnote 3. We have experimented with more training iterations (3×) and still observed the degradation problem, suggesting that this problem cannot be feasibly addressed by simply using more iterations.
<<TABLE>>
Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.
<<TABLE>>
Table 4. Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).
<<TABLE>>
Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.
<<TABLE>>
ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.
Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is not overly deep (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.
Identity vs. Projection Shortcuts. We have shown that
Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet.
<<FIGURE>>
parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.
Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.
Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design (footnote 4). For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.
The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.
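The complexity claims above can be checked with a short weight count. The channel widths are assumptions chosen for illustration, since the text of this section does not state them: a basic block of two 3×3 layers on 64 channels, and a bottleneck block on 256 channels with a 64-channel 3×3 layer. Biases are ignored, and at a fixed spatial size the number of multiply-adds per position equals the number of weights.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


# Assumed widths: 64-d basic block versus 256-d bottleneck with a 64-d 3x3 layer.
basic = 2 * conv_weights(3, 64, 64)
bottleneck = (conv_weights(1, 256, 64)      # 1x1, reduce 256 -> 64
              + conv_weights(3, 64, 64)     # 3x3 on the reduced width
              + conv_weights(1, 64, 256))   # 1x1, restore 64 -> 256
projection = conv_weights(1, 256, 256)      # 1x1 projection shortcut on the 256-d ends

print(basic)                    # 73728
print(bottleneck)               # 69632
print(bottleneck + projection)  # 135168
```

Under these assumed widths the two block designs cost about the same, while adding a projection shortcut across the 256-d ends roughly doubles the bottleneck block (135168 versus 69632 weights).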
50-layer ResNet: We replace each 2-layer block in the
Footnote 4. Deeper non-bottleneck ResNets (e.g., Fig. 5 left) also gain accuracy from increased depth (as shown on CIFAR-10), but are not as economical as the bottleneck ResNets. So the usage of bottleneck designs is mainly due to practical considerations. We further note that the degradation problem of plain nets is also witnessed for the bottleneck designs.
34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.
101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).
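The layer counts in the model names follow from the number of blocks per stage. In the sketch below the per-stage block counts are assumptions, written down from the commonly published ResNet configurations because Table 1 is not reproduced in this text. Projection shortcuts are not counted as layers.

```python
def resnet_depth(blocks_per_stage, layers_per_block):
    """Weighted layers: first conv + all convs inside blocks + final fc layer."""
    return 1 + layers_per_block * sum(blocks_per_stage) + 1


# (blocks in conv2_x .. conv5_x, conv layers per block); assumed configurations.
CONFIGS = {
    "ResNet-18": ((2, 2, 2, 2), 2),
    "ResNet-34": ((3, 4, 6, 3), 2),
    "ResNet-50": ((3, 4, 6, 3), 3),    # 34-layer net with bottleneck blocks
    "ResNet-101": ((3, 4, 23, 3), 3),
    "ResNet-152": ((3, 8, 36, 3), 3),
}

for name, (blocks, per_block) in CONFIGS.items():
    print(name, resnet_depth(blocks, per_block))
```

Swapping the 2-layer blocks of the 34-layer net for 3-layer bottleneck blocks, with the same block counts, gives 16 × 3 + 2 = 50 layers.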
The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).
Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5).
This entry won the 1st place in ILSVRC 2015.
4.2. CIFAR-10 and Analysis
We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.
The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are 6n+2 stacked weighted layers in total. The following table summarizes the architecture:
<<TABLE>>
When shortcut connections are used, they are connected to the pairs of 3×3 layers (3n shortcuts in total). On this dataset we use identity shortcuts in all cases (i.e., option A),
<<TABLE>>
Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show best (mean±std) as in [43].
so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.
We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
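The step schedule described above can be written as a small function. The name and signature are hypothetical, and treating the drop as taking effect exactly at iterations 32k and 48k (rather than one step later) is an assumption.

```python
def cifar_learning_rate(iteration, base_lr=0.1, drops=(32_000, 48_000), total=64_000):
    """Learning rate at a given iteration: 0.1, divided by 10 at each drop point."""
    if not 0 <= iteration < total:
        raise ValueError("training terminates at iteration %d" % total)
    lr = base_lr
    for boundary in drops:
        # Assumption: the division applies from the boundary iteration onwards.
        if iteration >= boundary:
            lr /= 10
    return lr
```

The schedule therefore spends 32k iterations at 0.1, 16k at 0.01, and the final 16k at 0.001.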
We compare n = {3, 5, 7, 9}, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [42]), suggesting that such an optimization difficulty is a fundamental problem.
Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases.
We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging (footnote 5). So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin
Footnote 5. With an initial learning rate of 0.1, it starts converging (<90% error) after several epochs, but still reaches similar accuracy.
<<FIGURE>>
Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.
networks such as FitNet [35] and Highway [42] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).
Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.
Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200 that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this 10^3-layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).
But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both
<<TABLE>>
Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also Table 10 and 11 for better results.
<<TABLE>>
Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also Table 9 for better results.
have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout [10] or dropout [14] is applied to obtain the best results ([10, 25, 24, 35]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.
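The depth formula 6n+2 and the quoted model size can be reproduced with a rough parameter count for the CIFAR-10 architecture. The counting conventions are assumptions, not taken from the text: 3×3 convolutions without biases, two learnable BN parameters per channel after every convolution, parameter-free (option A) shortcuts, and a 64-to-10 fully-connected layer with biases. The function name is hypothetical.

```python
def cifar_resnet_stats(n, widths=(16, 32, 64), num_classes=10):
    """Return (depth, parameter count) for the 6n+2-layer CIFAR-10 ResNet."""
    depth = 6 * n + 2
    # First 3x3 convolution on the RGB input, plus its BN scale and shift.
    params = 3 * 3 * 3 * widths[0] + 2 * widths[0]
    c_in = widths[0]
    for width in widths:
        for _ in range(2 * n):               # 2n layers per feature map size
            params += 3 * 3 * c_in * width   # conv weights (no bias assumed)
            params += 2 * width              # BN scale and shift
            c_in = width
    params += widths[-1] * num_classes + num_classes  # final fc layer
    return depth, params


for n in (3, 5, 7, 9, 18, 200):
    depth, params = cifar_resnet_stats(n)
    print(n, depth, round(params / 1e6, 2))
```

Under these assumptions n = 200 gives 1202 layers and about 19.4M parameters, in line with the figure quoted above, while n = 18 gives 110 layers and about 1.7M parameters.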
4.3. Object Detection on PASCAL and MS COCO
Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [41] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations.
Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.
References
[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[2] C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995.
[3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. Siam, 2000.
[4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303-338, 2010.
[6] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware cnn model. In ICCV, 2015.
[7] R. Girshick. Fast R-CNN. In ICCV, 2015.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
[11] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[17] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
[18] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
[21] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
[23] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-50. Springer, 1998.
[24] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
[25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
[29] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
[30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[31] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[33] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015.
[34] B. D. Ripley. Pattern recognition and neural networks. Cambridge university press, 1996.
[35] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014.
[37] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
[38] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
[39] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207-226. Springer, 1998.
[40] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
[43] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[45] R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990.
[46] R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006.
[47] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods - backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
[48] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[49] W. Venables and B. Ripley. Modern applied statistics with s-plus. 1999.
[50] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
A. Object Detection Baselines
In this section we introduce our detection method based on the baseline Faster R-CNN [32] system. The models are initialized by the ImageNet classification models, and then fine-tuned on the object detection data. We have experimented with ResNet-50/101 at the time of the ILSVRC & COCO 2015 detection competitions.
Unlike VGG-16 used in [32], our ResNet has no hidden fc layers. We adopt the idea of "Networks on Conv feature maps" (NoC) [33] to address this issue. We compute the full-image shared conv feature maps using those layers whose strides on the image are no greater than 16 pixels (i.e., conv1, conv2_x, conv3_x, and conv4_x, 91 conv layers in total in ResNet-101; Table 1). We consider these layers as analogous to the 13 conv layers in VGG-16, and by doing so, both ResNet and VGG-16 have conv feature maps of the same total stride (16 pixels). These layers are shared by a region proposal network (RPN, generating 300 proposals) [32] and a Fast R-CNN detection network [7]. RoI pooling [7] is performed before conv5_1. On this RoI-pooled feature, all layers of conv5_x and up are adopted for each region, playing the roles of VGG-16's fc layers. The final classification layer is replaced by two sibling layers (classification and box regression [7]).
For the usage of BN layers, after pre-training, we compute the BN statistics (means and variances) for each layer on the ImageNet training set. Then the BN layers are fixed during fine-tuning for object detection. As such, the BN layers become linear activations with constant offsets and scales, and BN statistics are not updated by fine-tuning. We fix the BN layers mainly for reducing memory consumption in Faster R-CNN training.
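To illustrate why a BN layer with fixed statistics reduces to a constant scale and offset, the sketch below folds the four per-channel BN quantities into two numbers. The function names and the value of `eps` are assumptions; the example works on a single scalar channel for clarity.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Standard BN transform for one channel with fixed statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_frozen_bn(gamma, beta, mean, var, eps=1e-5):
    """Return (scale, offset) such that batch_norm(x) == scale * x + offset."""
    scale = gamma / math.sqrt(var + eps)
    offset = beta - mean * scale
    return scale, offset
```

Once `mean` and `var` are frozen, `scale * x + offset` gives the same output as the full BN transform, so no batch statistics need to be computed or stored during fine-tuning.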
PASCAL VOC
Following [7, 32], for the PASCAL VOC 2007 test set, we use the 5k trainval images in VOC 2007 and 16k trainval images in VOC 2012 for training ("07+12"). For the PASCAL VOC 2012 test set, we use the 10k trainval+test images in VOC 2007 and 16k trainval images in VOC 2012 for training ("07++12"). The hyper-parameters for training Faster R-CNN are the same as in [32]. Table 7 shows the results. ResNet-101 improves the mAP by >3% over VGG-16. This gain is solely because of the improved features learned by ResNet.
MS COCO
The MS COCO dataset [26] involves 80 object categories. We evaluate the PASCAL VOC metric (mAP @ IoU = 0.5) and the standard COCO metric (mAP @ IoU = .5:.05:.95). We use the 80k images on the train set for training and the 40k images on the val set for evaluation. Our detection system for COCO is similar to that for PASCAL VOC. We train the COCO models with an 8-GPU implementation, and thus the RPN step has a mini-batch size of 8 images (i.e., 1 per GPU) and the Fast R-CNN step has a mini-batch size of 16 images. The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.001 and then for 80k iterations with 0.0001.
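The two metrics differ only in the IoU thresholds at which a detection counts as a match. The sketch below computes IoU for axis-aligned boxes and lists the ten thresholds implied by the notation .5:.05:.95. It illustrates the matching criterion only, not a full mAP computation; the box format (x1, y1, x2, y2) and the helper names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


# Thresholds 0.50, 0.55, ..., 0.95 used by the standard COCO metric.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred, gt):
    """Thresholds at which `pred` overlaps `gt` enough to count as a match."""
    value = iou(pred, gt)
    return [t for t in COCO_IOU_THRESHOLDS if value >= t]
```

A box that passes the PASCAL VOC criterion (IoU ≥ 0.5) may still fail most of the stricter COCO thresholds, which is why the COCO metric rewards more precise localization.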
Table 8 shows the results on the MS COCO validation set. ResNet-101 has a 6% increase of mAP@[.5, .95] over VGG-16, which is a 28% relative improvement, solely contributed by the features learned by the better network. Remarkably, the mAP@[.5, .95]'s absolute increase (6.0%) is nearly as big as mAP@.5's (6.9%). This suggests that a deeper network can improve both recognition and localization.
B. Object Detection Improvements
For completeness, we report the improvements made for the competitions. These improvements are based on deep features and thus should benefit from residual learning.
MS COCO
Box refinement. Our box refinement partially follows the iterative localization in [6]. In Faster R-CNN, the final output is a regressed box that is different from its proposal box. So for inference, we pool a new feature from the regressed box and obtain a new classification score and a new regressed box. We combine these 300 new predictions with the original 300 predictions. Non-maximum suppression (NMS) is applied on the union set of predicted boxes using an IoU threshold of 0.3 [8], followed by box voting [6]. Box refinement improves mAP by about 2 points (Table 9).
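The NMS step on the union of original and refined predictions can be sketched with a standard greedy procedure. This is a generic illustration under assumptions (box format (x1, y1, x2, y2), hypothetical function names), not the exact implementation used for the competition, and the box voting step of [6] is omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def refine_and_suppress(orig_boxes, orig_scores, new_boxes, new_scores):
    """Apply NMS to the union of the original and the re-scored refined boxes."""
    boxes = list(orig_boxes) + list(new_boxes)
    scores = list(orig_scores) + list(new_scores)
    return [(boxes[i], scores[i]) for i in nms(boxes, scores)]
```

In the setting described above, each of the two input sets would hold 300 boxes for the class being processed.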
Global context. We combine global context in the Fast R-CNN step. Given the full-image conv feature map, we pool a feature by global Spatial Pyramid Pooling [12] (with a "single-level" pyramid) which can be implemented as "RoI" pooling using the entire image's bounding box as the RoI. This pooled feature is fed into the post-RoI layers to obtain a global context feature. This global feature is concatenated with the original per-region feature, followed by the sibling classification and box regression layers. This new structure is trained end-to-end. Global context improves mAP@.5 by about 1 point (Table 9).
Multi-scale testing. In the above, all results are obtained by single-scale training/testing as in [32], where the image's shorter side is s = 600 pixels. Multi-scale training/testing has been developed in [12, 7] by selecting a scale from a feature pyramid, and in [33] by using maxout layers. In our current implementation, we have performed multi-scale testing following [33]; we have not performed multi-scale training because of limited time. In addition, we have performed multi-scale testing only for the Fast R-CNN step (but not yet for the RPN step). With a trained model, we compute conv feature maps on an image pyramid, where the image's shorter sides are s ∈ {200, 400, 600, 800, 1000}.
<<TABLE>>
Table 9. Object detection improvements on MS COCO using Faster R-CNN and ResNet-101.
<<TABLE>>
Table 10. Detection results on the PASCAL VOC 2007 test set. The baseline is the Faster R-CNN system. The system "baseline+++" includes box refinement, context, and multi-scale testing in Table 9.
system | net | data | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv
baseline | VGG-16 | 07++12 | 70.4 | 84.9 | 79.8 | 74.3 | 53.9 | 49.8 | 77.5 | 75.9 | 88.5 | 45.6 | 77.1 | 55.3 | 86.9 | 81.7 | 80.9 | 79.6 | 40.1 | 72.6 | 60.9 | 81.2 | 61.5
baseline | ResNet-101 | 07++12 | 73.8 | 86.5 | 81.6 | 77.2 | 58.0 | 51.0 | 78.6 | 76.6 | 93.2 | 48.6 | 80.4 | 59.0 | 92.1 | 85.3 | 84.8 | 80.7 | 48.1 | 77.3 | 66.5 | 84.7 | 65.6
baseline+++ | ResNet-101 | COCO+07++12 | 83.8 | 92.1 | 88.4 | 84.8 | 75.9 | 71.4 | 86.3 | 87.8 | 94.2 | 66.8 | 89.4 | 69.2 | 93.9 | 91.9 | 90.9 | 89.6 | 67.9 | 88.2 | 76.8 | 90.3 | 80.0
Table 11. Detection results on the PASCAL VOC 2012 test set (http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=4). The baseline is the Faster R-CNN system. The system "baseline+++" includes box refinement, context, and multi-scale testing in Table 9.
We select two adjacent scales from the pyramid following [33]. RoI pooling and subsequent layers are performed on the feature maps of these two scales [33], which are merged by maxout as in [33]. Multi-scale testing improves the mAP by over 2 points (Table 9).
Using validation data. Next we use the 80k+40k trainval set for training and the 20k test-dev set for evaluation. The test-dev set has no publicly available ground truth and the result is reported by the evaluation server. Under this setting, the results are an mAP@.5 of 55.7% and an mAP@[.5, .95] of 34.9% (Table 9). This is our single-model result.
Ensemble. In Faster R-CNN, the system is designed to learn region proposals and also object classifiers, so an ensemble can be used to boost both tasks. We use an ensemble for proposing regions, and the union set of proposals is processed by an ensemble of per-region classifiers. Table 9 shows our result based on an ensemble of 3 networks. The mAP@.5 is 59.0% and the mAP@[.5, .95] is 37.4% on the test-dev set. This result won the 1st place in the detection task in COCO 2015.
We revisit the PASCAL VOC dataset based on the above model. With the single model on the COCO dataset (55.7% mAP@.5 in Table 9), we fine-tune this model on the PASCAL VOC sets. The improvements of box refinement, context, and multi-scale testing are also adopted. By doing so we achieve 85.6% mAP on PASCAL VOC 2007 (Table 10) and 83.8% on PASCAL VOC 2012 (Table 11). The result on PASCAL VOC 2012 is 10 points higher than the previous state-of-the-art result [6].
<<TABLE>>
Table 12. Our results (mAP, %) on the ImageNet detection dataset. Our detection system is Faster R-CNN [32] with the improvements in Table 9, using ResNet-101.
ImageNet Detection
The ImageNet Detection (DET) task involves 200 object categories. The accuracy is evaluated by mAP@.5. Our object detection algorithm for ImageNet DET is the same as that for MS COCO in Table 9. The networks are pre-trained on the 1000-class ImageNet classification set, and are fine-tuned on the DET data. We split the validation set into two parts (val1/val2) following [8]. We fine-tune the detection models using the DET training set and the val1 set. The val2 set is used for validation. We do not use other ILSVRC 2015 data. Our single model with ResNet-101 has
<<TABLE>>
Table 13. Localization error (%) on the ImageNet validation. In the column of "LOC error on GT class" ([41]), the ground truth class is used. In the "testing" column, "1-crop" denotes testing on a center crop of 224×224 pixels, "dense" denotes dense (fully convolutional) and multi-scale testing.
<<TABLE>>
Table 14. Comparisons of localization error (%) on the ImageNet dataset with state-of-the-art methods.
58.8% mAP and our ensemble of 3 models has 62.1% mAP on the DET test set (Table 12). This result won the 1st place in the ImageNet detection task in ILSVRC 2015, surpassing the second place by 8.5 points (absolute).
C. ImageNet Localization
The ImageNet Localization (LOC) task [36] requires classifying and localizing the objects. Following [40, 41], we assume that the image-level classifiers are first adopted for predicting the class labels of an image, and the localization algorithm only accounts for predicting bounding boxes based on the predicted classes. We adopt the "per-class regression" (PCR) strategy [40, 41], learning a bounding box regressor for each class. We pre-train the networks for ImageNet classification and then fine-tune them for localization. We train networks on the provided 1000-class ImageNet training set.
Our localization algorithm is based on the RPN framework of [32] with a few modifications. Unlike the way in [32] that is category-agnostic, our RPN for localization is designed in a per-class form. This RPN ends with two sibling 1×1 convolutional layers for binary classification (cls) and box regression (reg), as in [32]. The cls and reg layers are both in a per-class form, in contrast to [32]. Specifically, the cls layer has a 1000-d output, and each dimension is binary logistic regression for predicting being or not being an object class; the reg layer has a 1000×4-d output consisting of box regressors for 1000 classes. As in [32], our bounding box regression is with reference to multiple translation-invariant "anchor" boxes at each position.
As in our ImageNet classification training (Sec. 3.4), we randomly sample 224×224 crops for data augmentation. We use a mini-batch size of 256 images for fine-tuning. To avoid negative samples dominating, 8 anchors are randomly sampled for each image, where the sampled positive and negative anchors have a ratio of 1:1 [32]. For testing, the network is applied on the image fully-convolutionally.
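The balanced anchor sampling can be sketched as below. The function name and the 0/1 label encoding are hypothetical. Topping up with negatives when an image has fewer than four positive anchors is an assumption, since the text only states the 1:1 ratio.

```python
import random


def sample_anchors(labels, rng, num_samples=8):
    """Pick anchor indices with a 1:1 positive/negative ratio where possible.

    labels: list with 1 for a positive anchor and 0 for a negative anchor.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    num_pos = min(num_samples // 2, len(positives))
    # Assumption: if positives are scarce, the remainder is filled with negatives.
    num_neg = min(num_samples - num_pos, len(negatives))
    return rng.sample(positives, num_pos) + rng.sample(negatives, num_neg)
```

Because negatives vastly outnumber positives among all anchors of an image, sampling a fixed balanced subset keeps the loss from being dominated by the negative class.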
Table 13 compares the localization results. Following [41], we first perform "oracle" testing using the ground truth class as the classification prediction. VGG's paper [41] reports a center-crop error of 33.1% (Table 13) using ground truth classes. Under the same setting, our RPN method using ResNet-101 net significantly reduces the center-crop error to 13.3%. This comparison demonstrates the excellent performance of our framework. With dense (fully convolutional) and multi-scale testing, our ResNet-101 has an error of 11.7% using ground truth classes. Using ResNet-101 for predicting classes (4.6% top-5 classification error, Table 4), the top-5 localization error is 14.4%.
The above results are only based on the proposal network (RPN) in Faster R-CNN [32]. One may use the detection network (Fast R-CNN [7]) in Faster R-CNN to improve the results. But we notice that on this dataset, one image usually contains a single dominant object, and the proposal regions highly overlap with each other and thus have very similar RoI-pooled features. As a result, the image-centric training of Fast R-CNN [7] generates samples of small variations, which may not be desired for stochastic training. Motivated by this, in our current experiment we use the original R-CNN [8] that is RoI-centric, in place of Fast R-CNN.
Our R-CNN implementation is as follows. We apply the per-class RPN trained as above on the training images to predict bounding boxes for the ground truth class. These predicted boxes play the role of class-dependent proposals. For each training image, the 200 highest-scored proposals are extracted as training samples to train an R-CNN classifier. The image region is cropped from a proposal, warped to 224×224 pixels, and fed into the classification network as in R-CNN [8]. The outputs of this network consist of two sibling fc layers for cls and reg, also in a per-class form. This R-CNN network is fine-tuned on the training set using a mini-batch size of 256 in the RoI-centric fashion. For testing, the RPN generates the 200 highest-scored proposals for each predicted class, and the R-CNN network is used to update these proposals' scores and box positions.
This method reduces the top-5 localization error to 10.6% (Table 13). This is our single-model result on the validation set. Using an ensemble of networks for both classification and localization, we achieve a top-5 localization error of 9.0% on the test set. This number significantly outperforms the ILSVRC 14 results (Table 14), showing a 64% relative reduction of error. This result won the 1st place in the ImageNet localization task in ILSVRC 2015.
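As a quick check on the arithmetic, the relative reduction can be written out explicitly. The symbols e_old and e_new are introduced here for illustration only; the exact prior-year figure is the one listed in Table 14, and the value below is merely what the two quoted numbers imply.

```latex
r = \frac{e_{\text{old}} - e_{\text{new}}}{e_{\text{old}}}
\quad\Longrightarrow\quad
e_{\text{old}} = \frac{e_{\text{new}}}{1 - r}
\approx \frac{9.0\%}{1 - 0.64} = 25.0\%
```

In other words, a 9.0% error combined with a 64% relative reduction corresponds to a previous best error of roughly 25%.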
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures
Julien Launay 1;2 Iacopo Poli 1 François Boniface 1 Florent Krzakala 1;2
1 LightOn 2 École Normale Supérieure
Abstract
Despite being the workhorse of deep learning, the backpropagation algorithm is
no panacea. It enforces sequential layer updates, thus preventing efficient paral-
lelization of the training process. Furthermore, its biological plausibility is being
challenged. Alternative schemes have been devised; yet, under the constraint of
synaptic asymmetry, none have scaled to modern deep learning tasks and architec-
tures. Here, we challenge this perspective, and study the applicability of Direct
Feedback Alignment to neural view synthesis, recommender systems, geometric
learning, and natural language processing. In contrast with previous studies lim-
ited to computer vision tasks, our findings show that it successfully trains a large
range of state-of-the-art deep learning architectures, with performance close to
fine-tuned backpropagation. At variance with common beliefs, our work supports
that challenging tasks can be tackled in the absence of weight transport.
1 Introduction
While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements,
it is not without pitfalls. For one, its weight updates are non-local and rely on upstream layers. Thus,
they cannot be easily parallelized [3], incurring important memory and compute costs. Moreover,
its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the
weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward
path: this is implausible in biological brains, and known as the weight transport problem [6].
Consequently, alternative training algorithms have been developed. Some of these algorithms are
explicitly biologically inspired [7-13], while others focus on making better use of available compute
resources [3,14-19]. Despite these enticing characteristics, none has been widely adopted, as they
are often demonstrated on a limited set of tasks. Moreover, as assessed in [20], their performance on
challenging datasets under the constraint of synaptic asymmetry is disappointing.
We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment
(DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural
view synthesis and recommender systems, to geometric learning with graph convolutions, and natural
language processing with Transformers. Our results define new standards for learning without weight
transport and show that challenging tasks can indeed be tackled under synaptic asymmetry.
All code needed to reproduce our experiments is available at https://github.com/lightonai/dfa-scales-to-modern-deep-learning.
1.1 Related work
Training a neural network is a credit assignment problem: an update is derived for each parameter
from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21].
Biologically motivated methods Finding a training method applicable under the constraints of
biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur
[22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic
asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10-12] train distinct
feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the
transpose of the forward weights used in the backward pass by a random matrix. Throughout training,
the forward weights learn to align with the arbitrary backward weights, eventually approximating BP.
Beyond biological considerations As deep learning models grow bigger, large-scale distributed
training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer
by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass,
updates must only depend on local quantities. Unsupervised learning is naturally suited for this,
as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly,
synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES)
[16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA
and directly projects a global error to each layer. A shared feedback path is still needed, but it only
depends on a simple random projection operation.
Performance of alternative methods Local training methods are successful in unsupervised learn-
ing [18]. Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet
[14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these
tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment
techniques to perform well on challenging datasets, some form of weight transport is necessary:
either by explicitly sharing sign information [24-26], or by introducing dedicated phases of alignment
for the forward and backward weights where some information is shared [27]. To the best of our
knowledge, no method compatible with the weight transport problem has ever been demonstrated on
challenging tasks.
1.2 Motivations and contributions
We focus on DFA, a compromise between biological and computational considerations. Notably,
DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly
preventing learning in demanding settings. Moreover, it allows for asynchronous weight updates,
and puts a single operation at the center of the training stage. This enables new classes of training
co-processors [28, 29], leveraging dedicated hardware to perform the random projection.
Extensive survey We apply DFA in a large variety of settings matching current trends in machine
learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but
computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly
different domains, across eight tasks, and with eleven different architectures. This constitutes a survey
of unprecedented scale for an alternative training method, and makes a strong case for the possibility
of learning without weight transport in demanding scenarios.
Challenging settings We demonstrate the ability of DFA to tackle challenging tasks. We success-
fully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale
(section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language
modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, that have
only been recently successfully tackled with deep learning.
Modern architectures We prove that the previously established failure of DFA to train convolutions
does not generalize. By evaluating performance metrics, comparing against a shallow baseline,
measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in
layers involving graph convolutions and attention. This significantly broadens the applicability of
DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.
2 Methods
Forward pass In a fully connected network, at layer i out of N, neglecting its biases, with W_i its
weight matrix, f_i its non-linearity, and h_i its activations, the forward pass is:
<<FORMULA>> (1)
<<FORMULA>> is the input data, and <<FORMULA>> are the predictions. A task-specific cost function
<<FORMULA>> is computed to quantify the quality of the predictions with respect to the targets y.
Backward pass with BP The weight updates are computed by backpropagation of the error vector.
Using the chain-rule of derivatives, each neuron is updated based on its contribution to the cost
function. Leaving aside the specifics of the optimizer used, the equation for the weight updates is:
<<FORMULA>> (2)
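The formulas for Eq. (1) and Eq. (2) are not preserved in this text. The block below is a reconstruction in the notation defined above (W_i, f_i, h_i), offered as a sketch of the standard forward-pass and backpropagation equations rather than a verbatim copy. The pre-activation symbol a_i and the shorthand \delta a_i are assumptions introduced for readability.

```latex
% Forward pass, Eq. (1)
\forall i \in [1, \dots, N]: \quad a_i = W_i h_{i-1}, \quad h_i = f_i(a_i),
\qquad h_0 = X, \quad h_N = \hat{y}

% Backward pass with BP, Eq. (2)
\delta W_i = -\frac{\partial \mathcal{L}}{\partial W_i}
           = -\left[ \left( W_{i+1}^{T} \, \delta a_{i+1} \right) \odot f_i'(a_i) \right] h_{i-1}^{T},
\qquad \delta a_i = \frac{\partial \mathcal{L}}{\partial a_i}
```

The term W_{i+1}^T \delta a_{i+1} is where the transpose of the forward weights enters the update, which is the source of the weight transport problem.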
Backward pass with DFA The gradient signal <<FORMULA>> of the (i+1)-th layer violates synaptic
asymmetry. DFA replaces it with a random projection of the topmost derivative of the loss, <<FORMULA>>.
For common classification and regression losses such as the mean squared error or the negative log
likelihood, this corresponds to a random projection of the global error <<FORMULA>>. With B_i, a fixed
random matrix of appropriate shape drawn at initialization for each layer:
<<FORMULA>> (3)
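Eq. (3) is not preserved in this text. In the standard formulation of DFA it reads \delta W_i = -[(B_i \delta a_y) \odot f_i'(a_i)] h_{i-1}^T, with \delta a_y the derivative of the loss with respect to the output pre-activations. The NumPy sketch below contrasts this update with the BP update on a small tanh network. It is an illustration under stated assumptions, not the authors' implementation: the layer widths, the mean squared error loss, the linear output layer, the learning rate, and the choice to give the top layer its true gradient are all assumptions made here.

```python
import numpy as np

def forward(weights, x):
    """Return pre-activations a_i and activations h_i (h_0 = x) of a tanh MLP.
    Assumption: the last layer is linear (regression-style output)."""
    hs, pre = [x], []
    for i, W in enumerate(weights):
        a = W @ hs[-1]
        pre.append(a)
        hs.append(a if i == len(weights) - 1 else np.tanh(a))
    return pre, hs

def bp_updates(weights, x, y):
    """Backpropagation updates for the loss 0.5 * mean ||y_hat - y||^2."""
    pre, hs = forward(weights, x)
    delta = (hs[-1] - y) / x.shape[1]            # dL/da at the output
    updates = [None] * len(weights)
    for i in reversed(range(len(weights))):
        updates[i] = -delta @ hs[i].T
        if i > 0:                                # uses the transpose of W_i
            delta = (weights[i].T @ delta) * (1.0 - np.tanh(pre[i - 1]) ** 2)
    return updates

def dfa_updates(weights, feedback, x, y):
    """DFA updates: hidden layers see a fixed random projection of the error."""
    pre, hs = forward(weights, x)
    e = (hs[-1] - y) / x.shape[1]                # global error signal
    updates = []
    for i in range(len(weights)):
        if i == len(weights) - 1:
            delta = e                            # top layer: true gradient
        else:                                    # B_i e replaces W^T delta
            delta = (feedback[i] @ e) * (1.0 - np.tanh(pre[i]) ** 2)
        updates.append(-delta @ hs[i].T)
    return updates

def alignment(u, v):
    """Cosine similarity between two update matrices."""
    return float(np.sum(u * v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

rng = np.random.default_rng(0)
sizes = [10, 32, 32, 2]                          # hypothetical layer widths
weights = [rng.normal(0, 1 / np.sqrt(m), (n, m))
           for m, n in zip(sizes[:-1], sizes[1:])]
# One fixed random B_i per hidden layer, mapping the output error to its width.
feedback = [rng.normal(0, 1 / np.sqrt(sizes[-1]), (n, sizes[-1]))
            for n in sizes[1:-1]]
x = rng.normal(size=(sizes[0], 64))
y = rng.normal(size=(sizes[-1], 64))

for step in range(100):
    for W, dW in zip(weights, dfa_updates(weights, feedback, x, y)):
        W += 0.01 * dW                           # plain step; optimizer is an assumption

# Cosine between the DFA and BP updates of the first layer after training.
first_layer_alignment = alignment(dfa_updates(weights, feedback, x, y)[0],
                                  bp_updates(weights, x, y)[0])
```

Because every hidden-layer update depends only on the global error and local quantities, the per-layer updates in `dfa_updates` could in principle be computed in parallel, whereas `bp_updates` must walk the layers in reverse order.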
3 Experiments
We study the applicability of DFA to a diverse set of applications requiring state-of-the-art architec-
tures. We start with fully connected networks, where DFA has already been demonstrated, and address
new challenging settings. We then investigate geometric learning: we apply DFA to graph neural net-
works in classification tasks on citation networks, as well as graph autoencoders. These architectures
feature graph convolutions and attention layers. Finally, we use DFA to train a transformer-based
Natural Language Processing (NLP) model on a dataset of more than 100 million tokens.
3.1 Fully connected architectures
DFA has been successful at training fully connected architectures, with performance on-par with
backpropagation [19,20]. However, only computer vision tasks have been considered, where fully
connected networks considerably underperform their convolutional counterpart. Here, we focus on
tasks where fully connected architectures are state-of-the-art. Moreover, the architectures considered
are deeper and more complex than those necessary to solve a simple task like MNIST.
3.1.1 Neural view synthesis with Neural Radiance Fields
The most recent state-of-the-art neural view synthesis methods are based on large fully connected
networks: this is an ideal setting for a first evaluation of DFA on a challenging task.
Background There has been growing interest in methods capable of synthesizing novel renders of
a 3D scene using a dataset of past renders. The network is trained to learn an inner representation of
the scene, and a classical rendering system can then query the model to generate novel views. With
robust enough methods, real-world scenes can also be learned from a set of pictures.
Until recently, most successful neural view synthesis methods were based on sampled volumetric
representations [30-32]. In this context, Convolutional Neural Networks (CNNs) can be used to
smooth out the discrete sampling of 3D space [33,34]. However, these methods scale poorly to
higher resolutions, as they still require finer and finer sampling. Conversely, alternative schemes
based on a continuous volume representation have succeeded in generating high-quality renders [35],
even featuring complex phenomena such as view-dependent scattering [36]. These schemes make
point-wise predictions, and use fully connected neural networks to encode the scene.
<<FIGURE>>
Figure 1: Comparisons of NeRF-DFA with state-of-the-art methods trained with BP on the most
challenging synthetic and real-world scenes. While NeRF-DFA generates renders of lower quality,
they maintain multi-view consistency and exhibit no geometric artifacts. BP results from [36].
Setting We employ Neural Radiance Fields (NeRF) [36], the state-of-the-art for neural view
synthesis. NeRF represents scenes as a continuous 5D function of space (three spatial coordinates and
two viewing angles) and outputs a point-wise RGB radiance and opacity. A ray-casting renderer can
then query the network to generate arbitrary views of the scene. The network modeling the continuous
function is 10 layers deep. Two identical networks are trained: the coarse network predictions inform
the renderer about the spatial coordinates that the fine network should preferentially evaluate to avoid
empty space and occluded regions.
Results We report quantitative results of training NeRF with DFA in Table 1. Neural view synthesis
methods are often better evaluated qualitatively: we showcase some renders in Figure 1.
On a dataset of renders featuring complex scenes with non-Lambertian materials (NeRF-Synthetic
[36]), NeRF-DFA outperforms two previous fine-tuned state-of-the-art methods, Scene Representation
Networks (SRN) [35] and Local Light Field Fusion (LLFF) [32], and nearly matches the performance
of Neural Volumes (NV) [34]. While DFA underperforms alternative methods trained with BP on
the real world view dataset (LLFF-Real [32]), its performance remains significant: real world view
synthesis is a challenging task, and this level of PSNR indicates that learning is indeed happening.
In particular, we find that NeRF-DFA retains the key characteristics of NeRF-BP: it can render
view-dependent effects, and is multi-view consistent. The last point is an especially important achievement,
and most visible in videos, as it is a challenge for most algorithms [30-32,35]. The main drawback
of NeRF-DFA appears to be a seemingly lower render definition. The NeRF architecture has not
been fine-tuned to achieve these results: DFA works out-of-the-box on this advanced method. Future
research focusing on architectural changes to NeRF could improve performance with DFA; some
preliminary results are included in the supplementary material.
Table 1: Peak Signal to Noise Ratio (PSNR, higher is better) of neural view synthesis methods
trained with backpropagation against NeRF trained with DFA. Even when trained with DFA, NeRF
outperforms two state-of-the-art methods on a synthetic dataset (NeRF-Synthetic), and achieves fair
performance on a challenging real-world views dataset (LLFF-Real). BP results from [36].
<<TABLE>>
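For reference, PSNR, the metric reported in Table 1, is a logarithmic function of the mean squared error between a render and its ground-truth image. The sketch below uses the standard definition; the function name `psnr` and the default peak value of 1.0 (pixel intensities scaled to [0, 1]) are assumptions, since the text does not state the image range used.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Hypothetical 4x4 RGB images that differ by 0.1 at every pixel.
reference_image = np.zeros((4, 4, 3))
rendered_image = np.full((4, 4, 3), 0.1)
example_psnr = psnr(rendered_image, reference_image)
```

Since the scale is logarithmic, halving the root mean squared error raises PSNR by about 6 dB, which helps when reading gaps between methods in the table.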
3.1.2 Click-through rate prediction with recommender systems
We have demonstrated that DFA can train large fully connected networks on the difficult task of neural
view synthesis. We now seek to use DFA in more complex heterogeneous architectures, combining
the use of fully connected networks with other machine learning methods. Recommender systems are
an ideal application for such considerations.
Background Recommender systems are used to model the behavior of users and predict future
interactions. In particular, in the context of click-through rate (CTR) prediction, these systems model
the probability of a user clicking on a given item. Building recommender systems is hard [37]: their
input is high-dimensional and sparse, and the model must learn to extract high-order combinatorial
features from the data. Moreover, they need to do so efficiently, as they are used to make millions of
predictions and the training data may contain billions of examples.
Factorization Machines (FM) [38] use inner-products of latent vectors between features to extract
pairwise feature interactions. They constitute an excellent baseline for shallow recommender systems,
but fail to efficiently transcribe higher-level features. To avoid extensive feature engineering, it has
been suggested that deep learning can be used in conjunction with wide shallow models to extract
these higher-level features [39]. In production, these systems are regularly retrained on massive
datasets: the speedup allowed by backward unlocking in DFA is thus of particular interest.
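The pairwise interaction term of a Factorization Machine can be sketched as below. The sketch relies on the well-known identity that rewrites the sum over feature pairs as a difference of squared sums, which brings the cost from quadratic to linear in the number of features. The function names and the dense input vector are assumptions made for illustration; production systems operate on sparse inputs.

```python
import numpy as np

def fm_pairwise(x, V):
    """Second-order FM term: sum over i < j of <V[i], V[j]> * x[i] * x[j].

    x: (n,) feature vector; V: (n, k) latent vectors, one row per feature.
    Computed in O(n k) via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2].
    """
    xv = x @ V                      # (k,): sum_i x_i * v_i
    x2v2 = (x ** 2) @ (V ** 2)      # (k,): sum_i x_i^2 * v_i^2
    return 0.5 * float(np.sum(xv ** 2 - x2v2))

def fm_pairwise_naive(x, V):
    """Reference O(n^2 k) double loop, used only to check the identity."""
    n = len(x)
    return float(sum((V[i] @ V[j]) * x[i] * x[j]
                     for i in range(n) for j in range(i + 1, n)))

rng = np.random.default_rng(0)
x_example = rng.normal(size=6)          # hypothetical dense feature vector
V_example = rng.normal(size=(6, 3))     # hypothetical latent dimension k = 3
fast = fm_pairwise(x_example, V_example)
slow = fm_pairwise_naive(x_example, V_example)
```

Both functions return the same value up to floating-point error; only the fast form is practical when the input has on the order of a million sparse features.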
Setting Deep Factorization Machines (DeepFM) [40] combine FM and a deep fully connected
neural network, which we train with DFA. The input embedding is still trained directly via gradient
descent, as weight transport is not necessary to backpropagate through the FM. Deep & Cross
Networks (DCN) [41] replace the FM with a Cross Network, a deep architecture without non-
linearities capable of extracting high-degree interactions across features. We train the fully connected
network, the deep cross network, and the embeddings with DFA. Finally, Adaptive Factorization
Network (AFN) [42] uses Logarithmic Neural Networks [43] to enhance the representational power
of its deep component. We evaluate these methods on the Criteo dataset [44], which features nearly
46 million samples of one million sparse features. This is a difficult task, where performance
improvements of the AUC on the 0.001-level can enhance CTR significantly [39].
Results Performance metrics are reported in Table 2. To obtain these results, a simple hyperpa-
rameter grid search over optimization and regularization parameters was performed for BP and DFA
independently. DFA successfully trains all methods above the FM baseline, and in fact matches BP
performance in both DeepFM and AFN. Because of their complexity, recommender systems require
intensive tuning and feature engineering to perform at the state-of-the-art level, and reproducing
existing results can be challenging [45]. Hence, it is not surprising that a performance gap exists with
Deep&Cross: further fine-tuning may be necessary for DFA to reach BP performance.
Alignment measurements corroborate that learning is indeed occurring in the special layers of
Deep&Cross and AFN (see supplementary for details). Our results on recommender systems support
that DFA can learn in a large variety of settings, and that weight transport is not necessary to solve a
difficult recommendation task.
Table 2: AUC (higher is better) and log loss (lower is better) of recommender systems trained on the
Criteo dataset [44]. Even in complex heterogeneous architectures, DFA performance is in line with
BP. Values in bold indicate DFA AUC within 0.001 from the BP AUC or better.
<<TABLE>>
3.2 Geometric Learning with Graph Convolutional Networks
The use of sophisticated architectures beyond fully connected layers is necessary for certain tasks,
such as geometric learning [46], where information lies in a complex structured domain. To address
geometric learning tasks, methods capable of handling graph-based data are commonly needed.
Graph convolutional neural networks (GCNNs) [47-50] have demonstrated the ability to process
large-scale graph data efficiently. We study the applicability of DFA to these methods, including
recent architectures based on an attention mechanism. Overall, this is an especially interesting setting,
as DFA fails to train more classic 2D image convolutional layers [23].
Background Complex data like social networks or brain connections lie on irregular or non-
Euclidean domains. They can be represented as graphs, and efficient processing in the spectral
domain is possible. Non-spectral techniques to apply neural networks to graphs have also been
developed [51-53], but they exhibit unfavorable scaling properties. The success of CNNs in deep
learning can be attributed to their ability to efficiently process structured high-dimensional data
by sharing local filters. Thus, a generalization of the convolution operator to the graph domain is
desirable: [47] first proposed a spectral convolution operation for graphs, and [48] introduced a form
of regularization to enforce spatial locality of the filters. We use DFA to train several such GCNN
implementations. We study both spectral and non-spectral convolutions, as well as methods inspired
by the attention mechanism. We consider the task of semi-supervised node classification: nodes from
a graph are classified using their relationship to other nodes as well as node-wise features.
Setting Fast Localized Convolutions (ChebConv) [49] approximate the graph convolution kernel
with Chebyshev polynomials, and are one of the first scalable convolution methods on graphs. Graph
Convolutions (GraphConv) [50] remove the need for an explicit parametrization of the kernel by
enforcing linearity of the convolution operation on the graph Laplacian spectrum. It is often considered
as the canonical graph convolution. More recent methods do not operate in the spectral domain. Spline
Convolutions (SplineConv) [54] use a spline-based kernel, enabling the inclusion of information
about the relative positioning of nodes, enhancing their representational power, for instance in the
context of 3D meshes. Graph Attention Networks (GATConv) [55] use self-attention [56] layers to
enable predictions at a given node to attend more specifically to certain parts of its neighborhood.
Finally, building upon Jumping Knowledge Network [57], Just Jump (DNAConv) [58] uses multi-
head attention [59] to enhance the aggregation process in graph convolutions and enable deeper
architectures. We use PyTorch Geometric [60] for reference implementation of all of these methods.
We evaluate performance on three citation network datasets: Cora, CiteSeer, and PubMed [61].
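To make the canonical graph convolution concrete, the sketch below implements the widely used propagation rule in which node features are averaged over neighbors with a symmetrically normalized adjacency matrix (including self-loops) and then linearly transformed. It is a dense NumPy illustration, not the PyTorch Geometric implementation used in the experiments; the function names, the ReLU non-linearity, and the toy graph are assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(A_hat, H, W):
    """One graph convolution layer: aggregate neighbors, project, apply ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Hypothetical 3-node path graph 0 - 1 - 2 with 4 input features per node.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))     # node features
W0 = rng.normal(size=(4, 2))     # layer weights
A_hat = normalized_adjacency(A)
H1 = graph_conv(A_hat, H0, W0)   # (3, 2) hidden representation
```

With DFA, the error would be projected directly onto the pre-activations of each such layer, in the same way as for a fully connected layer.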
Results We report classification accuracy in Table 3. BP and DFA regularization and optimiza-
tion hyperparameters are fine-tuned separately on the Cora dataset. In general, we find that less
regularization and lower learning rates are needed with DFA. DFA successfully trains all graph
methods, independent of whether they use the spectral domain or not, and even if they use attention.
Furthermore, for GraphConv, SplineConv, and GATConv, DFA performance nearly matches BP.
As GCNNs struggle with learning meaningful representations when stacking many layers [62], all
architectures but DNAConv are quite shallow (two layers). However, DFA performance is still
significantly higher than that of a shallow training method (see supplementary for details). The lower
performance on DNAConv is not a failure to learn: alignment measurements show that learning is
indeed occurring. It may be explained instead by a need for more in-depth fine-tuning, as this is a
deep architecture with 5 successive attention layers.
Table 3: Classification accuracy (%, higher is better) of graph convolution methods trained with BP
and DFA, on citation networks [61]. Except for ChebConv and DNAConv, DFA performance nearly
matches BP performance. Values in bold when DFA is within 2.5% of BP.
<<TABLE>>
Table 4: AUC and Average Precision (AP, higher is better) for a GraphConv GAE trained with BP or DFA
on citation networks. DFA reproduces BP performance.
Figure 2: t-SNE visualization of the hidden layer activations of a two-layer GraphConv trained on
Cora with DFA. Classes form clear clusters, indicating that a useful intermediary representation is
learned. Colors represent different classes.
We further demonstrate that DFA helps graph convolutions learn meaningful representations by
applying t-SNE [63,64] to the hidden layer activations in GraphConv (Figure 2). Clusters of classes
are well-separated, indicating that a useful intermediary representation is derived by the first layer.
Graph autoencoders We consider one last application of graph convolutions, in the context of
graph autoencoders (GAE). We train a non-probabilistic GAE [65] based on GraphConv with DFA,
and report results in Table 4. DFA performance is always in line with BP.
3.3 Natural Language Processing with Transformers
We complete our study by training a Transformer [59] on a language modelling task. Transformers
have proved successful in text, image, music generation, machine translation, and many supervised
NLP tasks [59,66-69]. Here, we demonstrate that DFA can train them, and we show the influence of
tuning the optimizer hyperparameters in narrowing the gap with BP.
Background NLP has largely benefited from advances in deep learning. Recurrent Neural Net-
works were responsible for early breakthroughs, but their sequential nature prevented efficient
parallelization of data processing. Transformers are attention-based models that do not rely on
recurrence or convolution. Their ability to scale massively has allowed the training of models with
several billion parameters [70,71], obtaining state-of-the-art results on all NLP tasks: Transformers
now top the prominent SQuAD 2.0 [72,73] and SuperGLUE [74] benchmarks. In parallel, transfer
learning in NLP has leaped forward thanks to language modelling, the unsupervised task of predicting
the next word. It can leverage virtually unlimited data from web scraping [75]. This enabled the
training of universal language models [76] on extremely large and diversified text corpora. These
models are useful across a wide range of domains, and can solve most NLP tasks after fine-tuning.
Setting The prominence of both language modelling and Transformers gives us the ideal candidate
for our NLP experiments: we train a Transformer to predict the next word on the WikiText-103
dataset [77], a large collection of good and featured Wikipedia articles. We use byte-pair-encoding
[78] with 32,000 tokens. Our setup is similar to GPT [66]: we adapt the Transformer, originally an
encoder-decoder model designed for machine translation, to language modelling. We keep only the
encoder and mask the tokens to predict. Our architecture consists of 6 layers, 8 attention heads, a
model dimension of 512, and a hidden size of 2048 in the feed-forward blocks. The text is sliced
in chunks of 128 tokens and batches of 64 such chunks, resulting in 8192 tokens per batch. Our
baseline is trained with BP using the optimization setup of [59]. We found perplexity after 20 epochs
to be an excellent indicator of perplexity at convergence; to maximize the number of experiments
we could perform, we report the best validation perplexity after 20 epochs. We study two ways of
implementing DFA: applying the feedback after every encoder block (macro) or after every layer in
those blocks (micro). The input embedding layer receives gradients from the next feedback point
through BP. This leaves some amount of weight transport even in the micro-case.
Table 5: Best validation perplexity after 20 epochs of a Transformer trained on WikiText-103 (lower
is better). The BP and DFA baselines share all hyper-parameters. In Macro the feedback is applied
after every transformer layer, while in Micro the feedback is applied after every sub-layer. The
learning rate of Adam without the learning rate scheduler is <<FORMULA>>. With the scheduler, the initial
learning rate is <<FORMULA>> and it is multiplied by 0.2 when performance plateaus, with a patience of 1.
* score after 22 epochs to let the learning rate scheduler take effect
<<TABLE>>
Results Our results are summarized in Table 5. Hyper-parameters fine-tuned for BP did not fare
well with DFA, but changes in the optimizer narrowed the gap between BP and DFA considerably.
The learning rate schedule used on top of Adam [79] in [59] proved detrimental. Using Adam alone
required reducing the learning rate between BP and DFA. Increasing β_2 from 0.98 [59] to 0.999
improved performance significantly. Finally, a simple scheduler that reduces the learning rate when
the validation perplexity plateaus helped reduce it further. Considering that the perplexity of the
shallow baseline is over 400, DFA is clearly able to train Transformers. However, our results are not
on par with BP, especially in the micro setting. A substantial amount of work remains to make DFA
competitive with BP, even more so in a minimal weight transport scenario. The large performance
improvements brought by small changes in the optimizer indicate that intensive fine-tuning, common
in publications introducing state-of-the-art results, could close the gap between BP and DFA.
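The plateau-based scheduler mentioned above (learning rate multiplied by 0.2 when validation performance plateaus, with a patience of 1) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the class name `PlateauScheduler`, the convention that a reduction happens once the number of non-improving epochs exceeds the patience, the initial learning rate, and the validation values are all assumptions. A `perplexity` helper is included because perplexity is the exponential of the mean cross-entropy (in nats).

```python
import math

class PlateauScheduler:
    """Multiply the learning rate by `factor` when the metric stops improving."""

    def __init__(self, lr, factor=0.2, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = math.inf          # lower is better (e.g. perplexity)
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's validation metric and return the current rate."""
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:   # assumed trigger convention
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

def perplexity(mean_cross_entropy_nats):
    """Perplexity from the mean per-token cross-entropy, in nats."""
    return math.exp(mean_cross_entropy_nats)

# Hypothetical validation perplexities over five epochs.
scheduler = PlateauScheduler(lr=1e-4)             # initial rate is a placeholder
history = [scheduler.step(p) for p in [60.0, 50.0, 51.0, 50.5, 49.0]]
```

In this toy run the rate is cut once, after two consecutive epochs without improvement, and then stays constant when the metric improves again.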
4 Conclusion and outlooks
We conducted an extensive study demonstrating the ability of DFA to train modern architectures. We
considered a broad selection of domains and tasks, with complex models featuring graph convolutions
and attention. Our results on large networks like NeRF and Transformers are encouraging, suggesting
that with further tuning, such leading architectures can be effectively trained with DFA. Future work
on principled training with DFA, in particular regarding the influence of common practices and
whether new procedures are required, will help close the gap with BP.
More broadly, we verified for the first time that learning under synaptic asymmetry is possible beyond
fully-connected layers, and in tasks significantly more difficult than previously considered. This
addresses a notable concern in biologically-plausible architectures. DFA still requires an implausible
global feedback pathway; however, local training has already been demonstrated at scale. The next
step towards biologically-compatible learning is a local method without weight transport.
While the tasks and architectures we have considered are not biologically inspired, they constitute
a good benchmark for behavioral realism [20]. Any learning algorithm claiming to approximate
the brain should reproduce its ability to solve complex and unseen tasks. Furthermore, even though
the current implementation of mechanisms like attention is devoid of biological considerations, they
represent broader concepts applicable to human brains [80]. Understanding how our brain learns is a
gradual process, and future research could incorporate further realistic elements, like spiking neurons.
Finally, unlocking the backward pass in large architectures like Transformers is promising. More
optimized implementations of DFA, built at a lower level of existing ML libraries, could unlock significant
speed-up. Leveraging the use of a single random projection as the cornerstone of training, dedicated
accelerators may employ more exotic hardware architectures. This will open new possibilities in the
asynchronous training of massive models.
Broader Impact
Of our survey This study is the first experimental validation of DFA as an effective training method
in a wide range of challenging tasks and neural networks architectures. This significantly broadens the
applications of DFA, and more generally brings new insight on training techniques alternative to back-
propagation. From neural rendering and recommender systems, to natural language processing or
geometric learning, each of these applications has its own potential impact. Our task selection process
was motivated by current trends in deep learning, as well as by technically appealing mechanisms
(graph convolutions, attention). A limitation of our survey is that our (arguably biased) selection of tasks
cannot be exhaustive. Our experiments required substantial cloud compute resources, with state-of-
the-art GPU hardware. Nevertheless, as this study provides new perspectives for hardware accelerator
technologies, it may favor the application of neural networks in fields previously inaccessible because
of computational limits. Future research on DFA should continue to demonstrate its use in novel
contexts of interest as they are discovered.
Of the considered applications Each of the applications considered in our study has a wide
potential impact; consider, for example, the impact of textual bias in pretrained word embeddings [81].
We refer to [82] and references therein for a discussion of ethical concerns of AI applications.
Of DFA as a training method DFA enables parallelization of the backward pass and places a
single operation at the center of the training process, opening the prospect of reducing the power
consumption of training chips by an order of magnitude [28]. Not only is more efficient training a
path to more environmentally responsible machine learning [83], but it may lower the barrier of entry,
supporting equality and sustainable development goals. A significant downside of moving from BP to
DFA is a far more limited understanding of how to train models and how the trained models behave.
There is a clear empirical understanding of the impact of techniques such as batch normalization
or skip connections on the performance of BP; new insights need to be obtained for DFA. BP also
enjoys decades of work on topics like adversarial attacks, interpretability, and fairness. Much of
this work has to be cross-checked for alternative training methods, something we encourage further
research to consider as the next step towards safely and responsibly scaling up DFA.
Of biologically motivated methods Finally, a key motivation for this study was to demonstrate that
learning challenging tasks is possible without weight transport. Biologically motivated methods
are a more foundational research direction, and as such the possible long-term impact of our findings
is harder to estimate under this light. However, fundamental research of this kind is important to open
new pathways for ML and neuroscience.
Acknowledgments and Disclosure of Funding
We thank Igor Carron and Laurent Daudet for the general guidance on the subject of this investigation
and the insightful comments, as well as the larger LightOn team for their support.
References
[1] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. PhD thesis, Harvard University, 1974.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. In Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
[3] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves,
David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients.
In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
1627–1635, 2017.
[4] Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.
[5] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep
learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.
[6] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance.
Cognitive science, 11(1):23–63, 1987.
[7] Javier R Movellan. Contrastive hebbian learning in the continuous hopfield model. In Connectionist
models, pages 10–17. Elsevier, 1991.
[8] Randall C O'Reilly. Biologically plausible error-driven learning using local activation differences:
The generalized recirculation algorithm. Neural computation, 8(5):895–938, 1996.
[9] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In Artificial intelligence
and statistics, pages 448–455, 2009.
[10] Yann Le Cun. Learning process in an asymmetric threshold network. In Disordered systems
and biological organization, pages 233–240. Springer, 1986.
[11] Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target
propagation. arXiv preprint arXiv:1407.7906, 2014.
[12] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation.
In Joint european conference on machine learning and knowledge discovery in databases,
pages 498–515. Springer, 2015.
[13] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic
feedback weights support error backpropagation for deep learning. Nature communications,
7(1):1–10, 2016.
[14] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can
scale to imagenet. In International Conference on Machine Learning, pages 583–593, 2019.
[15] Wojciech M Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan
Pascanu. Sobolev training for neural networks. In Advances in Neural Information Processing
Systems, pages 4278–4287, 2017.
[16] Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In
International Conference on Machine Learning, pages 4839–4850, 2019.
[17] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information
estimation and maximization. In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=Bklr3j0cKX.
[18] Sindy Löwe, Peter O'Connor, and Bastiaan Veeling. Putting an end to end-to-end: Gradient-isolated
learning of representations. In Advances in Neural Information Processing Systems,
pages 3033–3045, 2019.
[19] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In
Advances in neural information processing systems, pages 1037–1045, 2016.
[20] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy
Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and
architectures. In Advances in Neural Information Processing Systems, pages 9368–9378, 2018.
[21] Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton.
Backpropagation and the brain. Nature Reviews Neuroscience, pages 1–12, 2020.
[22] Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a hebbian learning rule.
Annu. Rev. Neurosci., 31:25–46, 2008.
[23] Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with
direct feedback alignment. arXiv preprint arXiv:1906.04554, 2019.
[24] Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in
backpropagation? In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[25] Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep
convolutional networks. arXiv preprint arXiv:1812.06488, 2018.
[26] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning
algorithms can scale to large datasets. In International Conference on Learning Representations,
2019. URL https://openreview.net/forum?id=SygvZ209F7.
[27] Mohamed Akrout, Collin Wilson, Peter C Humphreys, Timothy Lillicrap, and Douglas Tweed.
Using weight mirrors to improve feedback alignment. arXiv preprint arXiv:1904.05391, 2019.
[28] Julien Launay, Iacopo Poli, Kilian Müller, Igor Carron, Laurent Daudet, Florent Krzakala, and
Sylvain Gigan. Light-in-the-loop: using a photonics co-processor for scalable training of neural
networks, 2020.
[29] Charlotte Frenkel. Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling
Roads to Embedded Cognition. PhD thesis, UCL-Université Catholique de Louvain, 2020.
[30] Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis. ACM Transactions on
Graphics (TOG), 36(6):1–11, 2017.
[31] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck,
Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
2367–2376, 2019.
[32] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi
Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis
with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14,
2019.
[33] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael
Zollhöfer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
[34] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and
Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM
Transactions on Graphics (TOG), 38(4):65, 2019.
[35] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks:
Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information
Processing Systems, pages 1119–1130, 2019.
[36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi,
and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv
preprint arXiv:2003.08934, 2020.
[37] H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady,
Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view
from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 1222–1230, 2013.
[38] Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data
Mining, pages 995–1000. IEEE, 2010.
[39] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye,
Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for
recommender systems. In Proceedings of the 1st workshop on deep learning for recommender
systems, pages 7–10, 2016.
[40] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a
factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247,
2017.
[41] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click
predictions. In Proceedings of the ADKDD'17, ADKDD'17, New York, NY, USA, 2017.
Association for Computing Machinery. ISBN 9781450351942. doi: 10.1145/3124749.3124754.
URL https://doi.org/10.1145/3124749.3124754.
[42] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning
adaptive-order feature interactions. In Thirty-Fourth AAAI Conference on Artificial Intelligence,
2020.
[43] J Wesley Hines. A logarithmic neural network architecture for unbounded non-linear function
approximation. In Proceedings of International Conference on Neural Networks (ICNN'96),
volume 2, pages 1245–1250. IEEE, 1996.
[44] Criteo. Kaggle contest dataset is now available for academic use! http://labs.criteo.com/
2014/09/kaggle-contest-dataset-now-available-academic-use/, 2014. accessed
on the 2020-05-20.
[45] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much
progress? a worrying analysis of recent neural recommendation approaches. In Proceedings of
the 13th ACM Conference on Recommender Systems, pages 101–109, 2019.
[46] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34
(4):18–42, 2017.
[47] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally
connected networks on graphs. In International Conference on Learning Representations, pages
httpopenreview, 2014.
[48] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured
data. arXiv preprint arXiv:1506.05163, 2015.
[49] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks
on graphs with fast localized spectral filtering. In Advances in neural information processing
systems, pages 3844–3852, 2016.
[50] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In International Conference on Learning Representations (ICLR), 2017.
[51] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph
domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.,
volume 2, pages 729–734. IEEE, 2005.
[52] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[53] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural
networks. In International Conference on Learning Representations, 2016.
[54] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric
deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 869–877, 2018.
[55] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks. In International Conference on Learning Representations,
2018. URL https://openreview.net/forum?id=rJXMpikCZ.
[56] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. In 3rd International Conference on Learning Representations,
ICLR 2015, 2015.
[57] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In International Conference on Machine Learning, 2018.
[58] Matthias Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. In
ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[60] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric.
In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[61] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad.
Collective classification in network data. AI magazine, 29(3):93–93, 2008.
[62] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In International Conference on Learning Representations, 2019. URL
https://openreview.net/forum?id=ryGs6iA5Km.
[63] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine
learning research, 9(Nov):2579–2605, 2008.
[64] David M Chan, Roshan Rao, Forrest Huang, and John F Canny. Gpu accelerated t-distributed
stochastic neighbor embedding. Journal of Parallel and Distributed Computing, 131:1–13,
2019.
[65] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian
Deep Learning, 2016.
[66] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding with unsupervised learning. Technical report, OpenAI, 2018.
[67] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku,
and Dustin Tran. Image transformer. ArXiv, abs/1802.05751, 2018.
[68] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya
Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
[69] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis,
Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
URL https://www.aclweb.org/anthology/N19-1423.
[70] Mohammad Shoeybi, Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and
Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model
parallelism. ArXiv, abs/1909.08053, 2019.
[71] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[72] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas,
November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL
https://www.aclweb.org/anthology/D16-1264.
[73] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable
questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia,
July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL
https://www.aclweb.org/anthology/P18-2124.
[74] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose
language understanding systems. In Advances in Neural Information Processing Systems, pages
3261–3275, 2019.
[75] The Common Crawl Team. Common Crawl. https://commoncrawl.org, 2020.
[76] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification.
In ACL. Association for Computational Linguistics, 2018. URL
http://arxiv.org/abs/1801.06146.
[77] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
models. ArXiv, abs/1609.07843, 2017.
[78] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
words with subword units. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany,
August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL
https://www.aclweb.org/anthology/P16-1162.
[79] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International
Conference on Learning Representations, 12 2014.
[80] Grace W Lindsay. Attention in psychology, neuroscience, and machine learning. Frontiers in
Computational Neuroscience, 14:29, 2020.
[81] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.
Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In
Advances in neural information processing systems, pages 4349–4357, 2016.
[82] Alexandra Luccioni and Yoshua Bengio. On the morality of artificial intelligence. arXiv preprint
arXiv:1912.11945, 2019.
[83] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for
deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
[84] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer:
Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743, 2020.
[85] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao,
and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint
arXiv:1908.03265, 2019.
[86] Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns
in transformer-based machine translation. arXiv preprint arXiv:2002.10260, 2020.
[87] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas
Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style,
high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32,
pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/
9015-pytorch-an-imperative-style-high-performance-deep-learning-library.
pdf.
Appendix
We first provide additional elements to corroborate our findings: alignment measurement (Section
A), and shallow baselines (Section B). We then discuss the process of adapting the considered
architectures for DFA (Section C), and the issue of weight transport in attention layers (Section D).
We provide some supplementary results for NeRF (Section E), including details of performance on
each scene of each dataset, and a discussion on possible mitigation of DFA shortcomings. Finally,
we outline steps necessary for reproduction of this work (Section F).
A Alignment
Alignment measurement In feedback alignment methods, the forward weights learn to align with
the random backward weights, making the delivered updates useful. This alignment can be quantified
by measuring the cosine similarity between the gradient signal delivered by DFA, $B_i \delta a_y$, and the
gradient signal BP would have delivered, $W_{i+1}^T \delta a_{i+1}$. For learning to occur and DFA to work as
a training method, there must be alignment. This can be measured numerically [23]. Measuring
alignment allows us to check whether or not the layers are effectively being trained by DFA, regardless
of performance metrics. We note that any alignment value greater than 0 signifies that learning is
occurring. Values closer to 1 indicate a better match with BP, but small alignment values are sufficient
to enable learning. We report values measured at the deepest DFA layer.
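As an illustration of the measurement described above, the following minimal sketch computes a per-sample cosine similarity between a DFA signal and a BP signal and reports its mean and standard deviation over a batch. It is not the authors' measurement code: the function name `alignment`, the layer sizes, and the random stand-in tensors are assumptions, and the hidden layer is assumed to sit directly below the output layer so that the BP signal reduces to the transposed forward weights applied to the output error.

```python
import numpy as np

def alignment(dfa_signal, bp_signal, eps=1e-12):
    """Per-sample cosine similarity between two signals of shape (batch, features)."""
    dot = np.sum(dfa_signal * bp_signal, axis=1)
    norms = np.linalg.norm(dfa_signal, axis=1) * np.linalg.norm(bp_signal, axis=1)
    return dot / (norms + eps)

rng = np.random.default_rng(0)
batch, n_hidden, n_out = 512, 64, 10          # hypothetical sizes

B = rng.standard_normal((n_hidden, n_out))    # fixed random feedback matrix (DFA)
W = rng.standard_normal((n_out, n_hidden))    # forward weights of the layer above (BP)
err = rng.standard_normal((batch, n_out))     # stand-in for the output error

dfa = err @ B.T    # row s holds B applied to the error of sample s
bp = err @ W       # row s holds W^T applied to the error of sample s

cos = alignment(dfa, bp)
print("mean alignment:", cos.mean(), "std:", cos.std())
```

With untrained random matrices as above, the mean is expected to hover around zero; in a network trained with DFA, the forward weights would be the trained ones and a positive mean would indicate that learning is taking place.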
Recommender systems We measure alignment on the Criteo dataset, in the two architectures
featuring non-conventional fully-connected layers: Deep & Cross and AFN. Alignment is measured
after 15 epochs of training, and averaged over a random batch of 512 samples. Results are reported in
Table A.1. These alignment measurements indicate that learning is indeed occurring in the cross and
logarithmic layers. The high variance of alignment in the cross layers is unique: it may be explained by
the absence of non-linearity, and may account for the difference in performance between BP and DFA on
this architecture, which is higher than on the others.
Table A.1: Alignment cosine similarity (higher is better, standard deviation in parentheses) of
recommender systems as measured on the Criteo dataset. Learning occurs in both architectures, and
high variance may explain the larger performance gap on Deep & Cross compared to other methods.
<<TABLE>>
Graph convolutions We measure alignment on the Cora dataset, after 250 epochs of training,
averaging values over every sample available (train, validation, and test splits included). Results are
reported in Table A.2. We observe high alignment values in all architectures, indicating that learning
is indeed occurring. Slightly lower values in SplineConv and GATConv may be explained by the use
of the Exponential Linear Unit (ELU) instead of the Rectified Linear Unit (ReLU) used as activation
in other architectures.
Table A.2: Alignment cosine similarity (standard deviation in parentheses) of various graph
convolution architectures as measured on the Cora dataset. These values corroborate that DFA successfully
trains all architectures considered.
<<TABLE>>
B Shallow baselines
Figure A.1: Comparisons of Tiny-NeRF trained with BP, DFA, and a shallow approach. Shallow
training is insufficient to learn scene geometry. Lego scene from the NeRF synthetic dataset.
<<FIGURE>>
Shallow learning We compare DFA to BP, but also to shallow learning, where only the topmost
layer is trained. While DFA may not reach the performance level of BP, it should still vastly
outperform shallow learning: failure to do so would mean that the weight updates delivered by DFA
are useless. On a simple task like MNIST, a shallow baseline may be as high as 90%. However, given
the difficulty of the tasks we consider, the shallow baseline is here usually much lower.
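To make the notion of a shallow baseline concrete, here is a minimal sketch on a toy regression problem: the hidden layer keeps its random initialization and only the topmost layer receives gradient updates. The data, layer sizes, learning rate, and number of steps are illustrative assumptions and have nothing to do with the experiments reported in this appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))                 # toy inputs (assumption)
y = np.sin(X.sum(axis=1, keepdims=True))          # toy regression target (assumption)

W1 = rng.standard_normal((8, 32)) * 0.5           # hidden layer: frozen at its random init
W2 = np.zeros((32, 1))                            # topmost layer: the only one trained
W1_init = W1.copy()

def loss(W1, W2):
    """Mean squared error of the two-layer network on the toy data."""
    return float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))

loss_before = loss(W1, W2)
lr = 0.05
for _ in range(200):
    h = np.tanh(X @ W1)                           # random features, never updated
    err = h @ W2 - y
    W2 -= lr * (h.T @ err) / len(X)               # gradient step on the top layer only

loss_after = loss(W1, W2)
print(loss_before, loss_after)
```

Because the lower layer is never updated, the model can only fit what is linearly readable from random features, which is why such a baseline is expected to fall well short of BP and DFA on hard tasks.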
NeRF Because NeRF models are expensive to train (up to 15 hours on a V100), we consider a
simplified setup for the shallow baseline, NeRF-Tiny. This setup operates at half the full resolution
of the training images available, runs for 5000 iterations only, and does away with view-dependent
characteristics. Furthermore, the network is cut down to 3 layers of half the width of NeRF, and
no coarse network is used to inform the sampling. We train this network on the Lego scene of the
NeRF-Synthetic dataset, and compare results.
Figure A.1 presents renders generated by NeRF-Tiny trained with BP, DFA, and a shallow approach.
While BP and DFA deliver similar renders, shallow training fails to reproduce even basic scene
geometry, instead outputting a diffuse cloud of colors. This highlights that while DFA may not reach
a level of performance on par with BP on NeRF, it nonetheless delivers meaningful updates enabling
the learning of complex features.
Recommender systems Because recommender systems require fine-tuning, we perform the same
hyperparameter search for shallow learning as for DFA and BP. Results are detailed in Table A.3.
Performance of shallow training is always well under BP and DFA (remember that 0.001-level differences matter
in recommender systems). In particular, in Deep & Cross, where there was the biggest gap between
BP and DFA, the performance of the shallow method is extremely poor, well below the FM baseline.
Finally, it is expected to see that DeepFM recovers more or less the performance of FM even with a
shallow baseline.
Table A.3: Shallow baseline for recommender system models on the Criteo dataset. Performance is
always well below BP and DFA, as expected.
<<TABLE>>
Graph convolutions We use the same hyperparameters as for DFA to produce the shallow baseline
on graph datasets. Results are reported in Table A.4. Performance is always much worse than BP
and DFA. GATConv recovers the best performance: random attention layers may still deliver useful
features [84], as do random convolutions.
Table A.4: Shallow baseline for GCNNs on Cora, CiteSeer, and PubMed [61]. Performance is always
well below BP and DFA.
<<TABLE>>
Transformers In the baseline setting (optimizer and hyper-parameters of [59]), a Transformer
trained in the shallow regime yields a perplexity of 428 on WikiText-103. We do not consider
other settings, as the cost of training a Transformer is high and we do not expect any meaningful
improvements, as with NeRF above.
C Adapting architectures to DFA
NeRF We use an architecture identical to the one used in [36], but based on the effective code
implementation rather than the description in the paper 1 . During our tests, we have found that
lowering the learning rate to <<FORMULA>> rather than <<FORMULA>> works best with DFA.
Recommender systems For all training methods (BP, DFA, and shallow), we have conducted
independent hyperparameter searches. We performed a grid search over the learning rate, from
<<FORMULA>> to <<FORMULA>> in <<FORMULA>> steps, as well as over the dropout probability, from <<FORMULA>> to <<FORMULA>> in <<FORMULA>> steps
(where applicable). On DeepFM, this search leads us to reduce the learning rate from <<FORMULA>> with BP
to <<FORMULA>> with DFA, but to keep the 0.5 dropout rate. On Deep & Cross, we reduce the learning rate
from <<FORMULA>> to <<FORMULA>>, with no dropout in both cases. In AFN, we reduce the learning rate from <<FORMULA>> to
<<FORMULA>> and dropout from 0.3 to 0.
Graph convolutions We manually test a few hyperparameter configurations on the Cora dataset,
focusing on learning rate, weight decay, and dropout. We do not consider architectural changes, such
as changing the number of filters or of attention heads. For ChebConv and GraphConv, we reduce
weight decay to <<FORMULA>> instead of <<FORMULA>>, and set the dropout rate to 0 and 0.1 respectively, instead
of 0.5 with BP. For SplineConv, we find that no change in the hyperparameters is necessary. For
GATConv, we reduce weight decay to <<FORMULA>> instead of <<FORMULA>> and reduce the dedicated dropout layer
to 0.1 instead of 0.6 but keep the 0.6 dropout rate within the GAT layer. Finally, on DNAConv we
disable weight decay entirely, instead of an original value of <<FORMULA>>, double the learning rate from
<<FORMULA>> to <<FORMULA>>, and disable dropout entirely. In all cases, we share the backward random matrix
across all nodes in a graph.
Transformers The model hyper-parameters were fixed across all of our experiments, except for
dropout and, in one case specified below, the number of attention heads. We tested several
values of dropout probability between 0 and 0.5, but found the original value of 0.1 to perform
best. We manually tested a number of optimizers, optimizer parameters and attention mechanisms.
We tested four combinations of optimizers and schedulers: Adam with the scheduler used in [59],
Adam alone, RAdam [85] alone, and Adam with a scheduler that reduces the learning rate when
the validation perplexity plateaus. We found it necessary to reduce the initial learning rate of Adam
from <<FORMULA>> to <<FORMULA>>, although it could be set back to <<FORMULA>> with a scheduler. We tried two values
of $\beta_2$: 0.98 and 0.999. We also tried to change <<FORMULA>> and observed some small differences that were
not significant enough for the main text. Finally, we tried three attention mechanisms in addition to
the standard multihead scaled dot-product attention: the dense and random (learnable) Synthesizers
of [84], as well as the fixed attention patterns of [86]. The latter needed to be adapted to language
modelling to prevent attending to future tokens, which led us to reduce the number of attention
heads to 4. The backward random matrix is always shared across all tokens and batches.
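The sharing of the backward random matrix across tokens and batches can be sketched as a single matrix applied along the last axis of the error tensor. This is only an illustration under assumed shapes: the sizes, the scaling of `B`, and the random stand-in for the output error are hypothetical, not values taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, tokens, d_model, vocab = 4, 16, 32, 100     # hypothetical sizes

# One fixed random feedback matrix, reused for every token of every sequence.
B = rng.standard_normal((d_model, vocab)) / np.sqrt(vocab)

# Stand-in for the per-token error at the output of the network.
err = rng.standard_normal((batch, tokens, vocab))

# np.matmul broadcasts over the leading (batch, tokens) axes, so the same
# projection B is applied independently to each token's error vector.
feedback = err @ B.T                               # shape (batch, tokens, d_model)

print(feedback.shape)
```

Sharing one matrix keeps the memory cost of the feedback path independent of sequence length and batch size, since no per-token or per-sample projection has to be stored.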
D Weight transport and attention
We consider an attention layer operating on input $x$. The queries, keys, and values are respectively
<<FORMULA>>, and <<FORMULA>> is the dimension of the queries and keys. The layer
performs:
<<FORMULA>> (4)
When using DFA on attention, we deliver the random feedback to the top of the layer. Accordingly,
to obtain updates to $W_Q$, $W_K$, and $W_V$, we still have to backpropagate through the attention
mechanism itself. This involves weight transport on $W_V$, sacrificing some biological realism for
simplicity. Overall weight transport between layers still does not occur, and updating the layers in
parallel remains possible.
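For reference, the forward pass of a single-head attention layer can be sketched as follows, assuming the standard scaled dot-product form of [59]; since the equation itself is not reproduced in this text, that form is an assumption, as are the function names and the sizes used in the example. The comments point out where the values, and hence the value projection, enter the path along which gradients for the queries and keys would flow.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention on x of shape (tokens, d_in)."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))   # (tokens, tokens), rows sum to 1
    # The output mixes the values V = x W_V. A gradient arriving at `out` reaches
    # `scores` (and from there Q and K) only after multiplication by V transposed,
    # so W_V is involved when updating W_Q and W_K.
    out = scores @ V
    return out, scores

rng = np.random.default_rng(0)
tokens, d_in, d_k = 5, 8, 4                    # hypothetical sizes
x = rng.standard_normal((tokens, d_in))
W_Q, W_K, W_V = (rng.standard_normal((d_in, d_k)) for _ in range(3))

out, scores = attention(x, W_Q, W_K, W_V)
print(out.shape, scores.sum(axis=1))
```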
Besides using FA or DFA within the attention layer, alternative mechanisms like the synthesizer
[84] (which uses random attention in place of the query and key system) or fixed attention [86] can
remove the need for weight transport. Implementing these mechanisms in DFA-trained Transformers,
or other attention-powered architectures, will require further research.
E Supplementary NeRF results
Quantitative results We report per-scene scores for each dataset in Table A.5. BP values are taken
from [36]. On three scenes of the synthetic datasets, NeRF-DFA even outperforms past state-of-the-art
methods trained with BP. Note that Neural Volumes (NV) is not applicable to forward-facing view
synthesis, as is required in LLFF-Real, and thus no results are reported.
Qualitative results We report sample renders from the NeRF-Synthetic dataset (Figure A.2) and
the LLFF-Real dataset (Figure A.3), for every scene available. However, we recommend that readers
consult the supplementary video to make better sense of characteristics like multi-view consistency
and view-dependent effects (most visible on the LLFF-Real Room scene).
Table A.5: Per-scene PSNR for NeRF-DFA and NeRF-BP against other state-of-the-art methods on the
NeRF-Synthetic and LLFF-Real datasets. DFA performance is fairly homogeneous across each dataset and in
line with the differences in other methods.
<<TABLE>>
Possible future directions Despite retranscribing scene geometry in a multi-view consistent way,
NeRF produces renders of a lower quality when trained with DFA instead of BP. In particular, it
struggles to transcribe small-scale details, resulting in "blurry" renders. Moreover, it displays high-
frequency artefacts: not in the scene geometry, but in individual pixels taking values very distant from
their neighborhood. Interestingly, this noise phenomenon is unique to NeRF-DFA: it is not observed
on NeRF-BP with similar PSNR values (achieved during training) or on other methods with similar
or lower PSNR. This leads us to hypothesize this is an aspect unique to DFA, possibly due to the
alignment process. Indeed, DFA creates a bias on the weights, by encouraging them to be "aligned"
with arbitrary values dependent on the random matrix used. It is possible this could introduce
random noise in the final renders, though we leave a more principled experiment to future research.
To attempt to alleviate this issue, we first consider NeRF-Dual. In NeRF-Dual, we average the
pixel-wise prediction between the fine and coarse network, to attempt to remove some of the noise.
To do so, we first still use the coarse network to create a probability distribution for the hierarchical
sampling. Then, we evaluate again both the coarse and fine networks at the locations informed by
this probability distribution. Compared to vanilla NeRF, this requires an extra batch of evaluation of
the coarse network for all rays; roughly speaking, this increases inference time by 30-50% depending
on the coarse network architecture considered. We note that this is not applied during training, so that
training times remain identical.
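The averaging step of NeRF-Dual can be sketched as follows. This is only an illustration of the
pixel-wise mean described above: `coarse_rgb` and `fine_rgb` stand for colours already rendered by
the two networks at the hierarchically sampled locations, and the function name is hypothetical
rather than taken from the released code.

```python
def nerf_dual_pixels(coarse_rgb, fine_rgb):
    """Average the per-pixel RGB predictions of the coarse and fine networks."""
    if len(coarse_rgb) != len(fine_rgb):
        raise ValueError("both renders must cover the same pixels")
    return [
        tuple((c + f) / 2.0 for c, f in zip(c_px, f_px))
        for c_px, f_px in zip(coarse_rgb, fine_rgb)
    ]


# Two pixels: an outlier in the fine render is pulled toward the coarse value,
# while a pixel on which both networks agree is left unchanged.
coarse = [(0.5, 0.5, 0.5), (0.25, 0.5, 0.75)]
fine = [(1.0, 0.0, 0.5), (0.25, 0.5, 0.75)]
dual = nerf_dual_pixels(coarse, fine)
```

The same mean that damps isolated outlier pixels also smooths genuine fine detail, which is
consistent with the extra blur reported for NeRF-Dual.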
Figure A.2 and Figure A.3 showcase comparisons between NeRF and NeRF-Dual trained with DFA
on all scenes. When viewed at high resolution (such as in our supplementary video), the NeRF-Dual
renders are more pleasing, especially for the full scenes. They remove most of the high-frequency
noise, leading to smoother renders. However, this averaging process further blurs small-scale details in
the render. This is especially visible in the NeRF-Synthetic dataset, on scenes like Ficus. Furthermore,
NeRF-Dual introduces novel artefacts in the Mic and Ship scenes, with areas improperly colored
with a violet tint. The cause for these artefacts is unknown, but they show that NeRF-Dual is far from
a silver bullet. The PSNR is also minimally increased, by less than 0.5 per scene. Nevertheless, this
shows some promise for alleviating the shortcomings of NeRF-DFA. It is possible that
changes to the overall rendering process, or the use of classic image processing techniques, may help
enhance the NeRF-DFA images.
Finally, we also experimented with increasing the capacity of the fine network, by widening its layers
to 512 neurons. We call this architecture NeRF-XL. However, we have not succeeded in getting
PSNR values higher than with vanilla NeRF on DFA. In particular, the training process becomes
much more cumbersome, as multi-GPU parallelism is needed to fit the model. It is possible that
higher network capacity may help the network both learn the task at hand and align simultaneously, but
further work is required.
F Reproducibility
Hardware used All main experiments require at most a single NVIDIA V100 GPU with 16GB
of memory to reproduce. Alignment measurements on large architectures (NeRF and Transformers)
require a second identical GPU to keep a copy of the network to evaluate BP gradients.
We estimate that a total of around 10,000 GPU-hours on V100s were necessary for this paper.
Accordingly, we estimate the cloud-computing carbon impact of this paper to be 1700 kgCO2eq (see footnote 2).
However, without hyperparameter searches, our results can be reproduced with less than 500 GPU-
hours on V100s, with most of that budget going to NeRF and Transformers.
Implementation We use the shared random matrix trick from [23] to reduce memory use in DFA
and enable its scaling to large networks. We use PyTorch [87] for all experiments. For reference
implementations of the methods considered, we relied on various sources. Our NeRF implementation
is based on the PyTorch implementation by Krishna Murthy (see footnote 3), with modifications to allow for proper
test and validation, as well as DFA and multi-GPU support. For recommender systems, we use
PyTorch Geometric [60] for all graph operations. Our Transformer implementation is our own.
Our code is available as supplementary material.
NeRF We provide training, testing, and rendering code along with the configurations used to obtain
our results. An example to reproduce our results is given in the supplementary code repository. Given
the computing cost associated with training a NeRF, we also provide our trained models.
Recommender systems We provide bash scripts to reproduce the results in Table 2 and A.3, with
the results of our hyperparameter search. We provide code to reproduce the results in Table A.1.
Graph convolutions We provide the code to reproduce all of our results. Note that the t-SNE
results are not exactly reproducible, as the CUDA implementation used is non-deterministic.
Transformers We provide bash scripts to reproduce Table 5 and the shallow results.
<<FIGURE>>
Figure A.2: Sample renders for every scene of the NeRF-Synthetic dataset, for NeRF and NeRF-Dual
trained with DFA.
<<FIGURE>>
Figure A.3: Sample renders for every scene of the LLFF-Real dataset, for NeRF and NeRF-Dual
trained with DFA.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Efficient Behavior of Small-World Networks
We introduce the concept of efficiency of a network, measuring how efficiently it exchanges information. By using this simple measure, small-world networks are seen as systems that are both globally and locally efficient. This allows us to give a clear physical meaning to the concept of small-world, and also to perform a precise quantitative analysis of both weighted and unweighted networks. We study neural networks and man-made communication and transportation systems and we show that the underlying general principle of their construction is in fact a small-world principle of high efficiency. PACS numbers: 89.70.+c, 05.90.+m, 87.18.Sn, 89.40.+k
We live in a world of networks. In fact any complex system in nature can be modeled as a network, where vertices are the elements of the system and edges represent the interactions between them. Coupled biological and chemical systems, neural networks, social interacting species, computer networks or the Internet are only a few such examples [1]. Characterizing the structural properties of the networks is then of fundamental importance to understand the complex dynamics of these systems. A recent paper [2] has shown that the connection topology of some biological and social networks is neither completely regular nor completely random. These networks, there named small-worlds in analogy with the concept of the small-world phenomenon developed 30 years ago in social psychology [3], are in fact highly clustered like regular lattices, yet have small characteristic path lengths like random graphs. The original paper has triggered a large interest in the study of the properties of small-worlds (see ref. [4] for a recent review). Researchers have focused their attention on different aspects: study of the onset mechanism [5,7], dynamics [8] and spreading of diseases on small-worlds [9], applications to social networks [10,11] and to the Internet [12,13]. In this letter we introduce the concept of efficiency of a network, measuring how efficiently information is exchanged over the network. By using efficiency, small-world networks are seen as systems that are both globally and locally efficient. This formalization gives a clear physical meaning to the concept of small-world, and also allows a precise quantitative analysis of unweighted and weighted networks. We study several systems, like brains, communication and transportation networks, and show that the underlying general principle of their construction is in fact a small-world principle, provided attention is taken not to ignore an important observational property (closure).
We start by reexamining the original formulation proposed in ref. [2]. There, a generic graph G with N vertices and K edges is considered. G is assumed to be unweighted, i.e. edges are all equal, sparse (<<K ≪ N(N-1)/2>>), and connected, i.e. there exists at least one path connecting any couple of vertices with a finite number of steps. G is therefore represented by simply giving the adjacency (or connection) matrix, i.e. the N x N matrix whose entry a_ij is 1 if there is an edge joining vertex i to vertex j, and 0 otherwise. An important quantity of G is the degree of vertex i, i.e. the number k_i of edges incident with vertex i (the number of neighbors of i). The average value of k_i is <<k = 2K/N>>. Once {a_ij} is given it can be used to calculate the matrix of the shortest path lengths d_ij between two generic vertices i and j. The fact that G is assumed to be connected implies that d_ij is positive and finite ∀ i ≠ j. In order to quantify the structural properties of G, [2] proposes to evaluate two different quantities: the characteristic path length L and the clustering coefficient C. L is the average distance between two generic vertices, <<L = 1/(N(N-1)) Σ_{i≠j} d_ij>>, and C is a local property defined as <<C = 1/N Σ_i C_i>>. Here C_i is the number of edges existing in G_i, the subgraph of the neighbors of i, divided by the maximum possible number <<k_i(k_i-1)/2>>. In [2] a simple method is considered to produce a class of graphs with increasing randomness. The initial graph G is taken to be a one-dimensional lattice with each vertex connected to its k neighbors and with periodic boundary conditions. Rewiring each edge at random with probability p, G can be tuned in a continuous way from a regular lattice (p = 0) into a random graph (p = 1). For the regular lattice we expect <<FORMULA>> and a high clustering coefficient <<FORMULA>>, while for a random graph <<FORMULA>> and <<FORMULA>> [14,5]. Although in the two limit cases a large C is associated to a large L and vice versa a small C to a small L, the numerical experiment reveals an intermediate regime at small p where the system is highly clustered like regular lattices, yet has small characteristic path lengths like random graphs. This behavior is there called small-world and it is found to be a property of some social and biological networks analyzed [2].
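To make the two quantities of [2] concrete, the sketch below computes L and C for a small unweighted ring lattice in plain Python. It assumes L is the mean shortest-path length over ordered pairs i ≠ j and C is the mean of the per-vertex ratios C_i; the function names, the adjacency-set representation and the toy sizes (N = 20, k = 4) are illustrative choices of ours, not the paper's.

```python
from collections import deque


def ring_lattice(n, k):
    """Ring of n vertices, each linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def characteristic_path_length(adj):
    """L: mean shortest-path length over ordered pairs i != j (connected graph)."""
    n = len(adj)
    total = sum(sum(bfs_distances(adj, i).values()) for i in adj)
    return total / (n * (n - 1))


def clustering_coefficient(adj):
    """C: mean over vertices of (edges among neighbours) / (k_i (k_i - 1) / 2)."""
    total = 0.0
    for i, nbrs in adj.items():
        k_i = len(nbrs)
        if k_i < 2:
            continue  # C_i taken as 0 with fewer than two neighbours (a convention)
        links = sum(1 for u in nbrs for v in adj[u] if v in nbrs) / 2
        total += links / (k_i * (k_i - 1) / 2)
    return total / len(adj)


lattice = ring_lattice(20, 4)
L = characteristic_path_length(lattice)
C = clustering_coefficient(lattice)
```

For this lattice each vertex sees its four neighbours joined by three of the six possible edges, so C = 0.5, while L = 55/19 (about 2.89) and grows with N.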
Now we propose a more general set-up to investigate real networks. We will show that: (i) the definition of small-world behavior can be given in terms of a single variable with a physical meaning, the efficiency E of the network; (ii) 1/L and C can be seen as first approximations of E evaluated, respectively, on a global and on a local scale; (iii) we can drop all the restrictions on the system, like unweightedness, connectedness and sparseness. We represent a real network as a generic weighted (and possibly even non sparse and non connected) graph G. Such a graph needs two matrices to be described: the adjacency matrix {a_ij} defined as for the unweighted graph, and the matrix {l_ij} of physical distances. The number l_ij can be the space distance between the two vertices or the strength of their possible interaction: we suppose l_ij to be known even if in the graph there is no edge between i and j. To make some examples, l_ij can be the geographical distance between stations in transportation systems (in such a case l_ij respects the triangle inequality, though this is not a necessary assumption), the time taken to exchange a packet of information between routers in the Internet, or the inverse velocity of chemical reactions along a direct connection in a biological system. Of course, in the particular case of an unweighted graph <<l_ij = 1 ∀ i ≠ j>>. The shortest path length d_ij between two generic points i and j is the smallest sum of the physical distances throughout all the possible paths in the graph from i to j. The matrix {d_ij} is therefore calculated by using the information contained both in matrix {a_ij} and in matrix {l_ij}. We have <<d_ij ≥ l_ij ∀ i, j>>, the equality being valid when there is an edge between i and j. Let us now suppose that the system is parallel, i.e. every vertex sends information concurrently along the network, through its edges.
The efficiency ε_ij in the communication between vertex i and j can then be defined to be inversely proportional to the shortest distance: <<ε_ij = 1/d_ij ∀ i, j>>. When there is no path in the graph between i and j, <<d_ij = +∞>> and consistently <<ε_ij = 0>>. The average efficiency of G can be defined as:
<<E(G) = 1/(N(N-1)) Σ_{i≠j∈G} ε_ij = 1/(N(N-1)) Σ_{i≠j∈G} 1/d_ij>> (1)
To normalize E we consider the ideal case G_id in which the graph G has all the <<N(N-1)/2>> possible edges. In such a case the information is propagated in the most efficient way since <<d_ij = l_ij ∀ i, j>> (every shortest path length equals the physical distance), and E assumes its maximum value <<E(G_id) = 1/(N(N-1)) Σ_{i≠j∈G} 1/l_ij>>. The efficiency <<E(G)>> considered in the following of the paper is always divided by <<E(G_id)>> and therefore <<0 ≤ E(G) ≤ 1>>. Though the equality E = 1 is valid when there is an edge between each couple of vertices, real networks can reach a high value of E.
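The normalized efficiency can be sketched in a few lines of plain Python. The example assumes the efficiency of a pair is 1/d_ij (0 when no path exists), computes the shortest paths with the Floyd-Warshall algorithm mentioned in note [16], and divides by the efficiency of the ideal graph in which every pair is joined directly at its physical distance. The function names and the three-station toy network are hypothetical.

```python
import math


def shortest_paths(n, edges):
    """Floyd-Warshall on an undirected weighted graph; edges maps (i, j) -> length."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), length in edges.items():
        d[i][j] = d[j][i] = min(d[i][j], length)
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d


def average_efficiency(dist):
    """Mean of 1/d_ij over ordered pairs i != j; unreachable pairs contribute 0."""
    n = len(dist)
    total = sum(1.0 / dist[i][j] for i in range(n) for j in range(n)
                if i != j and dist[i][j] != math.inf)
    return total / (n * (n - 1))


def global_efficiency(n, edges, physical):
    """E(G) / E(G_id), where G_id has an edge of length l_ij between every pair."""
    return average_efficiency(shortest_paths(n, edges)) / average_efficiency(physical)


# Three stations forming a triangle with physical distances 1, 1 and 1.5.
physical = [[0.0, 1.0, 1.5], [1.0, 0.0, 1.0], [1.5, 1.0, 0.0]]
e_two_links = global_efficiency(3, {(0, 1): 1.0, (1, 2): 1.0}, physical)
e_one_link = global_efficiency(3, {(0, 1): 1.0}, physical)
```

With two of the three links present the network already reaches about 94% of the ideal efficiency (0.9375), whereas removing one more link disconnects a station and the value drops to 0.375; the measure stays finite in both cases.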
In our formalism, we can define the small-world behaviour by using the single measure E to analyze both the local and global behavior, rather than two different variables L and C. The quantity in eq. (1) is the global efficiency of G and we therefore name it E_glob. Since E is defined also for a disconnected graph we can characterize the local properties of G by evaluating for each vertex i the efficiency of G_i, the subgraph of the neighbors of i. We define the local efficiency as the average efficiency of the local subgraphs, <<E_loc = 1/N Σ_{i∈G} E(G_i)>>.
This quantity plays a role similar to the clustering coefficient C. Since <<i ∉ G_i>>, the local efficiency E_loc tells how fault tolerant the system is, that is, how efficient the communication between the first neighbors of i is when i is removed [15]. The definition of small-world can now be rephrased and generalized in terms of the information flow: small-world networks have high E_glob and E_loc, i.e. are very efficient in global and local communication. This definition is valid both for unweighted and weighted graphs, and can also be applied to disconnected and/or non sparse graphs.
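The local efficiency can be illustrated with the following sketch for unweighted graphs, where the ideal subgraph has efficiency 1 and no further normalization is needed. For each vertex i it builds the subgraph of the neighbors of i (without i itself) and averages the efficiencies of these subgraphs. The names and the two toy graphs are ours, and a subgraph with fewer than two vertices is given efficiency 0 by convention.

```python
from collections import deque


def efficiency(adj):
    """Average of 1/d_ij over ordered pairs of an unweighted graph (0 if no path)."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def local_efficiency(adj):
    """E_loc: mean efficiency of the neighbour subgraphs G_i, with i itself removed."""
    total = 0.0
    for i, nbrs in adj.items():
        sub = {u: adj[u] & nbrs for u in nbrs}  # edges among the neighbours of i only
        total += efficiency(sub)
    return total / len(adj)


# A 4-clique stays fully connected when any vertex fails; a star falls apart.
clique = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
e_loc_clique = local_efficiency(clique)
e_loc_star = local_efficiency(star)
```

The clique gives E_loc = 1 and the star gives E_loc = 0: in the star, the neighbors of the hub cannot communicate at all once the hub is removed, which is the fault-tolerance reading of E_loc.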
It is interesting to see the correspondence between our measure and the quantities L and C of [2] (or, correspondingly, <<1/L>> and C). The fundamental difference is that E_glob is the efficiency of a parallel system, where all the nodes in the network concurrently exchange packets of information (such are all the systems in [2], for example), while 1/L measures the efficiency of a sequential system (i.e. only one packet of information goes along the network). 1/L is a reasonable approximation of E_glob when there are not huge differences among the distances in the graph, and this can explain why L works reasonably well in the unweighted examples of [2]. But in general 1/L can significantly depart from E_glob. For instance, in the Internet, having a few computers with an extremely slow connection does not mean that the whole Internet diminishes by far its efficiency: in practice, the presence of such very slow computers goes unnoticed, because the other thousands of computers are exchanging packets among them in a very efficient way. Here 1/L would give a number very close to zero (strictly 0 in the particular case when a computer is disconnected from the others and <<L = +∞>>), while E_glob gives the correct efficiency measure of the Internet. We turn now our attention to the local properties of a network. C is only one among the many possible intuitive measures [10] of how well connected a cluster is. It can be shown that when in a graph most of its local subgraphs G_i are not sparse, then C is a good approximation of E_loc. Summing up, there are not two different kinds of analysis to be done for the global and local scales, but just one with a very precise physical meaning: the efficiency in transporting information.
We now illustrate the onset of the small-world in an unweighted graph by means of the same example used in [2]. A regular lattice with <<N = 1000>> and <<k = 20>> is rewired with probability p and <<E_glob>> and <<E_loc>> are reported in Fig. 1 as functions of p [16]. For <<p = 0>> we expect the system to be inefficient on a global scale (<<E_glob ~ k/N log(N/K)>>) but locally efficient. The situation is inverted for the random graph. In fact at <<p = 1>> E_glob assumes a maximum value of 0.4, meaning 40% of the efficiency of the ideal graph with an edge between each couple of vertices. This comes at the expense of the fault tolerance (<<FORMULA>>).
<<FIGURE>>
FIG. 1. Global and local efficiency for the graph example considered in [2]. A regular lattice with <<N = 1000>> and <<k = 20>> is rewired with probability p. The small-world behavior results from the increase of E_glob caused by the introduction of only a few rewired edges (short cuts), which on the other hand do not affect <<E_loc>>. At <<p ~ 0.1>>, E_glob has almost reached the value of the random graph, though <<E_loc>> has only diminished by very little from the value of 0.82 of the regular lattice. Small worlds have high E_glob and <<E_loc>>.
The small-world behavior appears for intermediate values of p. It results from the fast increase of E_glob (for small p we find a linear increase of E_glob in the logarithmic horizontal scale) caused by the introduction of only a few rewired edges (short cuts), which on the other hand do not affect <<E_loc>>. At <<p ~ 0.1>>, E_glob has almost reached the maximum value of 0.4, though <<E_loc>> has only diminished by very little from the maximum value of 0.82. For an unweighted case the description in terms of network efficiency resembles the approximation given in [2]. In particular we have checked that a good agreement with curves L(p) and C(p) [2] can be obtained by reporting <<1/E_glob(p)>> and <<E_loc(p)>>. Of course in such an example the short cuts connect at almost no cost vertices that would otherwise be much farther apart (because <<l_ij = 1 ∀ i ≠ j>>). On the other hand this is not true when we consider a weighted network. As real networks we consider first different examples of natural systems (neural networks), and then we turn our attention to man-made communication and transportation systems.
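The two limits of the rewiring experiment can be reproduced on a much smaller graph with the sketch below. It is not the procedure used for Fig. 1: the sizes (N = 60, k = 4), the fixed seed, the rewiring rule (each lattice edge keeps one end and moves the other to a uniformly chosen non-neighbor with probability p) and all function names are assumptions made for illustration.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Ring of n vertices, each linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            adj[i].add((i + step) % n)
            adj[(i + step) % n].add(i)
    return adj


def rewire(adj, k, p, rng):
    """Move the far end of each lattice edge with probability p (no loops, no duplicates)."""
    n = len(adj)
    new = {i: set(nbrs) for i, nbrs in adj.items()}
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if j in new[i] and rng.random() < p:
                choices = [m for m in range(n) if m != i and m not in new[i]]
                if choices:
                    m = rng.choice(choices)
                    new[i].discard(j)
                    new[j].discard(i)
                    new[i].add(m)
                    new[m].add(i)
    return new


def efficiency(adj):
    """Average of 1/d_ij over ordered pairs (unweighted; 0 for unreachable pairs)."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def local_efficiency(adj):
    """Mean efficiency of the neighbour subgraphs G_i."""
    return sum(efficiency({u: adj[u] & nbrs for u in nbrs})
               for nbrs in adj.values()) / len(adj)


rng = random.Random(0)
lattice = ring_lattice(60, 4)
random_graph = rewire(lattice, 4, 1.0, rng)
```

Evaluating `efficiency` and `local_efficiency` on `lattice` and on `random_graph` shows the inversion described above: the lattice is locally efficient but globally inefficient, and the fully rewired graph is the opposite. Intermediate values of p can be explored by calling `rewire` with, for example, p = 0.1.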
1) Neural Networks. Thanks to recent experiments neural structures can be studied at several levels of scale. Here we focus first on the analysis of the neuro-anatomical structure of cerebral cortex, and then on a simple nervous system at the level of wiring between neurons. The anatomical connections between cortical areas are of particular importance for their intricate relationship with the functional connectivity of the cerebral cortex [18]. We analyze two databases of cortico-cortical connections in the macaque and in the cat [19]. Tab.1 indicates the two networks are small-worlds [16]: they have high E_glob, respectively 52% and 69% of the efficiency of the ideal graph with an edge between each couple of vertices (just slightly smaller than the best possible values of 57% and 70% obtained in random graphs) and high <<E_loc>>, respectively 70% and 83%, i.e. high fault tolerance [22]. These results indicate that in neural cortex each region is intermingled with the others and has grown following a perfect balance between local necessities (fault tolerance) and wide-scope interactions. Next we consider the neural network of C. elegans, the only case of a nervous system completely mapped at the level of neurons and chemical synapses [23]. Tab.1 shows that this is also a small-world network: C. elegans achieves 50% of both global and local efficiency. Moreover the value of E_glob is similar to <<E_loc>>. This differs from the cortex databases, where fault tolerance is slightly privileged with respect to global communication.
2) Communication Networks. We have considered two of the most important large-scale communication networks present nowadays: the World Wide Web and the Internet. Tab.2 shows that they have relatively high values of E_glob (slightly smaller than the best possible values obtained for random graphs) and <<E_loc>>. Although the WWW is a virtual network and the Internet is a physical network, at a global scale they transport information essentially in the same way (as their values of E_glob are almost equal). At a local scale, the bigger <<E_loc>> in the WWW case can be explained both by the tendency in the WWW to create Web communities (where pages talking about the same subject tend to link to each other), and by the fact that many pages within the same site are often quickly connected to each other by some root or menu page.
3) Transport Networks. Unlike the previous databases, the Boston subway transportation system (MBTA) can be better described by a weighted graph, the matrix {l_ij} being given by the geographical distances between stations. If we consider the MBTA as an unweighted graph we obtain that it is apparently neither locally nor globally efficient (see Tab.3). On the other hand, when we take into account the geographical distances, we have <<E_glob = 0.63>>: this shows the MBTA is a very efficient transportation system on a global scale, only 37% less efficient than the ideal subway with a direct tunnel from each station to the others. Even in the weighted case <<E_loc>> stays low (0.03), indicating a poor local behavior: unlike a neural network, the MBTA is not fault tolerant and damage to a station will dramatically affect the connection between the previous and the next station. The difference with respect to neural networks comes from different needs and priorities in the construction and evolution mechanism: when we build a subway system, the priority is given to the achievement of global efficiency, and not to fault tolerance. In fact a temporary problem in a station can be solved by other means: for example, walking, or taking a bus from the previous to the next station. That is to say, the MBTA is not a closed system: it can be considered, after all, as a subgraph of a wider transportation network. This property is of fundamental importance when we analyze a system: while global efficiency is without doubt the major characteristic, it is closure that somehow leads a system to have high local efficiency (without alternatives, there should be high fault-tolerance). The MBTA is not a closed system, and this explains why, unlike in the case of the brain, fault tolerance is not a critical issue. Indeed, if we increase the precision of the analysis and change the MBTA subway network by taking into account, for example, the Boston Bus System, this extended transportation system is again a small-world network (<<FORMULA>>). Qualitatively similar results, confirming the similarity of construction principles, have been obtained for other undergrounds and for a wider transportation system consisting of all the main airplane and highway connections throughout the world [25]. Considering all the transportation alternatives available at that scale makes the system closed again (there are no other reasonable routing alternatives), and so fault-tolerance comes back as a leading construction principle.
Summing up, the introduction of the efficiency measure allows us to give a definition of small-world with a clear physical meaning, and provides important hints on why the original formulas of [2] work reasonably well in some cases, and where they fail. The efficiency measure allows a precise quantitative analysis of the information flow, and works both in the unweighted abstraction and under the more realistic assumption of weighted networks. Finally, analysis of real data indicates that various existing (neural, communication and transport) networks exhibit the small-world behavior (even, in some cases, when their unweighted abstractions do not), substantiating the idea that the diffusion of small-world networks can be interpreted as the need to create networks that are both globally and locally efficient.
[1] Y. Bar-Yam, Dynamics of Complex Systems (Addison-Wesley, Reading Mass, 1997).
[2] D.J. Watts and S.H. Strogatz, Nature 393, 440 (1998).
[3] S. Milgram, Psychol. Today 2, 60 (1967).
[4] M.E.J. Newman, cond-mat/0001118.
[5] A. Barrat, M. Weigt, Europ. Phys. J. B 13, 547 (2000)
[6] M. Marchiori and V. Latora, Physica A285, 539 (2000).
[7] M. Barthelemy, L. Amaral, Phys. Rev. Lett. 82, 3180 (1999).
[8] L. F. Lago-Fernandez et al, Phys. Rev. Lett. 84, 2758 (2000).
[9] C. Moore and M.E.J. Newman, Phys. Rev. E61, 5678 (2000).
[10] M.E.J. Newman, cond-mat/0011144.
[11] L. A. N. Amaral, A. Scala, M. Barthélemy, and H. E. Stanley, Proc. Natl. Acad. Sci. 97, 11149 (2000).
[12] R. Albert, H. Jeong, and A.-L. Barabási, Nature 401, 130 (1999).
[13] A.-L. Barabási and R. Albert, Science 286, 509 (1999).
[14] B. Bollobás, Random Graphs (Academic, London, 1985).
[15] Our concept of fault tolerance is different from the one adopted in R. Albert, H. Jeong, and A.-L. Barabási, Nature 406, 378 (2000); R. Cohen et al., Phys. Rev. Lett. 85, 2758 (2000), where the authors consider the response of the entire network to the removal of i.
[16] Here and in the following the matrix {d_ij}_{i,j∈G} has been computed by using two different methods: the Floyd-Warshall (O(N^3)) [17] and the Dijkstra algorithm (O(N^2 log N)) [10].
[17] G. Gallo and S. Pallottino, Ann. Oper. Res. 13, 3 (1988).
[18] O. Sporns, G. Tononi, G.M. Edelman, Cerebral Cortex 10, 127 (2000).
[19] J.W. Scannell, Nature 386, 452 (1997).
[20] M.P. Young, Phil. Trans. R. Soc. B252, 13 (1993).
[21] J.W. Scannell, M.P. Young and C. Blakemore, J. Neurosci. 15, 1463 (1995).
[22] E. Sivan, H. Parnas and D. Dolev, Biol. Cybern. 81, 11-23 (1999).
[23] J.G. White et al., Phil. Trans. R. Soc. London B314, 1 (1986).
[24] T.B. Achacoso and W.S. Yamamoto, AY's Neuroanatomy of C. elegans for Computation (CRC Press, FL, 1992).
[25] M. Marchiori and V. Latora, in preparation.
TABLE I. Macaque and cat cortico-cortical connections [19]. The macaque database contains N = 69 cortical areas and K = 413 connections [20]. The cat database has N = 55 cortical areas (including hippocampus, amygdala, entorhinal cortex and subiculum) and K = 564 (revised database and cortical parcellation from [21]). The nervous system of C. elegans consists of N = 282 neurons and K = 2462 links which can be either synaptic connections or gap junctions [24].
<<TABLE>>
TABLE II. Communication networks. Data on the World Wide Web from http://www.nd.edu/~networks contains N = 325729 documents and K = 1090108 links [12], while the Internet database is taken from http://moat.nlanr.net and has N = 6474 nodes and K = 12572 links.
<<TABLE>>
TABLE III. The Boston underground transportation system (MBTA) consists of N = 124 stations and K = 124 tunnels. The matrix {l_ij} of the spatial distances between stations, used for the weighted case, has been calculated using databases from http://www.mbta.com/ and the U.S. National Mapping Division.
<<TABLE>>
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Vivienne Sze,Senior Member, IEEE,Yu-Hsin Chen,Student Member, IEEE,Tien-Ju Yang,Student
Member, IEEE,Joel Emer,Fellow, IEEE
Abstract
Deep neural networks (DNNs) are currently widely representation of an input space. This is different from earlier
used for many artificial intelligence (AI) applications including approaches that use hand-crafted features or rules designed by
While DNNs experts. deliver state-of-the-art accuracy on many AI tasks, it comes at the The superior accuracy of DNNs, however, comes at the cost of high computational complexity. Accordingly, techniques
that enable efficient processing of DNNs to improve energy cost of high computational complexity. While general-purpose
efficiency and throughput without sacrificing application accuracy compute engines, especially graphics processing units (GPUs),
or increasing hardware cost are critical to the wide deployment have been the mainstay for much DNN processing, increasingly of DNNs in AI systems. there is interest in providing more specialized acceleration of This article aims to provide a comprehensive tutorial and the DNN computation. This article aims to provide an overview survey about the recent advances towards the goal of enabling
efficient processing of DNNs. Specifically, it will provide an of DNNs, the various tools for understanding their behavior,
overview of DNNs, discuss various hardware platforms and and the techniques being explored to efficiently accelerate their
architectures that support DNNs, and highlight key trends in computation. reducing the computation cost of DNNs either solely via hardware This paper is organized as follows: design changes or via joint hardware design and DNN algorithm
changes. It will also summarize various development resources that enable researchers and
practitioners to quickly get started in this field, and highlight important benchmarking metrics
and design considerations that should be used for evaluating the rapidly growing number of DNN
hardware designs, optionally including algorithmic co-designs, being proposed in academia and
industry.
The reader will take away the following concepts from this article: understand the key design
considerations for DNNs; be able to evaluate different DNN hardware implementations with
benchmarks and comparison metrics; understand the trade-offs between various hardware
architectures and platforms; be able to evaluate the utility of various DNN design techniques for
efficient processing; and understand recent implementation trends and opportunities.
I. INTRODUCTION
Deep neural networks (DNNs) are currently the foundation for many modern artificial intelligence
(AI) applications [1]. Since the breakthrough application of DNNs to speech recognition [2] and
image recognition [3], the number of applications that use DNNs has exploded. These DNNs are
employed in a myriad of applications from self-driving cars [4], to detecting cancer [5] to
playing complex games [6]. In many of these domains, DNNs are now able to exceed human accuracy.
The superior performance of DNNs comes from its ability to extract high-level features from raw
sensory data after using statistical learning over a large amount of data to obtain an effective
V. Sze, Y.-H. Chen and T.-J. Yang are with the Department of Electrical Engineering and Computer
Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA. (e-mail: sze@mit.edu;
yhchen@mit.edu, tjy@mit.edu)
J. S. Emer is with the Department of Electrical Engineering and Computer Science, Massachusetts
Institute of Technology, Cambridge, MA 02139 USA, and also with Nvidia Corporation, Westford, MA
01886 USA. (e-mail: jsemer@mit.edu)
Section II provides background on the context of why DNNs are important, their history and
applications.
Section III gives an overview of the basic components of DNNs and popular DNN models currently in
use.
Section IV describes the various resources used for DNN research and development.
Section V describes the various hardware platforms used to process DNNs and the various
optimizations used to improve throughput and energy efficiency without impacting application
accuracy (i.e., produce bit-wise identical results).
Section VI discusses how mixed-signal circuits and new memory technologies can be used for
near-data processing to address the expensive data movement that dominates throughput and energy
consumption of DNNs.
Section VII describes various joint algorithm and hardware optimizations that can be performed on
DNNs to improve both throughput and energy efficiency while trying to minimize impact on accuracy.
Section VIII describes the key metrics that should be considered when comparing various DNN
designs.
II. BACKGROUND ON DEEP NEURAL NETWORKS (DNN)
In this section, we describe the position of DNNs in the context of AI in general and some of the
concepts that motivated its development. We will also present a brief chronology of the major
steps in its history, and some current domains to which it is being applied.
A. Artificial Intelligence and DNNs
DNNs, also referred to as deep learning, are a part of the broad field of AI, which is the science
and engineering of creating intelligent machines that have the ability to
achieve goals like humans do, according to John McCarthy, the computer scientist who coined the
term in the 1950s. The relationship of deep learning to the whole of artificial intelligence is
illustrated in Fig. 1.
Fig. 1. Deep Learning in the context of Artificial Intelligence.
Within artificial intelligence is a large sub-field called machine learning, which was defined in
1959 by Arthur Samuel as the field of study that gives computers the ability to learn without
being explicitly programmed. That means a single program, once created, will be able to learn how
to do some intelligent activities outside the notion of programming. This is in contrast to
purpose-built programs whose behavior is defined by hand-crafted heuristics that explicitly and
statically define their behavior.
The advantage of an effective machine learning algorithm is clear. Instead of the laborious and
hit-or-miss approach of creating a distinct, custom program to solve each individual problem in a
domain, the single machine learning algorithm simply needs to learn, via a process called
training, to handle each new problem.
Within the machine learning field, there is an area that is often referred to as brain-inspired
computation. Since the brain is currently the best machine we know for learning and solving
problems, it is a natural place to look for a machine learning approach. Therefore, a
brain-inspired computation is a program or algorithm that takes some aspects of its basic form or
functionality from the way the brain works. This is in contrast to attempts to create a brain;
rather, the program aims to emulate some aspects of how we understand the brain to operate.
Although scientists are still exploring the details of how the brain works, it is generally
believed that the main computational element of the brain is the neuron. There are approximately
86 billion neurons in the average human brain. The neurons themselves are connected together with
a number of elements entering them called dendrites and an element leaving them called an axon as
shown in Fig. 2. The neuron accepts the signals entering it via the dendrites, performs a
computation on those signals, and generates a signal on the axon. These input and output signals
are referred to as activations. The axon of one neuron branches out and is connected to the
dendrites of many other neurons. The connection between a branch of the axon and a dendrite is
called a synapse. There are estimated to be 10^14 to 10^15 synapses in the average human brain.
<<FIGURE>>
Fig. 2. Connections to a neuron in the brain. <<FORMULA>>, <<FORMULA>>, <<FORMULA>>, and b are the
activations, weights, non-linear function and bias, respectively. (Figure adopted from [7].)
A key characteristic of the synapse is that it can scale the signal (x_i) crossing it as shown in
Fig. 2. That scaling factor can be referred to as a weight (<<FORMULA>>), and the way the brain is
believed to learn is through changes to the weights associated with the synapses. Thus, different
weights result in different responses to an input. Note that learning is the adjustment of the
weights in response to a learning stimulus, while the organization (what might be thought of as
the program) of the brain does not change. This characteristic makes the brain an excellent
inspiration for a machine-learning-style algorithm.
Within the brain-inspired computing paradigm there is a subarea called spiking computing. In this
subarea, inspiration is taken from the fact that the communication on the dendrites and axons
consists of spike-like pulses and that the information being conveyed is not just based on a
spike's amplitude. Instead, it also depends on the time the pulse arrives, and the computation
that happens in the neuron is a function of not just a single value but the width of the pulse and
the timing relationship between different pulses. An example of a project that was inspired by the
spiking of the brain is the IBM TrueNorth [8]. In contrast to spiking computing, another subarea
of brain-inspired computing is called neural networks, which is the focus of this article. 1
1 Note: Recent work using TrueNorth in a stylized fashion allows it to be used to compute reduced
precision neural networks [9]. These types of neural networks are discussed in Section VII-A.
B. Neural Networks and Deep Neural Networks (DNNs)
Neural networks take their inspiration from the notion that a neuron's computation involves a
weighted sum of the input values. These weighted sums correspond to the value scaling performed by
the synapses and the combining of those values in the neuron. Furthermore, the neuron doesn't just
output that weighted sum, since the computation associated with a cascade of neurons would then be
a simple linear algebra operation. Instead there is a functional operation within the neuron that
is performed on the combined inputs. This operation appears to be a non-linear function that
causes a neuron to generate an output only if the inputs cross some threshold. Thus by analogy,
neural networks apply a non-linear function to the weighted sum of the input values. We look at
what some of those non-linear functions are in Section III-A1.
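The neuron model described in this subsection (a non-linear function applied to a weighted sum of input activations) can be sketched in a few lines. This is a minimal illustration rather than code from the article: the function names, the choice of a ReLU-style threshold as the non-linearity, and the numeric values are all assumptions made for the example.

```python
def relu(z):
    """Threshold-style non-linearity (assumed here): pass positive values, else output zero."""
    return max(0.0, z)


def neuron(inputs, weights, bias=0.0, activation=relu):
    """Apply a non-linear function to the weighted sum of the input activations."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical input activations and weights, chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]

y_active = neuron(x, w, bias=1.0)  # weighted sum is 0.25, above the threshold
y_silent = neuron(x, w, bias=0.0)  # weighted sum is -0.75, so the output is zero
```

With the same inputs and weights, the neuron produces an output only when the weighted sum crosses the threshold, which is the behavior that a purely linear weighted sum cannot express.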
<<FIGURE>>
Fig. 3. Simple neural network example and terminology (Figure adopted from [7]).
Fig. 3(a) shows a diagrammatic picture of a computational neural network. The neurons in the input
layer receive some values and propagate them to the neurons in the middle layer of the network,
which is also frequently called a hidden layer. The weighted sums from one or more hidden layers
are ultimately propagated to the output layer, which presents the final outputs of the network to
the user. To align brain-inspired terminology with neural networks, the outputs of the neurons are
often referred to as activations, and the synapses are often referred to as weights as shown in
Fig. 3(a). We will use the activation/weight nomenclature in this article.
Fig. 3(b) shows an example of the computation at each layer: <<FORMULA>>, where W_ij, x_i and y_j
are the weights, input activations and output activations, respectively, and <<FORMULA>> is a
non-linear function described in Section III-A1. The bias term b is omitted from Fig. 3(b) for
simplicity.
Within the domain of neural networks, there is an area called deep learning, in which the neural
networks have more than three layers, i.e., more than one hidden layer. Today, the typical numbers
of network layers used in deep learning range from five to more than a thousand. In this article,
we will generally use the terminology deep neural networks (DNNs) to refer to the neural networks
used in deep learning.
DNNs are capable of learning high-level features with more complexity and abstraction than
shallower neural networks. An example that demonstrates this point is using DNNs to process visual
data. In these applications, pixels of an image are fed into the first layer of a DNN, and the
outputs of that layer can be interpreted as representing the presence of different low-level
features in the image, such as lines and edges. At subsequent layers, these features are then
combined into a measure of the likely presence of higher level features, e.g., lines are combined
into shapes, which are further combined into sets of shapes. And finally, given all this
information, the network provides a probability that these high-level features comprise a
particular object or scene. This deep feature hierarchy enables DNNs to achieve superior
performance in many tasks.
C. Inference versus Training
Since DNNs are an instance of a machine learning algorithm, the basic program does not change as
it learns to perform its given tasks. In the specific case of DNNs, this learning involves
determining the value of the weights (and bias) in the network, and is referred to as training the
network. Once trained, the program can perform its task by computing the output of the network
using the weights determined during the training process. Running the program with these weights
is referred to as inference.
In this section, we will use image classification, as shown in Fig. 6, as a driving example for
training and using a DNN. When we perform inference using a DNN, we give an input image and the
output of the DNN is a vector of scores, one for each object class; the class with the highest
score indicates the most likely class of object in the image. The overarching goal for training a
DNN is to determine the weights that maximize the score of the correct class and minimize the
scores of the incorrect classes. When training the network the correct class is often known
because it is given for the images used for training (i.e., the training set of the network). The
gap between the ideal correct scores and the scores computed by the DNN based on its current
weights is referred to as the loss (L). Thus the goal of training DNNs is to find a set of weights
to minimize the average loss over a large training set.
When training a network, the weights (w_ij) are usually updated using a hill-climbing optimization
process called gradient descent. A multiple of the gradient of the loss relative to each weight,
which is the partial derivative of the loss with respect to the weight, is used to update the
weight (i.e., updated <<FORMULA>>, where <<FORMULA>> is called the learning rate). Note that this
gradient indicates how the weights should change in order to reduce the loss. The process is
repeated iteratively to reduce the overall loss.
An efficient way to compute the partial derivatives of the gradient is through a process called
backpropagation. Backpropagation, which is a computation derived from the chain rule of calculus,
operates by passing values backwards through the network to compute how the loss is affected by
each weight.
<<FIGURE>>
(a) Compute the gradient of the loss relative to the filter inputs
<<FIGURE>>
(b) Compute the gradient of the loss relative to the weights
Fig. 4. An example of backpropagation through a neural network.
This backpropagation computation is, in fact, very similar in form to the computation used for
inference as shown in Fig. 4 [10]. 2 Thus, techniques for efficiently performing inference can
sometimes be useful for performing training.
2 To backpropagate through each filter: (1) compute the gradient of the loss relative to the
weights from the filter inputs (i.e., the forward activations) and the gradients of the loss
relative to the filter outputs; (2) compute the gradient of the loss relative to the filter inputs
from the filter weights and the gradients of the loss relative to the filter outputs.
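The training loop described above (compute the loss, obtain its partial derivative with respect to each weight via the chain rule, then move each weight by a multiple of that gradient) can be sketched for a single linear neuron. This is a minimal sketch under stated assumptions, not the article's formulation: it assumes a squared-error loss, the common update rule w_i <- w_i - learning_rate * dL/dw_i, and a single hypothetical training sample; all function names and values are illustrative.

```python
def forward(w, x):
    """Inference: the output is the weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def loss_and_gradient(w, x, target):
    """Squared-error loss (assumed) and its partial derivative for each weight."""
    y = forward(w, x)  # forward result, reused by the backward computation
    error = y - target
    loss = 0.5 * error ** 2
    # Chain rule: dL/dw_i = dL/dy * dy/dw_i = error * x_i
    grad = [error * xi for xi in x]
    return loss, grad


def train(w, x, target, learning_rate=0.1, steps=50):
    """Repeat the assumed update w_i <- w_i - learning_rate * dL/dw_i."""
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_gradient(w, x, target)
        losses.append(loss)
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w, losses


# Hypothetical single training sample, chosen only for illustration.
weights, history = train(w=[0.0, 0.0], x=[1.0, 2.0], target=1.0)
```

Note that the gradient computation reuses the forward inputs and the forward output, which illustrates why the backward pass resembles inference in form and why values from the forward pass have to be kept around during training.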
It is, however, important to note a couple of points. First, backpropagation requires intermediate
outputs of the network to be preserved for the backwards computation, thus training has increased
storage requirements. Second, because gradients are used for hill-climbing, the precision
requirement for training is generally higher than inference. Thus many of the reduced precision
techniques discussed in Section VII are limited to inference only.
A variety of techniques are used to improve the efficiency and robustness of training. For
example, often the loss from multiple sets of input data, i.e., a batch, is collected before a
single pass of weight update is performed; this helps to speed up and stabilize the training
process.
There are multiple ways to train the weights. The most common approach, as described above, is
called supervised learning, where all the training samples are labeled (e.g., with the correct
class). Unsupervised learning is another approach where none of the training samples are labeled
and essentially the goal is to find the structure or clusters in the data. Semi-supervised
learning falls in between the two approaches where only a small subset of the training data is
labeled (e.g., use unlabeled data to define the cluster boundaries, and use the small amount of
labeled data to label the clusters). Finally, reinforcement learning can be used to train the
weights such that given the state of the current environment, the DNN can output what action the
agent should take next to maximize expected rewards; however, the rewards might not be available
immediately after an action, but instead only after a series of actions.
Another commonly used approach to determine weights is fine-tuning, where previously-trained
weights are available and are used as a starting point and then those weights are adjusted for a
new dataset (e.g., transfer learning) or for a new constraint (e.g., reduced precision). This
results in faster training than starting from a random starting point, and can sometimes result in
better accuracy.
This article will focus on the efficient processing of DNN inference rather than training, since
DNN inference is often performed on embedded devices (rather than the cloud) where resources are
limited, as discussed in more detail later.
D. Development History
Although neural nets were proposed in the 1940s, the first practical application employing
multiple digital neurons didn't appear until the late 1980s with the LeNet network for
hand-written digit recognition [11]. 3 Such systems are widely used by ATMs for digit recognition
on checks. However, the early 2010s have seen a blossoming of DNN-based applications with
highlights such as Microsoft's speech recognition system in 2011 [2] and the AlexNet system for
image recognition in 2012 [3]. A brief chronology of deep learning is shown in Fig. 5.
3 In the early 1960s, single analog neuron systems were used for adaptive filtering [12, 13].
1940s - Neural networks were proposed
1960s - Deep neural networks were proposed
1989 - Neural networks for recognizing digits (LeNet)
1990s - Hardware for shallow neural nets (Intel ETANN)
2011 - Breakthrough DNN-based speech recognition (Microsoft)
2012 - DNNs for vision start supplanting hand-crafted approaches (AlexNet)
2014+ - Rise of DNN accelerator research (Neuflow, DianNao...)
Fig. 5. A concise history of neural networks. "Deep" refers to the number of layers in the
network.
The deep learning successes of the early 2010s are believed to be a confluence of three factors.
The first factor is the amount of available information to train the networks. To learn a powerful
representation (rather than using a hand-crafted approach) requires a large amount of training
data. For example, Facebook receives over 350 million images per day, Walmart creates 2.5
Petabytes of customer data hourly and YouTube has 300 hours of video uploaded every minute. As a
result, the cloud providers and many businesses have a huge amount of data to train their
algorithms.
The second factor is the amount of compute capacity available. Semiconductor device and computer
architecture advances have continued to provide increased computing capability, and we appear to
have crossed a threshold where the large amount of weighted sum computation in DNNs, which is
required for both inference and training, can be performed in a reasonable amount of time.
The successes of these early DNN applications opened the floodgates of algorithmic development. It
has also inspired the development of several (largely open source) frameworks that make it even
easier for researchers and practitioners to explore and use DNNs. Combining these efforts
contributes to the third factor, which is the evolution of the algorithmic techniques that have
improved application accuracy significantly and broadened the domains to which DNNs are being
applied.
An excellent example of the successes in deep learning can be illustrated with the ImageNet
Challenge [14]. This challenge is a contest involving several different components. One of the
components is an image classification task where algorithms are given an image and they must
identify what is in the image, as shown in Fig. 6. The training set consists of 1.2 million
images, each of which is labeled with one of 1000 object categories that the image contains. For
the evaluation phase, the algorithm must accurately identify objects in a test set of images,
which it hasn't previously seen.
Fig. 7 shows the performance of the best entrants in the ImageNet contest over a number of years.
One sees that the accuracy of the algorithms initially had an error rate of 25% or more. In 2012,
a group from the University of Toronto used graphics processing units (GPUs) for their high
compute capability and a deep neural network approach, named AlexNet, and dropped the error rate
by approximately 10 percentage points [3]. Their accomplishment inspired an outpouring of deep
learning style algorithms that have resulted in a steady stream of
improvements.
<<FIGURE>>
Fig. 6. Example of an image classification task. The machine learning platform takes in an image
and outputs the confidence scores for a predefined set of classes.
<<FIGURE>>
Fig. 7. Results from the ImageNet Challenge [14].
In conjunction with the trend to deep learning approaches for the ImageNet Challenge, there has
been a corresponding increase in the number of entrants using GPUs: from only 4 entrants using
GPUs in 2012 to almost all of the entrants (110) in 2014. This reflects the almost complete switch
from traditional computer vision approaches to deep learning-based approaches for the competition.
In 2015, the ImageNet winning entry, ResNet [15], exceeded human-level accuracy with a top-5 error
rate 4 below 5%. Since then, the error rate has dropped below 3% and more focus is now being
placed on more challenging components of the competition, such as object detection and
localization. These successes are clearly a contributing factor to the wide range of applications
to which DNNs are being applied.
4 The top-5 error rate is measured based on whether the correct answer appears in one of the top 5
categories selected by the algorithm.
E. Applications of DNN
Many applications can benefit from DNNs, ranging from multimedia to the medical space. In this
section, we will provide examples of areas where DNNs are currently making an impact and highlight
emerging areas where DNNs may make an impact in the future.
Image and Video: Video is arguably the biggest of the big data. It accounts for over 70% of
today's Internet traffic [16]. For instance, over 800 million hours of video is collected daily
worldwide for video surveillance [17]. Computer vision is necessary to extract meaningful
information from video. DNNs have significantly improved the accuracy of many computer vision
tasks such as image classification [14], object localization and detection [18], image
segmentation [19], and action recognition [20].
Speech and Language: DNNs have significantly improved the accuracy of speech recognition [21] as
well as many related tasks such as machine translation [2], natural language processing [22], and
audio generation [23].
Medical: DNNs have played an important role in genomics to gain insight into the genetics of
diseases such as autism, cancers, and spinal muscular atrophy [24-27]. They have also been used in
medical imaging to detect skin cancer [5], brain cancer [28] and breast cancer [29].
Game Play: Recently, many of the grand AI challenges involving game play have been overcome using
DNNs. These successes also required innovations in training techniques and many rely on
reinforcement learning [30]. DNNs have surpassed human level accuracy in playing Atari [31] as
well as Go [6], where an exhaustive search of all possibilities is not feasible due to the
unimaginably huge number of possible moves.
Robotics: DNNs have been successful in the domain of robotic tasks such as grasping with a robotic
arm [32], motion planning for ground robots [33], visual navigation [4, 34], control to stabilize
a quadcopter [35] and driving strategies for autonomous vehicles [36].
DNNs are already widely used in multimedia applications today (e.g., computer vision, speech
recognition). Looking forward, we expect that DNNs will likely play an increasingly important role
in the medical and robotics fields, as discussed above, as well as finance (e.g., for trading,
energy forecasting, and risk assessment), infrastructure (e.g., structural safety, and traffic
control), weather forecasting and event detection [37]. The myriad application domains pose new
challenges to the efficient processing of DNNs; the solutions then have to be adaptive and
scalable in order to handle the new and varied forms of DNNs that these applications may employ.
F. Embedded versus Cloud
The various applications and aspects of DNN processing (i.e., training versus inference) have
different computational needs. Specifically, training often requires a large dataset 5 and
significant computational resources for multiple weight-update iterations. In many cases, training
a DNN model still takes several hours to multiple days and thus is typically performed in the
cloud. Inference, on the other hand, can happen either in the cloud or at the edge (e.g., IoT or
mobile).
5 One of the major drawbacks of DNNs is their need for large datasets to prevent over-fitting
during training.
In many applications, it is desirable to have the DNN inference processing near the sensor. For
instance, in computer vision applications, such as measuring wait times in stores or predicting
traffic patterns, it would be desirable to extract meaningful information from the video right at
the image sensor rather than in the cloud to reduce the communication cost. For other applications
such as autonomous vehicles, drone navigation and robotics, local processing is desired since the
latency and security risks of relying on the cloud are too high. However, video involves a large
amount of data, which is computationally complex to process; thus, low cost hardware to analyze
video is challenging yet critical to enabling
these applications. Speech recognition enables us to seamlessly interact with electronic devices,
such as smartphones. While currently most of the processing for applications such as Apple Siri
and Amazon Alexa voice services is in the cloud, it is still desirable to perform the recognition
on the device itself to reduce latency and dependency on connectivity, and to improve privacy and
security.
Many of the embedded platforms that perform DNN inference have stringent energy consumption,
compute and memory cost limitations; efficient processing of DNNs has thus become of prime
importance under these constraints. Therefore, in this article, we will focus on the compute
requirements for inference rather than training.
III. OVERVIEW OF DNNs
DNNs come in a wide variety of shapes and sizes depending on the application. The popular shapes
and sizes are also evolving rapidly to improve accuracy and efficiency. In all cases, the input to
a DNN is a set of values representing the information to be analyzed by the network. For instance,
these values can be pixels of an image, sampled amplitudes of an audio wave or the numerical
representation of the state of some system or game.
The networks that process the input come in two major forms: feed forward and recurrent as shown
in Fig. 8(a). In feed-forward networks all of the computation is performed as a sequence of
operations on the outputs of a previous layer. The final set of operations generates the output of
the network, for example a probability that an image contains a particular object, the probability
that an audio sequence contains a particular word, a bounding box in an image around an object or
the proposed action that should be taken. In such DNNs, the network has no memory and the output
for an input is always the same irrespective of the sequence of inputs previously given to the
network.
In contrast, recurrent neural networks (RNNs), of which Long Short-Term Memory networks (LSTMs)
[38] are a popular variant, have internal memory to allow long-term dependencies to affect the
output. In these networks, some intermediate operations generate values that are stored internally
in the network and used as inputs to other operations in conjunction with the processing of a
later input. In this article, we will focus on feed-forward networks since (1) the major
computation in RNNs is still the weighted sum, which is covered by the feed-forward networks, and
(2) to-date little attention has been given to hardware acceleration specifically for RNNs.
[Fig. 8: (a) feed-forward and recurrent networks; (b) fully-connected and sparsely-connected
layers]
DNNs can be composed solely of fully-connected (FC) layers (also referred to as multi-layer
perceptrons, or MLP) as shown in the leftmost layer of Fig. 8(b). In an FC layer, all output
activations are composed of a weighted sum of all input activations (i.e., all outputs are
connected to all inputs). This requires a significant amount of storage and computation.
Thankfully, in many applications, we can remove some connections between the activations by
setting the weights to zero without affecting accuracy. This results in a sparsely connected
layer. A sparsely connected layer is illustrated in the rightmost layer of Fig. 8(b).
We can also make the computation more efficient by limiting the number of weights that contribute
to an output. This sort of structured sparsity can arise if each output is only a function of a
fixed-size window of inputs. Even further efficiency can be gained if the same set of weights is
used in the calculation of every output. This repeated use of the same weight values is called
weight sharing and can significantly reduce the storage requirements for weights.
6 Note: the structured sparsity in CONV layers is orthogonal to the sparsity that occurs from
network pruning as described in Section VII-B2.
An extremely popular windowed and weight-shared DNN layer arises by structuring the computation as
a convolution, as shown in Fig. 9(a), where the weighted sum for each output activation is
computed using only a small neighborhood of input activations (i.e., all weights beyond the
neighborhood are set to zero), and where the same set of weights is shared for every output (i.e.,
the filter is space invariant). Such convolution-based layers are referred to as convolutional
(CONV) layers.
A. Convolutional Neural Networks (CNNs)
A common form of DNNs is Convolutional Neural Nets (CNNs), which are composed of multiple CONV
layers as shown in Fig. 10. In such networks, each layer generates a successively higher-level
abstraction of the input data, called a feature map (fmap), which preserves essential yet unique
information. Modern CNNs are able to achieve superior performance by employing a very deep
hierarchy of layers. CNNs are widely used in a variety of applications including image
understanding [3], speech recognition [39], game play [6], robotics [32], etc. This paper will
focus on their use in image processing, specifically for the task of image classification [3].
Each of the CONV layers in a CNN is primarily composed of high-dimensional convolutions as shown
in Fig. 9(b). In this computation, the input activations of a layer are structured as a set of 2-D
input feature maps (ifmaps), each of which is called a channel. Each channel is convolved with a
distinct 2-D filter from the stack of filters, one for each channel; this stack of 2-D filters is
often referred to as a single 3-D filter. The results of the convolution at each point are summed
across all the channels. In addition, a 1-D bias can be added to the filtering results, but some
recent networks [15] remove its usage from parts of the layers. The result of this computation is
the output activations that comprise one channel of output feature map (ofmap). Additional 3-D
filters can be used on
the same input to create additional output channels. Finally, multiple input feature maps may be
processed together as a batch to potentially improve reuse of the filter weights.
Fig. 10. Convolutional Neural Networks. (Figure labels: Fully Connected, Optional.)
Given the shape parameters in Table I, the computation of a CONV layer is defined as
<<FORMULA>>
<<FORMULA>>
<<FORMULA>>;
<<FORMULA>>;
<<FORMULA>> (1)
O, I, W and B are the matrices of the ofmaps, ifmaps, filters and biases, respectively. U is a
given stride size. Fig. 9(b) shows a visualization of this computation (ignoring biases).
To align the terminology of CNNs with the generic DNN,
- filters are composed of weights (i.e., synapses)
- input and output feature maps (ifmaps, ofmaps) are composed of activations (i.e., input and
output neurons)
after the CONV layers for classification purposes. A FC layer also applies filters on the ifmaps
as in the CONV layers, but the filters are of the same size as the ifmaps. Therefore, it does not
have the weight sharing property of CONV layers. Eq. (1) still holds for the computation of FC
layers with a few additional constraints on the shape parameters: <<FORMULA>>, <<FORMULA>>,
<<FORMULA>>, and <<FORMULA>>.
In addition to CONV and FC layers, various optional layers can be found in a DNN such as the
non-linearity, pooling, and normalization. The function and computations for each of these layers
are discussed next.
1) Non-Linearity: A non-linear activation function is typically applied after each CONV or FC
layer. Various non-linear functions are used to introduce non-linearity into the DNN as shown in
Fig. 11. These include historically conventional non-linear functions such as sigmoid or
hyperbolic tangent as well as rectified linear unit (ReLU) [40], which has become popular in
recent years due to its simplicity and its ability to enable fast training. Variations of ReLU,
such as leaky ReLU [41], parametric ReLU [42], and exponential LU [43] have also been explored for
improved accuracy. Finally, a non-linearity called maxout, which takes the max value of two
intersecting linear functions, has been shown to be effective in speech recognition tasks [44,
45].
2) Pooling: A variety of computations that reduce the dimensionality of a feature map are referred
to as pooling. Pooling, which is applied to each channel separately, enables
the network to be robust and invariant to small shifts and
distortions. Pooling combines, or pools, a set of values in
its receptive field into a smaller number of values. It can be
configured based on the size of its receptive field (e.g., 2x2)
and pooling operation (e.g., max or average), as shown in
Fig. 12. Typically pooling occurs on non-overlapping blocks
(i.e., the stride is equal to the size of the pooling). Usually a
stride of greater than one is used such that there is a reduction
in the dimension of the representation (i.e., feature map).
<<FIGURE>>
Fig. 12. Various forms of pooling (Figure adopted from Caffe Tutorial [46]).
3) Normalization: Controlling the input distribution across
layers can help to significantly speed up training and improve
accuracy. Accordingly, the distribution of the layer input
activations ($\sigma$, $\mu$) is normalized such that it has a zero mean
and a unit standard deviation. In batch normalization (BN),
the normalized value is further scaled and shifted, as shown
in Eq. (2), where the parameters ($\gamma$, $\beta$) are learned from
training [47]. $\epsilon$ is a small constant to avoid numerical problems.
Prior to this, local response normalization (LRN) [3] was
used, which was inspired by lateral inhibition in neurobiology
where excited neurons (i.e., high value activations) should
subdue their neighbors (i.e., cause low value activations); however,
BN is now considered standard practice in the design of
CNNs while LRN is mostly deprecated. Note that while LRN
usually is performed after the non-linear function, BN is mostly
performed between the CONV or FC layer and the non-linear
function.
$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta$ (2)
B. Popular DNN Models
Many DNN models have been developed over the past
two decades. Each of these models has a different network
architecture in terms of number of layers, layer types, layer
shapes (i.e., filter size, number of channels and filters), and
connections between layers. Understanding these variations
and trends is important for incorporating the right flexibility
in any efficient DNN engine.
In this section, we will give an overview of various popular
DNNs such as LeNet [48] as well as those that competed in
and/or won the ImageNet Challenge [14] as shown in Fig. 7,
most of whose models with pre-trained weights are publicly
available for download; the DNN models are summarized in
Table II. Two top-5 error results are reported. In the
first row, the accuracy is boosted by using multiple crops from
the image and an ensemble of multiple trained models (i.e.,
the DNN needs to be run several times); these results were
used to compete in the ImageNet Challenge. The second row
reports the accuracy if only a single crop was used (i.e., the
DNN is run only once), which is more consistent with what
would likely be deployed in real-time and/or energy-constrained
applications.
LeNet [11] was one of the first CNN approaches introduced
in 1989. It was designed for the task of digit classification in
grayscale images of size 28x28. The most well known version,
LeNet-5, contains two CONV layers and two FC layers [48].
Each CONV layer uses filters of size 5x5 (1 channel per filter)
with 6 filters in the first layer and 16 filters in the second layer.
Average pooling of 2x2 is used after each convolution and a
sigmoid is used for the non-linearity. In total, LeNet requires
60k weights and 341k multiply-and-accumulates (MACs) per
image. LeNet led to CNNs' first commercial success, as it was
deployed in ATMs to recognize digits for check deposits.
AlexNet [3] was the first CNN to win the ImageNet Challenge
in 2012. It consists of five CONV layers and three FC layers.
Within each CONV layer, there are 96 to 384 filters and the
filter size ranges from 3x3 to 11x11, with 3 to 256 channels
each. In the first layer, the 3 channels of the filter correspond
to the red, green and blue components of the input image.
A ReLU non-linearity is used in each layer. Max pooling of
3x3 is applied to the outputs of layers 1, 2 and 5. To reduce
computation, a stride of 4 is used at the first layer of the
network. AlexNet introduced the use of LRN in layers 1 and
2 before the max pooling, though LRN is no longer popular
in later CNN models. One important factor that differentiates
AlexNet from LeNet is that the number of weights is much
larger and the shapes vary from layer to layer. To reduce the
number of weights and computation in the second CONV layer,
the 96 output channels of the first layer are split into two groups
of 48 input channels for the second layer, such that the filters in
the second layer only have 48 channels. Similarly, the weights
in the fourth and fifth layers are also split into two groups. In total,
AlexNet requires 61M weights and 724M MACs to process
one 227x227 input image.
Overfeat [49] has a very similar architecture to AlexNet with
five CONV layers and three FC layers. The main differences
are that the number of filters is increased for layers 3 (384
to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not
split into two groups, the first fully connected layer only has
3072 channels rather than 4096, and the input size is 231x231
rather than 227x227. As a result, the number of weights grows
to 146M and the number of MACs grows to 2.8G per image.
Overfeat has two different models: fast (described here) and
accurate. The accurate model used in the ImageNet Challenge
gives a 0.65% lower top-5 error rate than the fast model at the
cost of 1.9x more MACs.
VGG-16 [50] goes deeper to 16 layers consisting of 13
CONV layers and 3 FC layers. In order to balance out the
cost of going deeper, larger filters (e.g., 5x5) are built from
multiple smaller filters (e.g., 3x3), which have fewer weights,
to achieve the same receptive fields as shown in Fig. 13(a).
As a result, all CONV layers have the same filter size of 3x3.
In total, VGG-16 requires 138M weights and 15.5G MACs
to process one 224x224 input image. VGG has two different
models: VGG-16 (described here) and VGG-19. VGG-19 gives
a 0.1% lower top-5 error rate than VGG-16 at the cost of
1.27x more MACs.
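The trade-off behind building a 5x5 filter from stacked 3x3 filters can be checked with a little arithmetic. The sketch below is only illustrative: the helper names are hypothetical, biases are ignored, stride 1 is assumed, and the choice of 64 input and output channels is an assumption made for the example rather than a value taken from the text.

```python
def conv_weights(filter_size, channels_in, channels_out, num_layers=1):
    """Weights in a stack of square CONV layers (biases ignored).

    Assumes every layer in the stack has the same channel counts.
    """
    return num_layers * filter_size * filter_size * channels_in * channels_out


def stacked_receptive_field(filter_size, num_layers):
    """Receptive field of stacked stride-1 filters.

    Each additional layer widens the field by (filter_size - 1).
    """
    return 1 + num_layers * (filter_size - 1)


# Two stacked 3x3 layers versus a single 5x5 layer, 64 channels (assumed).
channels = 64
field = stacked_receptive_field(3, 2)
stacked = conv_weights(3, channels, channels, num_layers=2)
single = conv_weights(5, channels, channels)
print(field, stacked, single)
```

Under these assumptions the two 3x3 layers cover the same 5x5 receptive field with 18 weights per channel pair instead of 25.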
<<FIGURE>>
Fig. 13. Decomposing larger filters into smaller filters.
<<FIGURE>>
Fig. 14. Inception module from GoogLeNet [51] with example channel lengths.
GoogLeNet[51] goes even deeper with 22 layers. It in-
troduced an inception module, shown in Fig. 14, which is
composed of parallel connections, whereas previously there
was only a single serial connection. Different sized filters (i.e.,
1x1, 3x3, 5x5), along with 3x3 max-pooling, are used for
each parallel connection and their outputs are concatenated
for the module output. Using multiple filter sizes has the
effect of processing the input at multiple scales. For improved
training speed, GoogLeNet is designed such that the weights
and the activations, which are stored for backpropagation during
training, could all fit into the GPU memory. In order to reduce
the number of weights, 1x1 filters are applied as a bottleneck
to reduce the number of channels for each filter [52]. The 22
layers consist of three CONV layers, followed by 9 inception
layers (each of which is two CONV layers deep), and one FC
layer. Since its introduction in 2014, GoogLeNet (also referred
to as Inception) has multiple versions: v1 (described here), v3 (see footnote 7)
and v4. Inception-v3 decomposes the convolutions by using
smaller 1-D filters as shown in Fig. 13(b) to reduce number
of MACs and weights in order to go deeper to 42 layers.
In conjunction with batch normalization [47], v3 achieves
over 3% lower top-5 error than v1 with 2.5x increase in
computation [53]. Inception-v4 uses residual connections [54],
described in the next section, for a 0.4% reduction in error.
Footnote 7: v2 is very similar to v3.
ResNet [15], also known as Residual Net, uses residual
connections to go even deeper (34 layers or more). It was
the first entry DNN in ImageNet Challenge that exceeded
human-level accuracy with a top-5 error rate below 5%. One
of the challenges with deep networks is the vanishing gradient
during training: as the error backpropagates through the network
the gradient shrinks, which affects the ability to update the
weights in the earlier layers for very deep networks. Residual
net introduces a shortcut module which contains an identity
connection such that the weight layers (i.e., CONV layers)
can be skipped as shown in Fig. 15. Rather than learning the
function for the weight layers F(x), the shortcut module learns
the residual mapping (F(x) = H(x) - x). Initially, F(x) is
zero and the identity connection is taken; then gradually during
training, the actual forward connection through the weight layer
is used. This is similar to the LSTM networks that are used for
sequential data. ResNet also uses the bottleneck approach of
using 1x1 filters to reduce the number of weight parameters.
As a result, the two layers in the shortcut module are replaced
by three layers (1x1, 3x3, 1x1) where the 1x1 reduces and
then increases (restores) the number of weights. ResNet-50
consists of one CONV layer, followed by 16 shortcut layers
(each of which is three CONV layers deep), and one FC
layer; it requires 25.5M weights and 3.9G MACs per image.
There are various versions of ResNet with multiple depths
(e.g., without bottleneck: 18, 34; with bottleneck: 50, 101, 152).
The ResNet with 152 layers was the winner of the ImageNet
Challenge requiring 11.3G MACs and 60M weights. Compared
to ResNet-50, it reduces the top-5 error by around 1% at the
cost of 2.9x more MACs and 2.5x more weights.
<<FIGURE>>
Fig. 15. Shortcut module from ResNet [15]. Note that ReLU following last
CONV layer in short cut is after the addition.
Several trends can be observed in the popular DNNs shown
in Table II. Increasing the depth of the network tends to provide
higher accuracy. Controlling for number of weights, a deeper
network can support a wider range of non-linear functions
that are more discriminative and also provides more levels
of hierarchy in the learned representation [15,50,51,55].
The number of filter shapes continues to vary across layers,
thus flexibility is still important. Furthermore, most of the
computation has been placed on CONV layers rather than FC
layers. In addition, the number of weights in the FC layers is
reduced and in most recent networks (since GoogLeNet) the
CONV layers also dominate in terms of weights. Thus, the
focus of hardware implementations should be on addressing
the efficiency of the CONV layers, which in many domains
are increasingly important.
IV. DNN DEVELOPMENT RESOURCES
One of the key factors that has enabled the rapid development
of DNNs is the set of development resources that have been
made available by the research community and industry. These
resources are also key to the development of DNN accelerators
by providing characterizations of the workloads and facilitating
the exploration of trade-offs in model complexity and accuracy.
This section will describe these resources such that those who
are interested in this field can quickly get started.
A. Frameworks
For ease of DNN development and to enable sharing of
trained networks, several deep learning frameworks have been
developed from various sources. These open source libraries
contain software libraries for DNNs. Caffe was made available
in 2014 from UC Berkeley [46]. It supports C, C++, Python
and MATLAB. Tensorflow was released by Google in 2015,
and supports C++ and Python; it also supports multiple CPUs
and GPUs and has more flexibility than Caffe, with the
computation expressed as dataflow graphs to manage the
tensors (multidimensional arrays). Another popular framework
is Torch, which was developed by Facebook and NYU and
supports C, C++ and Lua. There are several other frameworks
such as Theano, MXNet, CNTK, which are described in [60].
There are also higher-level libraries that can run on top of
the aforementioned frameworks to provide a more universal
experience and faster development. One example of such
libraries is Keras, which is written in Python and supports
Tensorflow, CNTK and Theano.
The existence of such frameworks is not only a convenient
aid for DNN researchers and application designers, but is
also invaluable for engineering high performance or more
efficient DNN computation engines. In particular, because the
frameworks make heavy use of a set of primitive operations,
such as processing of a CONV layer, they can incorporate use of
optimized software or hardware accelerators. This acceleration
is transparent to the user of the framework. Thus, for example,
most frameworks can use Nvidia's cuDNN library for rapid
execution on Nvidia GPUs. Similarly, transparent incorporation
of dedicated hardware accelerators can be achieved as was
done with the Eyeriss chip [61].
Finally, these frameworks are a valuable source of workloads
for hardware researchers. They can be used to drive experimental
designs for different workloads, for profiling different
workloads and for exploring hardware-software trade-offs.
B. Models
Pretrained DNN models can be downloaded from various
websites [56-59] for the various different frameworks. It should
be noted that even for the same DNN (e.g., AlexNet) the
accuracy of these models can vary by around 1% to 2%
depending on how the model was trained, and thus the results
do not always exactly match the original publication.
C. Popular Datasets for Classification
It is important to factor in the difficulty of the task when
comparing different DNN models. For instance, the task of
classifying handwritten digits from the MNIST dataset [62]
is much simpler than classifying an object into one of 1000
classes as is required for the ImageNet dataset [14] (Fig. 16).
It is expected that the size of the DNNs (i.e., number of
weights) and the number of MACs will be larger for the more
difficult task than the simpler task and thus require more
energy and have lower throughput. For instance, LeNet-5 [48]
is designed for digit classification, while AlexNet [3],
VGG-16 [50], GoogLeNet [51], and ResNet [15] are designed for the
1000-class ImageNet dataset.
There are many AI tasks that come with publicly available
datasets. Public datasets are important for comparing the accuracy of
different approaches. The simplest and most common task
is image classification, which involves being given an entire
image, and selecting 1 of N classes that the image most likely
belongs to. There is no localization or detection.
MNIST is a widely used dataset for digit classification
that was introduced in 1998 [62]. It consists of 28x28 pixel
grayscale images of handwritten digits. There are 10 classes
(for 10 digits) and 60,000 training images and 10,000 test
images. LeNet-5 was able to achieve an accuracy of 99.05%
when MNIST was first introduced. Since then the accuracy has
increased to 99.79% using regularization of neural networks
with dropconnect [63]. Thus, MNIST is now considered a fairly
easy dataset.
CIFAR is a dataset that consists of 32x32 pixel colored
images of various objects, which was released in 2009 [64].
CIFAR is a subset of the 80 million Tiny Image dataset [65].
CIFAR-10 is composed of 10 mutually exclusive classes. There
are 50,000 training images (5000 per class) and 10,000 test
images (1000 per class). A two-layer convolutional deep belief
network was able to achieve 64.84% accuracy on CIFAR-10
when it was first introduced [66]. Since then the accuracy has
increased to 96.53% using fractional max pooling [67].
ImageNet is a large scale image dataset that was first
introduced in 2010; the dataset stabilized in 2012 [14]. It
contains images of 256x256 pixel in color with 1000 classes.
The classes are defined using the WordNet as a backbone to
handle ambiguous word meanings and to combine together
synonyms into the same object category. In other words, there
is a hierarchy for the ImageNet categories. The 1000 classes
were selected such that there is no overlap in the ImageNet
hierarchy. The ImageNet dataset contains many fine-grained
categories including 120 different breeds of dogs. There are
1.3M training images (732 to 1300 per class), 100,000 testing
images (100 per class) and 50,000 validation images (50 per
class).
<<TABLE>>
TABLE II
SUMMARY OF POPULAR DNNS [3,15,48,50,51]. †ACCURACY IS MEASURED BASED ON TOP-5 ERROR ON IMAGENET [14]. ‡THIS VERSION OF LENET-5
HAS 431K WEIGHTS FOR THE FILTERS AND REQUIRES 2.3M MACS PER IMAGE, AND USES RELU RATHER THAN SIGMOID.
<<FIGURE>>
Fig. 16. MNIST (10 classes, 60k training, 10k testing) [62] vs. ImageNet
(1000 classes, 1.3M training, 100k testing) [14] dataset.
The accuracy of the ImageNet Challenge is reported using
two metrics: Top-5 and Top-1 error. Top-5 error means that if
any of the top five scoring categories is the correct category,
it is counted as a correct classification. The Top-1 requires
that the top scoring category be correct. In 2012, the winner
of the ImageNet Challenge (AlexNet) was able to achieve an
accuracy of 83.6% for the top-5 (which is substantially better
than the 73.8% of the second-place entry that year, which did not
use DNNs); it achieved 61.9% on the top-1 of the validation
set. In 2017, the highest accuracy was 97.7% for the top-5.
In summary of the various image classification datasets, it
is clear that MNIST is a fairly easy dataset, while ImageNet
is a challenging one with a wider coverage of classes. Thus
in terms of evaluating the accuracy of a given DNN, it is
important to consider the dataset upon which the accuracy is
measured.
D. Datasets for Other Tasks
Since state-of-the-art DNNs now perform better than
human-level accuracy on image classification
tasks, the ImageNet Challenge has started to focus on more
difficult tasks such as single-object localization and object
detection. For single-object localization, the target object must
be localized and classified (out of 1000 classes). The DNN
outputs the top five categories and top five bounding box
locations. There is no penalty for identifying an object that
is in the image but not included in the ground truth. For
object detection, all objects in the image must be localized
and classified (out of 200 classes). The bounding box for all
objects in these categories must be labeled. Objects that are
not labeled are penalized as are duplicated detections.
Beyond ImageNet, there are also other popular image
datasets for computer vision tasks. For object detection, there
is the PASCAL VOC (2005-2012) dataset that contains 11k
images representing 20 classes (27k object instances, 7k of
which have detailed segmentation) [68]. For object detection,
segmentation and recognition in context, there is the MS COCO
dataset with 2.5M labeled instances in 328k images (91 object
categories) [69]; compared to ImageNet, COCO has fewer
categories but more instances per category, which is useful for
precise 2-D localization. COCO also has more labeled instances
per image to potentially help with contextual information.
Most recently even larger scale datasets have been made
available. For instance, Google has an Open Images dataset
with over 9M images [70], spanning 6000 categories. There is
also a YouTube dataset with 8M videos (0.5M hours of video)
covering 4800 classes [71]. Google also released an audio
dataset comprised of 632 audio event classes and a collection
of 2M human-labeled 10-second sound clips [72]. These large
datasets will be ever more important as DNNs become deeper
with more weight parameters to train.
Undoubtedly, both larger datasets and datasets for new
domains will serve as important resources for profiling and
exploring the efficiency of future DNN engines.
V. HARDWARE FOR DNN PROCESSING
Due to the popularity of DNNs, many recent hardware
platforms have special features that target DNN processing. For
instance, the Intel Knights Landing CPU features special vector
instructions for deep learning; the Nvidia PASCAL GP100
GPU features 16-bit floating point (FP16) arithmetic support
to perform two FP16 operations on a single precision core for
faster deep learning computation. Systems have also been built
specifically for DNN processing such as Nvidia DGX-1 and
Facebook's Big Basin custom DNN server [73]. DNN inference
has also been demonstrated on various embedded System-on-
Chips (SoC) such as Nvidia Tegra and Samsung Exynos as
well as FPGAs. Accordingly, it is important to have a good
understanding of how the processing is being performed on
these platforms, and how application-specific accelerators can
be designed for DNNs for further improvement in throughput
and energy efficiency.
<<FIGURE>>
Fig. 17. Highly-parallel compute paradigms.
The fundamental component of both the CONV and FC layers
is the multiply-and-accumulate (MAC) operation, which
can be easily parallelized. In order to achieve high performance,
highly-parallel compute paradigms are very commonly used,
including both temporal and spatial architectures as shown in
Fig. 17. The temporal architectures appear mostly in CPUs
or GPUs, and employ a variety of techniques to improve
parallelism such as vectors (SIMD) or parallel threads (SIMT).
Such temporal architectures use a centralized control for a large
number of ALUs. These ALUs can only fetch data from the
memory hierarchy and cannot communicate directly with each
other. In contrast, spatial architectures use dataflow processing,
i.e., the ALUs form a processing chain so that they can pass data
from one to another directly. Sometimes each ALU can have
its own control logic and local memory, called a scratchpad or
register file. We refer to the ALU with its own local memory as
a processing engine (PE). Spatial architectures are commonly
used for DNNs in ASIC and FPGA-based designs. In this
section, we will discuss the different design strategies for
efficient processing on these different platforms, without any
impact on accuracy (i.e., all approaches in this section produce
bit-wise identical results); specifically,
* For temporal architectures such as CPUs and GPUs, we
will discuss how computational transforms on the kernel
can reduce the number of multiplications to increase
throughput.
* For spatial architectures used in accelerators, we will
discuss how dataflows can increase data reuse from low
cost memories in the memory hierarchy to reduce energy
consumption.
A. Accelerate Kernel Computation on CPU and GPU Platforms
CPUs and GPUs use parallelization techniques such as SIMD
or SIMT to perform the MACs in parallel. All the ALUs share
the same control and memory (register file). On these platforms,
both the FC and CONV layers are often mapped to a matrix
multiplication (i.e., the kernel computation). Fig. 18 shows how
a matrix multiplication is used for the FC layer. The height of
the filter matrix is the number of filters and the width is the
number of weights per filter (input channels (C) x width (W) x
height (H), since R = W and S = H in the FC layer);
the height of the input feature maps matrix is the number of
activations per input feature map (C x W x H), and the
width is the number of input feature maps (one in Fig. 18(a)
and N in Fig. 18(b)); finally, the height of the output feature
map matrix is the number of channels in the output feature
maps (M), and the width is the number of output feature maps
(N), where each output feature map of the FC layer has the
dimension of 1 x 1 x number of output channels (M).
<<FIGURE>>
Fig. 18. Mapping to matrix multiplication for fully connected layers.
The CONV layer in a DNN can also be mapped to a matrix
multiplication using a relaxed form of the Toeplitz matrix as
shown in Fig. 19. The downside of using matrix multiplication
for the CONV layers is that there is redundant data in the input
feature map matrix as highlighted in Fig. 19(a). This can lead
to either inefficiency in storage, or a complex memory access
pattern.
There are software libraries designed for CPUs (e.g.,
OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN,
etc.) that optimize for matrix multiplications. The matrix
multiplication is tiled to the storage hierarchy of these platforms,
which is on the order of a few megabytes at the higher levels.
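The mapping of a CONV layer to a matrix multiplication can be sketched in a few lines of NumPy. This is a minimal illustration, not how the libraries named above implement it: the function names are hypothetical, a single input feature map with no padding is assumed, and R and S are taken here as the filter height and width. Each column of the unrolled input matrix holds one flattened receptive field, so overlapping windows repeat input values, which is the redundancy noted above.

```python
import numpy as np


def conv2d_direct(ifmap, filters, stride=1):
    """Reference CONV layer. ifmap: (C, H, W); filters: (M, C, R, S)."""
    C, H, W = ifmap.shape
    M, _, R, S = filters.shape
    E = (H - R) // stride + 1  # output height
    F = (W - S) // stride + 1  # output width
    out = np.zeros((M, E, F))
    for m in range(M):
        for y in range(E):
            for x in range(F):
                ys, xs = y * stride, x * stride
                window = ifmap[:, ys:ys + R, xs:xs + S]
                out[m, y, x] = np.sum(window * filters[m])
    return out


def conv2d_as_matmul(ifmap, filters, stride=1):
    """Same layer expressed as filter_matrix @ unrolled_input."""
    C, H, W = ifmap.shape
    M, _, R, S = filters.shape
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    # One column per output position, one row per weight of a filter.
    cols = np.empty((C * R * S, E * F))
    for y in range(E):
        for x in range(F):
            ys, xs = y * stride, x * stride
            cols[:, y * F + x] = ifmap[:, ys:ys + R, xs:xs + S].ravel()
    filter_matrix = filters.reshape(M, C * R * S)  # one row per filter
    return (filter_matrix @ cols).reshape(M, E, F), cols


rng = np.random.default_rng(0)
ifmap = rng.standard_normal((3, 6, 6))
filters = rng.standard_normal((4, 3, 3, 3))
out, cols = conv2d_as_matmul(ifmap, filters)
print(np.allclose(out, conv2d_direct(ifmap, filters)), ifmap.size, cols.size)
```

The unrolled matrix is larger than the original input feature map, which shows the storage cost of this mapping; avoiding the explicit copy is what leads to the more complex memory access pattern.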
<<FIGURE>>
Fig. 19. Mapping to matrix multiplication for convolutional layers.
The matrix multiplications on these platforms can be further
sped up by applying computational transforms to the data to
reduce the number of multiplications, while still giving the
same bit-wise result. Often this can come at a cost of increased
number of additions and a more irregular data access pattern.
Fast Fourier Transform (FFT) [10,74] is a well known
approach, shown in Fig. 20, that reduces the number of
multiplications from $O(N_o^2 N_f^2)$ to $O(N_o^2 \log_2 N_o)$, where the
output size is $N_o \times N_o$ and the filter size is $N_f \times N_f$. To
perform the convolution, we take the FFT of the filter and
input feature map, and then perform the multiplication in
the frequency domain; we then apply an inverse FFT to the
resulting product to recover the output feature map in the
spatial domain. However, there are several drawbacks to using
FFT: (1) the benefits of FFTs decrease with filter size; (2) the
size of the FFT is dictated by the output feature map size which
is often much larger than the filter; (3) the coefficients in the
frequency domain are complex. As a result, while FFT reduces
computation, it requires larger storage capacity and bandwidth.
Finally, a popular approach for reducing complexity is to make
the weights sparse, which will be discussed in Section VII-B2;
using FFTs makes it difficult for this sparsity to be exploited.
Several optimizations can be performed on FFT to make it
more effective for DNNs. To reduce the number of operations,
the FFT of the filter can be precomputed and stored. In addition,
the FFT of the input feature map can be computed once and
used to generate multiple channels in the output feature map.
Finally, since an image contains only real values, its Fourier
Transform is symmetric and this can be exploited to reduce
storage and computation cost.
Other approaches include Strassen [75] and Winograd [76],
which rearrange the computation such that the number of
multiplications reduces from $O(N^3)$ to $O(N^{2.807})$ and by 2.25x
for a 3x3 filter, respectively, at the cost of reduced numerical
stability, increased storage requirements, and specialized
processing depending on the size of the filter.
In practice, different algorithms might be used for different
layer shapes and sizes (e.g., FFT for filters greater than 5x5,
and Winograd for filters 3x3 and below). Existing platform
libraries, such as MKL and cuDNN, dynamically choose the
appropriate algorithm for a given shape and size [77, 78].
B. Energy-Efficient Dataflow for Accelerators
For DNNs, the bottleneck for processing is in the memory
access. Each MAC requires three memory reads (for filter
weight, fmap activation, and partial sum) and one memory
write (for the updated partial sum) as shown in Fig. 21. In the
worst case, all of the memory accesses have to go through the
off-chip DRAM, which will severely impact both throughput
and energy efficiency. For example, in AlexNet, to support its
724M MACs, nearly 3000M DRAM accesses will be required.
Furthermore, DRAM accesses require up to several orders of
magnitude higher energy than computation [79].
<<FIGURE>>
Fig. 21. Read and write access per MAC.
Accelerators, such as spatial architectures as shown in
Fig. 17, provide an opportunity to reduce the energy cost of
data movement by introducing several levels of local memory
hierarchy with different energy cost as shown in Fig. 22. This
includes a large global buffer with a size of several hundred
kilobytes that connects to DRAM, an inter-PE network that
can pass data directly between the ALUs, and a register file
(RF) within each processing element (PE) with a size of a
few kilobytes or less. The multiple levels of memory hierarchy
help to improve energy efficiency by providing low-cost data
accesses. For example, fetching the data from the RF or
neighbor PEs is going to cost 1 or 2 orders of magnitude
lower energy than from DRAM.
Accelerators can be designed to support specialized processing
dataflows that leverage this memory hierarchy. The dataflow
decides what data gets read into which level of the memory
hierarchy and when it gets processed. Since there is
no randomness in the processing of DNNs, it is possible to
design a fixed dataflow that can adapt to the DNN shapes and
sizes and optimize for the best energy efficiency. The optimized
dataflow minimizes access from the more energy consuming
levels of the memory hierarchy. Large memories that can store
a significant amount of data consume more energy than smaller
memories. For instance, DRAM can store gigabytes of data, but
consumes two orders of magnitude higher energy per access
than a small on-chip memory of a few kilobytes. Thus, every
time a piece of data is moved from an expensive level to a
lower cost level in terms of energy, we want to reuse that piece
of data as much as possible to minimize subsequent accesses
to the expensive levels. The challenge, however, is that the
storage capacity of these low cost memories is limited. Thus
we need to explore different dataflows that maximize reuse
under these constraints.
<<FIGURE>>
Fig. 22. Memory hierarchy and data movement energy [80].
For DNNs, we investigate dataflows that exploit three forms
of input data reuse (convolutional, feature map and filter) as
shown in Fig. 23. For convolutional reuse, the same input
feature map activations and filter weights are used within
a given channel, just in different combinations for different
weighted sums. For feature map reuse, multiple filters are
applied to the same feature map, so the input feature map
activations are used multiple times across filters. Finally, for
filter reuse, when multiple input feature maps are processed at
once (referred to as a batch), the same filter weights are used
multiple times across input feature maps.
<<FIGURE>>
Fig. 23. Data reuse opportunities in DNNs [80].
If we can harness the three types of data reuse by storing
the data in the local memory hierarchy and accessing them
multiple times without going back to the DRAM, it can save
a significant number of DRAM accesses. For example, in
AlexNet, the number of DRAM reads can be reduced by up to
500x in the CONV layers. The local memory can also be used
for partial sum accumulation, so they do not have to reach
DRAM. In the best case, if all data reuse and accumulation
can be achieved by the local memory hierarchy, the 3000M
DRAM accesses in AlexNet can be reduced to only 61M.
The operation of DNN accelerators is analogous to that of
general-purpose processors as illustrated in Fig. 24 [81]. In
conventional computer systems, the compiler translates the
program into machine-readable binary codes for execution
given the hardware architecture (e.g., x86 or ARM); in the
processing of DNNs, the mapper translates the DNN shape
and size into a hardware-compatible computation mapping
for execution given the dataflow. While the compiler usually
optimizes for performance, the mapper optimizes for energy
efficiency.
<<FIGURE>>
Fig. 24. An analogy between the operation of DNN accelerators (texts in
black) and that of general-purpose processors (texts in red). Figure adopted
from [81].
The following taxonomy (Fig. 25) can be used to classify
the DNN dataflows in recent works [82-93] based on their
data handling characteristics [80]:
1) Weight stationary (WS): The weight stationary dataflow
is designed to minimize the energy consumption of reading
weights by maximizing the accesses of weights from the register
file (RF) at the PE (Fig. 25(a)). Each weight is read from
DRAM into the RF of each PE and stays stationary for further
accesses. The processing runs as many MACs that use the
same weight as possible while the weight is present in the RF;
it maximizes convolutional and filter reuse of weights. The
inputs and partial sums must move through the spatial array
and global buffer. The input fmap activations are broadcast to
all PEs and then the partial sums are spatially accumulated
across the PE array.
One example of previous work that implements the weight
stationary dataflow is nn-X, or neuFlow [85], which uses
eight 2-D convolution engines for processing a 10x10 filter.
There are a total of 100 MAC units, i.e. PEs, per engine with each
PE having a weight that stays stationary for processing. The
<<FIGURE>>
input fmap activations are broadcast to all MAC units and the partial sums are accumulated across
the MAC units. In order to accumulate the partial sums correctly, additional delay storage elements
are required, which are counted into the required size of local storage. Other weight stationary
examples are found in [82-84, 86, 87].
Fig. 25. Dataflows for DNNs [80]. (b) Output Stationary
Fig. 26. Variations of output stationary [80].
2) Output stationary (OS): The output stationary dataflow is designed to minimize the energy
consumption of reading and writing the partial sums (Fig. 25(b)). It keeps the accumulation of
partial sums for the same output activation value local in the RF. In order to keep the
accumulation of partial sums stationary in the RF, one common implementation is to stream the input
activations across the PE array and broadcast the weight to all PEs in the array.
One example that implements the output stationary dataflow is ShiDianNao [89], where each PE
handles the processing for each output activation value by fetching the corresponding input
activations from neighboring PEs. The PE array implements dedicated networks to pass data
horizontally and vertically. Each PE also has data delay registers to keep data around for the
required number of cycles. At the system level, the global buffer streams the input activations and
broadcasts the weights into the PE array. The partial sums are accumulated inside each PE and then
get streamed out back to the global buffer. Other examples of output stationary are found in
[88, 90].
There are multiple possible variants of output stationary as shown in Fig. 26 since the output
activations that get processed at the same time can come from different dimensions. For example,
the variant OS_A targets the processing of CONV layers, and therefore focuses on the processing of
output activations from the same channel at a time in order to maximize data reuse opportunities.
The variant OS_C targets the processing of FC layers, and focuses on generating output activations
from all different channels, since each channel only has one output activation. The variant OS_B is
something in between OS_A and OS_C. Examples of variants OS_A, OS_B, and OS_C are [89], [88], and
[90], respectively.
3) No local reuse (NLR): While small register files are efficient in terms of energy (pJ/bit), they
are inefficient in terms of area (<<FORMULA>>). In order to maximize the storage capacity, and
minimize the off-chip memory bandwidth, no local storage is allocated to the PE and instead all
that area is allocated to the global buffer to increase its capacity (Fig. 25(c)). The no local
reuse dataflow differs from the previous dataflows in that nothing stays stationary inside the PE
array. As a result, there will be increased traffic on the spatial array and to the global buffer
for all data types. Specifically, it has to multicast the activations, single-cast the filter
weights, and then spatially accumulate the partial sums across the PE array.
In an example of the no local reuse dataflow from UCLA [91], the filter weights and input
activations are read from the global buffer, processed by the MAC units with custom adder trees
that can complete the accumulation in a single cycle, and the resulting partial sums or output
activations are then put back to the global buffer. Another example is DianNao [92], which also
reads input activations and filter weights from the buffer, and processes them through the MAC
units with custom adder trees. However, DianNao implements specialized registers to keep the
partial sums in the PE array, which helps to further reduce the energy consumption of accessing
partial sums. Another example of no local reuse dataflow is found in [93].
4) Row stationary (RS): A row stationary dataflow is proposed in [80], which aims to maximize the
reuse and accumulation at the RF level for all types of data (weights, pixels, partial sums) for
the overall energy efficiency. This differs from WS or OS dataflows, which optimize for only
weights and partial sums, respectively.
The row stationary dataflow assigns the processing of a 1-D row convolution into each PE for
processing as shown in Fig. 27. It keeps the row of filter weights stationary inside the RF of the
PE and then streams the input activations into the PE. The PE does the MACs for each sliding window
at a time, which uses just one memory space for the accumulation of partial sums. Since there are
overlaps of input activations between different sliding windows, the input activations can then be
kept in the RF and get reused. By going through all the sliding windows in the row, it completes
the 1-D convolution and maximizes the data reuse and local accumulation of data in this row.
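As a rough illustration of the 1-D row convolution that the row stationary dataflow assigns to a
single PE, the Python sketch below holds one filter row in a list that stands in for the RF,
streams the input row in once, and reuses overlapping activations through a sliding window. The
function name `pe_row_conv` and the use of a `deque` as the RF window are assumptions made for this
example; they are not taken from [80].

```python
from collections import deque


def pe_row_conv(filter_row, input_row):
    """Sketch of one PE computing a 1-D row convolution (hypothetical helper)."""
    rf_weights = list(filter_row)                 # filter row stays stationary in the RF
    rf_window = deque(maxlen=len(rf_weights))     # overlapping inputs are kept in the RF
    outputs = []
    for activation in input_row:                  # each input activation is streamed in once
        rf_window.append(activation)              # the oldest activation drops out automatically
        if len(rf_window) == len(rf_weights):     # a full sliding window is available
            psum = 0                              # a single accumulator for the partial sum
            for weight, x in zip(rf_weights, rf_window):
                psum += weight * x                # one MAC per weight in the window
            outputs.append(psum)
    return outputs
```

For a filter row of length three and an input row of length five, the sketch produces three
outputs, one per sliding window, while every input activation is read from outside the PE only once.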
<<FIGURE>>
Fig. 27. 1-D convolutional reuse within PE for Row Stationary Dataflow [80].
<<FIGURE>>
Fig. 28. 2-D convolutional reuse within spatial array for Row Stationary Dataflow.
<<FIGURE>>
Fig. 29. Multiple rows of different input feature maps, filters and channels are mapped to same PE
within array for additional reuse in the Row Stationary Dataflow.
Fig. 30. Mapping optimization takes in hardware and DNN shape constraints to determine optimal
energy dataflow [80].
shown in Fig. 28. For example, to generate the first row of output activations with a filter having
three rows, three 1-D convolutions are required. Therefore, we can use three PEs in a column, each
running one of the three 1-D convolutions. The partial sums are further accumulated vertically
across the three PEs to generate the first output row. To generate the second row of output, we use
another column of PEs, where three rows of input activations are shifted down by one row, and use
the same rows of filters to perform the three 1-D convolutions. Additional columns of PEs are added
until all rows of the output are completed (i.e., the number of PE columns equals the number of
output rows).
This 2-D array of PEs enables other forms of reuse to reduce accesses to the more expensive global
buffer. For example, each filter row is reused across multiple PEs horizontally. Each row of input
activations is reused across multiple PEs diagonally. And each row of partial sums is further
accumulated across the PEs vertically. Therefore, 2-D convolutional data reuse and accumulation are
maximized inside the 2-D PE array.
To address the high-dimensional convolution of the CONV layer (i.e., multiple fmaps, filters, and
channels), multiple rows can be mapped onto the same PE as shown in Fig. 29. The 2-D convolution is
mapped to a set of PEs, and the additional dimensions are handled by interleaving or concatenating
the additional data. For filter reuse within the PE, different rows of fmaps are concatenated and
run through the same PE as a 1-D convolution. For input fmap reuse within the PE, different filter
rows are interleaved and run through the same PE as a 1-D convolution. Finally, to increase local
partial sum accumulation within the PE, filter rows and fmap rows from different channels are
interleaved, and run through the same PE as a 1-D convolution. The partial sums from different
channels then naturally get accumulated inside the PE.
The number of filters, channels, and fmaps that can be processed at the same time is programmable,
and there exists an optimal mapping for the best energy efficiency, which depends on the shape
configuration of the DNN as well as the hardware resources provided, e.g., the number of PEs and
the size of the memory in the hierarchy. Since all of the variables are known before runtime, it is
possible to build a compiler (i.e., mapper) to perform this optimization off-line to configure the
hardware for different mappings of the RS dataflow for different DNNs as shown in Fig. 30.
One example that implements the row stationary dataflow is Eyeriss [94]. It consists of a 14x12 PE
array, a 108KB global buffer, ReLU and fmap compression units as shown in Fig. 31. The chip
communicates with the off-chip DRAM using a 64-bit bidirectional data bus to fetch data into the
global buffer. The global buffer then streams the data into the PE array for processing.
In order to support the RS dataflow, two problems need to be solved in the hardware design. First,
how can the fixed-size PE array accommodate different layer shapes? Second, although the data will
be passed in a very specific pattern, it still changes with different shape configurations. How can
the fixed design
pass data in different patterns?
<<FIGURE>>
Fig. 32. Mapping uses replication and folding to maximize utilization of the PE array [94].
Two mapping strategies can be used to solve the first problem as shown in Fig. 32. First,
replication can be used to map shapes that do not use up the entire PE array. For example, in the
third to fifth layers of AlexNet, each 2-D convolution only uses a 13x3 PE array. This structure is
then replicated four times, and runs different channels and filters in each replication. The second
strategy is called folding. For example, in the second layer of AlexNet, it requires a 27x5 PE
array to complete the 2-D convolution. In order to fit it into the 14x12 physical PE array, it is
folded into two parts, 14x5 and 13x5, and each is vertically mapped into the physical PE array.
Since not all PEs are used by the mapping, the unused PEs can be clock gated to save energy
consumption.
A custom multicast network is used to solve the second problem about flexible data delivery. The
simplest way to pass data to multiple destinations is to broadcast the data to all PEs and let each
PE decide if it has to process the data or not. However, it is not very energy efficient especially
when the size of PE array is large. Instead, a multicast network is used to send data to only the
places where it is needed.
5) Energy comparison of different dataflows: To evaluate and compare different dataflows, the same
total hardware area and number of PEs (256) are used in the simulation of a spatial architecture
for all dataflows. The local memory (register file) at each processing element (PE) is on the order
of 0.5-1.0 kB and a shared memory (global buffer) is on the order of 100-500 kB. The sizes of these
memories are selected to be comparable to a typical accelerator for multimedia processing, such as
video coding [95]. The memory sizes are further adjusted for the needs of each dataflow under the
same area constraint. For example, since the no local reuse dataflow does not require any RF in PE,
it is allocated with a much larger global buffer. The simulation uses the layer configurations from
AlexNet with a batch size of 16. The simulation also takes into account the fact that accessing
different levels of the memory hierarchy requires different energy cost.
of each dataflow for the CONV layers of AlexNet with a batch size of 16. The WS and OS dataflows
have the lowest energy consumption for accessing weights and partial sums, respectively. However,
the RS dataflow has the lowest total energy consumption since it optimizes for the overall energy
efficiency instead of only for a certain data type.
Fig. 33(a) shows the same results with breakdown in terms of memory hierarchy. The RS dataflow
consumes the most energy in the RF, since by design most of the accesses have been moved to the
lowest level of the memory hierarchy. This helps to achieve the lowest total energy consumption
since RF has the lowest energy per access. The NLR dataflow has the lowest energy consumption at
the DRAM level, since it has a much larger global buffer and thus higher on-chip storage capacity
compared to others. However, most of the data accesses in the NLR dataflow are from the global
buffer, which still has a relatively large energy consumption per access compared to accessing data
from RF or inside the PE array. As a result, the overall energy consumption of the NLR dataflow is
still fairly high. Overall, RS dataflow uses 1.4× to 2.5× lower energy than other dataflows.
Fig. 34 shows the energy efficiency between different dataflows in the FC layers of AlexNet with a
batch size of 16. Since there is not as much data reuse in the FC layers as in the CONV layers, all
dataflows spend a significant amount of energy on reading weights. However, RS dataflow still has
the lowest energy consumption because it optimizes for the energy of accessing input activations
and partial sums. For the OS dataflows, OS_C now consumes lower energy than OS_A since it is
designed for the FC layers. Overall, RS still consumes 1.3× lower energy compared to other
dataflows at the batch size of 16.
Fig. 35 shows the RS dataflow design with energy breakdown in terms of different layers of AlexNet.
In the CONV layers, the energy is mostly consumed by the RF, while in the FC layers, the energy is
mostly consumed by DRAM. However, most of the energy is consumed by the CONV layers, which take
around 80% of the energy. As recent DNN models go deeper with more CONV layers, the ratio between
number of CONV and FC layers only gets larger. Therefore, moving forward, significant effort should
be placed on energy optimizations for CONV layers.
Finally, up until now, we have been looking at architectures with relatively limited storage on the
order of a few hundred kilobytes. With much larger storage on the order of a few megabytes,
additional dataflows can be considered. For example, Fused-Layer looks at dataflow optimizations
across layers [96].
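The replication and folding strategies described in this section for the 14x12 PE array can be
checked with a little arithmetic. The Python sketch below is only an illustration: the helper name
`map_to_pe_array`, the reading of "14x12" as 14 columns by 12 rows, and the greedy way of cutting a
wide set into segments are all assumptions of this example, not the mapper used in [94].

```python
def map_to_pe_array(set_cols, set_rows, phys_cols=14, phys_rows=12):
    """Place one logical PE set (set_cols x set_rows) on the physical array.

    Returns (segment_widths, replicas, active_pes). Hypothetical helper.
    """
    # Folding: cut a set that is too wide into segments no wider than the array.
    segment_widths = []
    remaining = set_cols
    while remaining > 0:
        width = min(remaining, phys_cols)
        segment_widths.append(width)
        remaining -= width

    # The folded segments are stacked vertically in the physical array.
    rows_needed = len(segment_widths) * set_rows
    if rows_needed > phys_rows:
        raise ValueError("logical set does not fit in a single pass")

    # Replication: stack as many copies of the set as the remaining rows allow.
    replicas = phys_rows // rows_needed
    active_pes = replicas * set_rows * sum(segment_widths)
    return segment_widths, replicas, active_pes
```

Under these assumptions, `map_to_pe_array(13, 3)` reproduces the four-fold replication quoted for
the third to fifth layers of AlexNet, and `map_to_pe_array(27, 5)` reproduces the 14x5 and 13x5
folding of the second layer. The PEs left over out of the 168 available are the ones that would be
clock gated.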
<<FORMULA>>
Fig. 35. Energy breakdown across layers of the AlexNet [80]. RF energy dominates in convolutional
layers. DRAM energy dominates in the fully connected layers. Convolutional layers dominate energy
consumption.
Fig. 33. Comparison of energy efficiency between different dataflows in the CONV layers of AlexNet
with a batch size of 16 [3]: (a) breakdown in terms of storage levels and ALU, (b) breakdown in
terms of data types. OS_A, OS_B and OS_C are three variants of the OS dataflow that are commonly
seen in different implementations [80].

VI. NEAR-DATA PROCESSING
The previous section highlighted that data movement dominates energy consumption. While spatial
architectures distribute the on-chip memory such that it is closer to the computation (e.g., into
the PE), there have also been efforts to bring the off-chip high density memory closer to the
computation or to integrate the computation into the memory itself; the latter is often referred to
as processing-in-memory or logic-in-memory. In embedded systems, there have also been efforts to
bring the computation into the sensor where the data is first collected.
In this section, we will discuss how moving compute and data closer to reduce data movement (i.e.,
near-data processing) can be achieved using mixed-signal circuit design and advanced memory
technologies.
Many of these works use analog processing which has the drawback of increased sensitivity to
circuit and device non-idealities. Consequently, the computation is often performed at reduced
precision, which can be accounted for during the training of the DNNs using the techniques
discussed in Section VII. Another factor to take into consideration is that DNNs are often trained
in the digital domain; thus for analog processing, there is an additional overhead cost for
analog-to-digital conversion (ADC) and digital-to-analog conversion (DAC).
A. DRAM
Advanced memory technology can reduce the access energy for high density memories such as DRAMs.
For instance, embedded DRAM (eDRAM) brings high density memory on-chip to avoid the high energy
cost of switching off-chip capacitance [97]; eDRAM is 2.85× higher density than SRAM and 32% more
energy efficient than DRAM (DDR3) [93]. eDRAM also offers higher bandwidth and lower latency
compared to DRAM. In DNN processing, eDRAM can be used to store tens of megabytes of weights and
activations on-chip to avoid off-chip access, as demonstrated in DaDianNao [93].
off-chip DRAM and can increase the cost of the chip.
Rather than integrating DRAM into the chip itself, the DRAM can also be stacked on top of the chip
using through silicon vias (TSV). This technology is often referred to as 3-D memory, and has been
commercialized in the form of Hybrid Memory Cube (HMC) [98] and High Bandwidth Memory (HBM) [99].
3-D memory delivers an order of magnitude higher bandwidth and reduces access energy by up to 5×
relative to existing 2-D DRAMs, as TSVs have lower capacitance than typical off-chip interconnects.
Recent works have explored the use of HMC for efficient DNN processing in a variety of ways. For
instance, Neurocube [100] integrates SIMD processors into the logic die of the HMC to bring the
memory and computation
closer together. Tetris [101] explores the use of HMC with the Eyeriss spatial architecture and row
stationary dataflow. It proposes allocating more area to computation than on-chip memory (i.e.,
larger PE array and smaller global buffer) in order to exploit the low energy and high throughput
properties of the HMC. It also adapts the dataflow to account for the HMC memory and smaller
on-chip memory. Tetris achieves a 1.5× reduction in energy consumption and 4.1× increase in
throughput over a baseline system with conventional 2-D DRAM.
<<FIGURE>>
Fig. 36. Analog computation by (a) SRAM bit-cell and (b) non-volatile resistive memory.
B. SRAM
Rather than bringing the memory near the compute, recent work has also investigated bringing the
compute into the memory. For instance, the multiply and accumulate operation can be directly
integrated into the bit-cells of an SRAM array [102], as shown in Fig. 36(a). In this work, a 5-bit
DAC is used to drive the word line (WL) to an analog voltage that represents the feature vector,
while the bit-cells store the binary weights (±1). The bit-cell current (I_BC) is effectively a
product of the value of the feature vector and the value of the weight stored in the bit-cell; the
currents from the bit-cells within a column add together to discharge the bitline (V_BL). This
approach gives 12× energy savings compared to reading the 1-bit weights from the SRAM and
performing the computation separately. To counter circuit non-idealities, the DAC accounts for the
non-linear bit-line discharge with respect to the WL voltage, and boosting is used to combine the
weak classifiers that are susceptible to device variations to form a strong classifier [103].
C. Non-volatile Resistive Memories
The multiply and accumulate operation can also be directly integrated into advanced non-volatile
high density memories by using them as programmable resistive elements, commonly referred to as
memristors [105]. Specifically, a multiplication is performed with the resistor's conductance as
the weight, the voltage as the input, and the current as the output as shown in Fig. 36(b).
8 The resistive devices can be inserted between the cross-point of two wires and in certain cases
can avoid the need for an access transistor.
Processing with non-volatile resistive memories has several drawbacks as described in [108]. First,
it suffers from the reduced precision and ADC/DAC overhead of analog processing described earlier.
Second, the array size is limited by the wires that connect the resistive devices; specifically,
wire energy dominates for large arrays (e.g., 1k×1k), and the IR drop along the wire can degrade
the read accuracy. Third, the write energy to program the resistive devices can be costly, in some
cases requiring multiple pulses. Finally, the resistive devices can also suffer from
device-to-device and cycle-to-cycle variations with non-linear conductance across the conductance
range.
There have been several recent works that explore the use of memristors for DNNs. ISAAC [104]
replaces the eDRAM in DaDianNao with memristors. To address the limited precision support, ISAAC
computes a 16-bit dot product operation with 8 memristors each storing 2-bits; a 1-bit×2-bit
multiplication is performed at each memristor, where a 16-bit input requires 16 cycles to complete.
In other words, the ISAAC architecture trades off area and time for increased precision. Finally,
ISAAC arranges its 25.1M memristors in a hierarchical structure to avoid issues with large arrays.
PRIME [109] also replaces the DRAM main memory with memristors; specifically, it uses 256×256
memristor arrays that can be configured for 4-bit multi-level cell computation or 1-bit single
level cell storage. It should be noted that results from ISAAC and PRIME are obtained from
simulations. The task of actually fabricating large memristor arrays is still very much a research
challenge; for instance, [110] uses a fabricated 12×12 memristor array to demonstrate a linear
classifier.
D. Sensors
In certain applications, such as image processing, the data movement from the sensor itself can
account for a significant portion of the system energy consumption. Thus there has also been
research on performing the computation as close as possible to the sensor. In particular, much of
the work focuses on moving the computation into the analog domain to avoid using the ADC within the
sensor, which accounts for a significant portion of the sensor power. However, as mentioned
earlier, lower precision is required for analog computation due to circuit non-idealities.
In [111], the matrix multiplication is integrated into the ADC, where the most significant bits of
the multiplications are performed using switched capacitors in an 8-bit successive approximation
format. This is extended in [112] to not only perform the multiplications, but also the
accumulations in the analog domain. In this work, it is assumed that 3-bits and 6-bits are
sufficient to represent the weights and activations, respectively. This reduces the number of ADC
conversions in the sensor by 21×. RedEye [113] takes this approach even further by performing the
entire convolution layer (including convolution, max pooling and quantization) in the analog domain
at the sensor. It should be noted that [111] and [112] report measured results from fabricated test
chips, while results in [113] are from simulations.
<<FIGURE>>
Fig. 37. Various methods of quantization (Figures from [117, 118]).
It is also feasible to embed the computation not just before the ADC, but into the sensor itself.
For instance, in [114] an Angle Sensitive Pixels sensor is used to compute the gradient of the
input, which along with compression, reduces the data movement from the sensor by 10×. In addition,
since the first layer of the DNN often outputs a gradient-like feature map, it may be possible to
skip the computations in the first layer, which further reduces energy consumption as discussed in
[115, 116].

VII. CO-DESIGN OF DNN MODELS AND HARDWARE
In earlier work, the DNN models were designed to maximize accuracy without much consideration of
the implementation complexity. However, this can lead to designs that are challenging to implement
and deploy. To address this, recent work has shown that DNN models and hardware can be co-designed
to jointly maximize accuracy and throughput, while minimizing energy and cost, which increases the
likelihood of adoption. In this section, we will highlight various efforts that have been made
towards the co-design of DNN models and hardware. Note that unlike Section V, the techniques
discussed in this section can affect the accuracy; thus, the goal is to not only substantially
reduce energy consumption and increase throughput, but also to minimize any degradation in
accuracy. The co-design approaches can be loosely grouped into the following categories:
• Reduce precision of operations and operands. This includes going from floating point to fixed
point, reducing the bitwidth, non-linear quantization and weight sharing.
• Reduce number of operations and model size. This includes techniques such as compression, pruning
and compact network architectures.
A. Reduce Precision
Quantization involves mapping data to a smaller set of quantization levels. The ultimate goal is to
minimize the error between the reconstructed data from the quantization levels and the original
data. The number of quantization levels reflects the precision and ultimately the number of bits
required to represent the data (usually log2 of the number of levels); thus, reduced precision
refers to reducing the number of levels, and thus the number of bits. The benefits of reduced
precision include reduced storage cost and/or reduced computation requirements.
There are several ways to map the data to quantization levels. The simplest method is a linear
mapping with uniform distance between each quantization level (Fig. 37(a)). Another approach is to
use a simple mapping function such as a log function (Fig. 37(b)) where the distance between the
levels varies; this mapping can often be implemented with simple logic such as a shift.
Alternatively, a more complex mapping function can be used where the quantization levels are
determined or learned from the data (Fig. 37(c)), e.g., using k-means clustering; for this
approach, the mapping is usually implemented with a look up table.
Finally, the quantization can be fixed (i.e., the same method of quantization is used for all data
types and layers, filters, and channels in the network); or it can be variable (i.e., different
methods of quantization can be used for weights and activations, and different layers, filters, and
channels in the network).
Reduced precision research initially focused on reducing the precision of the weights rather than
the activations, since weights directly increase the storage capacity requirement, while the impact
of activations on storage capacity depends on the network architecture and dataflow. However, more
recent works have also started to look at the impact of quantization on activations. Most reduced
precision research also focuses on reducing the precision for inference rather than training (with
some exceptions [88,119,120]) due to the sensitivity of the gradients to quantization.
The key techniques used in recent work to reduce precision are summarized in Table III; both linear
and non-linear quantization applied to weights and activations are explored. The impact on accuracy
is reported relative to a baseline precision of 32-bit floating point, which is the default
precision used on platforms such as GPUs and CPUs.
1) Linear quantization: The first step of reducing precision is usually to convert values and
operations from floating point to fixed point. A 32-bit floating point number, as shown in
Fig. 38(a), is represented by <<FORMULA>>, where s
is the sign bit, e is the 8-bit exponent, and m is the 23-bit mantissa, and covers the range of
<<FORMULA>>. An N-bit fixed point number is represented by $(-1)^s \times m \times 2^{-f}$, where s
is the sign bit, m is the (N-1)-bit mantissa, and f determines the location of the decimal point
and acts as a scale factor. For instance, for an 8-bit integer, when f = 0, the dynamic range is
-128 to 127, whereas when f = 10, the dynamic range is -0.125 to 0.124023438. Dynamic fixed point
representation allows f to vary based on the desired dynamic range as shown in Fig. 38(b). This is
useful for DNNs, since the dynamic range of the weights and activations can be quite different. In
addition, the dynamic range can also vary across layers and layer types (e.g., convolutional vs.
fully connected). Using dynamic fixed point, the bitwidth can be reduced to 8 bits for the weights
and 10 bits for the activations without any fine-tuning of the weights [121]; with fine-tuning,
both weights and activations can reach 8-bits [122].
Fig. 38. Various methods of number representations. (b) 8-bit dynamic fixed point examples
Using 8-bit fixed point has the following impact on energy and area [79]:
• An 8-bit fixed point add consumes 3.3× less energy (3.8× less area) than a 32-bit fixed point
add, and 30× less energy (116× less area) than a 32-bit floating point add. The energy and area of
a fixed-point add scale approximately linearly with the number of bits.
• An 8-bit fixed point multiply consumes 15.5× less energy (12.4× less area) than a 32-bit fixed
point multiply, and 18.5× less energy (27.5× less area) than a 32-bit floating point multiply. The
energy and area of a fixed-point multiply scale approximately quadratically with the number of
bits.
Reducing the precision also reduces the energy and area cost for storage, which is important since
memory access and data movement dominate energy consumption as described earlier. The energy and
area of the memory scale approximately linearly with number of bits. It should be noted, however,
that changing from floating point to fixed point, without reducing bit-width, does not reduce the
energy or area cost of the memory.
For completeness, it should be noted that the precision of the internal values of a fixed-point
multiply and accumulate (MAC) operation is typically higher than that of the weights and
activations. To guarantee no precision loss, weights and input activations with N-bit fixed-point
precision would require an N-bit×N-bit multiplication which generates a 2N-bit output product; that
output would need to be accumulated with <<FORMULA>> bit precision, where M is determined based on
the largest filter size <<FORMULA>> (<<FORMULA>> from Fig. 9(b)), which is in the range of 0 to 16
bits for the popular DNNs described in Section III-B. After accumulation, the precision of the
final output activation is typically reduced to N-bits [88,121], as shown in Fig. 39. The reduced
output precision does not have a significant impact on accuracy if the distributions of the weights
and activations are centered near zero such that the accumulation would not move only in one
direction; this is particularly true when batch normalization is used.
The reduced precision is not only explored in research, but has been used in recent commercial
platforms for DNN processing. For instance, Google's Tensor Processing Unit (TPU), which was
announced in May 2016, was designed for 8-bit integer arithmetic [123]. Similarly, Nvidia's PASCAL
GPU, which was announced in April 2016, also has 8-bit integer instructions for deep learning
inference [124]. In general purpose platforms such as CPUs and GPUs, the main benefit of using
8-bit computation is an increase in throughput, as four 8-bit operations rather than one 32-bit
operation can be performed for a given clock cycle.
While general purpose platforms usually support 8-bit, 16-bit and/or 32-bit operations, it has been
shown that the minimum bit precision for DNNs can actually vary in a more fine grained manner. For
instance, the weight and activation precision can vary between 4 and 9 bits for AlexNet across
different layers without significant impact on accuracy (i.e., a change of less than 1%) [125,126].
This fine-grained variation can be exploited for increased throughput or reduced energy consumption
with specialized hardware. For instance, if bit-serial processing is used, where the number of
clock cycles to complete an operation is proportional to the bitwidth, adapting to fine-grain
variations in bit precision can result in a 2.24× speed up versus 16-bits [125]. Alternatively, a
multiplier can be designed such that its critical path reduces based on the bit precision as fewer
adders are needed to resolve the product; this can be combined with voltage scaling for a 2.56×
energy savings versus 16-bits [126]. While these bit scaling results are reported relative to
16-bit, it would be interesting to see their impact relative to the maximum precision required
across layers (i.e., 9-bits for [125, 126]).
The precision can be reduced even more aggressively to a single bit; this area of research is often
referred to as binary nets. BinaryConnect (BC) [127] introduced the concept of binary weights
(i.e., -1 and 1), where using a binary weight reduced the multiplication in the MAC to addition and
subtraction only. This was later extended in Binarized Neural Networks (BNN) [128] that uses binary
weights and activations, which
<<FIGURE>>
Fig. 40. Weight sharing hardware.
w, where w is the average of the absolute values of the weights in the filter)⁹, keeping the
first and last layers at 32-bit floating point precision, and performing normalization before
convolution to reduce the dynamic range of the activations. With these changes, BWN reduced the
accuracy loss to 0.8%, while XNOR-Nets reduced the loss to 11%. The loss of XNOR-Net can be
further reduced by increasing the precision of the activations to be slightly larger than one
bit. For instance, Quantized Neural Networks (QNN) [119], DoReFa-Net [120], and HWGQ-Net [130]
allow the activations to have 2-bits, while the weights remain at 1-bit; in HWGQ-Net, this
reduces the accuracy loss to 5.2%.
All the previously described binary nets limit the weights to two values (-w and w); however,
there may be benefits for allowing weights to be zero (i.e., -w, 0, w). Although this requires
an additional bit per weight compared to binary weights, the sparsity of the weights can be
exploited to reduce computation and storage cost, which can potentially cancel out the cost of
the additional bit. This is explored in Ternary Weight Nets (TWN) [131] and then extended in
Trained Ternary Quantization (TTQ) where a different scale is trained for each weight (i.e.,
-w1, 0, w2) for an accuracy loss of 0.6% [132], assuming 32-bit floating point for the
activations.
Hardware implementations for binary/ternary nets have been explored in recent publications.
YodaNN [133] uses binary weights, while BRein [134] uses binary weights and activations. Binary
weights are also used in the compute in SRAM work [102] described in Section VI. Finally, the
nominally spike-inspired TrueNorth chip can implement a reduced precision neural network with
binary activations and ternary weights using TrueNorth's quantized weight table [9]. These works
tend not to support state-of-the-art DNN models (with the exception of YodaNN).
2) Non-linear quantization: The previous works described involve linear quantization where the
levels are uniformly spaced out. It has been shown that the distributions of the weights and
activations are not uniform [118,135], and thus a non-linear quantization can potentially
improve accuracy. Specifically, there have been two popular approaches taken in recent works:
(1) log domain quantization; (2) learned quantization or weight sharing.
Log domain quantization: If the quantization levels are assigned based on a logarithmic
distribution as shown in Fig. 37(b), the weights and activations are more equally distributed
across the different levels and each level is used more efficiently resulting in less
quantization error. For instance, using 4 bits in linear quantization results in a 27.8% loss in
accuracy versus a 5% loss for log base-2 quantization for VGG-16 [117]. Furthermore, when
weights are quantized to powers of two, the multiplication can be replaced with a bit-shift
[122,135].¹⁰ Incremental Network Quantization (INQ) can be used to further reduce the loss in
accuracy by dividing the large and small weights into different groups, and then iteratively
quantizing and re-training the weights [136].
Weight Sharing forces several weights to share a single value. This reduces the number of unique
weights in a filter or a layer. One example is to group the weights by using a hashing function
and use one value for each group [137]. Alternatively, the weights can be grouped by the k-means
algorithm [118]. Both the shared weights and the indexes indicating which weight to use at each
position of the filter are stored. This leads to a two step process to fetch the weight: (1)
read the weight index; (2) using the weight index, read the shared weights. This approach can
reduce the cost of reading and storing the weights if the weight index (log2 of the number of
unique weights) is less than the bitwidth of the weight itself. For instance, in Deep
Compression [118], the number of unique weights per layer is reduced to 256 for convolutional
layers and 16 for fully-connected layers in AlexNet, requiring 8-bit and 4-bit weight indexes,
respectively. Assuming there are U unique weights and the size of the filters in the layer is
<<FORMULA>> from Fig. 9(b), there will be energy savings if reading from a CRSM × log2(U)-bit
memory plus a U × 16-bit memory (as shown in Fig. 40) costs less than reading from a
CRSM × 16-bit memory. Note that unlike the previous quantization methods, the weight sharing
approach does not reduce the precision of the MAC computation itself and only reduces the weight
storage requirement.

B. Reduce Number of Operations and Model Size
In addition to reducing the size of each operation or operand (weight/activation), there is also
a significant amount of research on methods to reduce the number of operations and model size.
These techniques can be loosely classified as exploiting activation statistics, network pruning,
network architecture design and knowledge distillation.
1) Exploiting Activation Statistics: As discussed in Section III-A1, ReLU is a popular form of
non-linearity used in DNNs that sets all negative values to zero as shown in Fig. 41(a). As a
result, the output activations of the feature maps after the ReLU are sparse; for instance, the
feature maps in AlexNet have sparsity between 19% to 63% as shown in Fig. 41(b). This sparsity
gives ReLU an implementation advantage over other non-linearities such as sigmoid, etc.

⁹ This can also be thought of as a form of weights sharing, where only two weights are used per
filter.
¹⁰ Note however that multiplications do not account for a significant portion of the total
energy.
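The storage condition for weight sharing described above can be checked with simple arithmetic.
The sketch below compares the bits needed for an index memory plus a shared-weight table against
a dense 16-bit weight memory, and shows the two-step fetch through a codebook. The function
names, the layer size of 1000 weights and the codebook values are hypothetical; only the 256
unique weights and 16-bit width follow the numbers quoted in the text.

```python
def shared_weight_bits(num_weights, num_unique, weight_bits=16):
    """Bits for per-weight indexes plus the table of shared weights."""
    index_bits = (num_unique - 1).bit_length()  # ceil(log2(num_unique)) for num_unique >= 1
    return num_weights * index_bits + num_unique * weight_bits

def dense_weight_bits(num_weights, weight_bits=16):
    """Bits for storing every weight at full precision."""
    return num_weights * weight_bits

# Hypothetical layer with 1000 weights and 256 shared values (8-bit indexes).
shared = shared_weight_bits(1000, 256)
dense = dense_weight_bits(1000)

# Two-step fetch: (1) read the weight index, (2) read the shared weight.
codebook = [-0.5, 0.0, 0.25, 1.0]   # hypothetical shared weights
indices = [3, 0, 0, 2, 1]           # one index per weight position
weights = [codebook[i] for i in indices]
```

For a very small layer the shared-weight table dominates and sharing costs more than dense
storage, for example `shared_weight_bits(100, 256)` exceeds `dense_weight_bits(100)`.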
<<FORMULA>>
TABLE III
METHODS TO REDUCE NUMERICAL PRECISION FOR ALEXNET. ACCURACY MEASURED FOR TOP-5 ERROR ON IMAGENET.
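When weights are quantized to powers of two, as discussed above, a multiplication can be
replaced by a bit-shift. The following sketch illustrates this under stated assumptions: the
exponent is obtained by rounding in the log domain, the exponent range is a hypothetical
choice, and activations are non-negative integers (as after a ReLU). It is not the exact
procedure of [122,135].

```python
import math

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Return (sign, exponent) so that sign * 2**exponent approximates w."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))          # nearest power of two in the log domain
    exp = max(min_exp, min(max_exp, exp))   # clamp to the assumed exponent range
    return sign, exp

def shift_multiply(activation, sign, exp):
    """Multiply a non-negative integer activation by sign * 2**exp using shifts."""
    shifted = activation << exp if exp >= 0 else activation >> -exp
    return sign * shifted

# Hypothetical example: 0.23 is quantized to 2**-2 = 0.25.
sign, exp = quantize_pow2(0.23)
product = shift_multiply(64, sign, exp)   # 64 * 0.25 computed as 64 >> 2
```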
<<FORMULA>>
Fig. 41. Sparsity in activations due to ReLU.
The sparsity can be exploited for energy and area savings using compression, particularly for
off-chip DRAM access which is expensive. For instance, a simple run length coding that involves
signaling non-zero values of 16-bits and then runs of zeros up to 31 can reduce the external
memory bandwidth of the activations by 2.1× and the overall external bandwidth (including
weights) by 1.5× [61].¹¹ In addition to compression, the hardware can also be modified such that
it skips reading the weights and performing the MAC for zero-valued activations to reduce energy
cost by 45% [94]. Rather than just gating the read and MAC computation, the hardware could also
skip the cycle to increase the throughput by 1.37× [138].
The activations can be made to be even more sparse by pruning the low-valued activations. For
instance, if all activations with small values are pruned, this can be translated into an
additional 11% speed up [138] or 2× power reduction [139] with little impact on accuracy.
Aggressively pruning more activations can provide additional throughput improvement at a cost of
reduced accuracy.
2) Network Pruning: To make network training easier, the networks are usually
over-parameterized. Therefore, a large amount of the weights in a network are redundant and can
be removed (i.e., set to zero). This process is called network pruning. Aggressive network
pruning often requires some fine-tuning of the weights to maintain the original accuracy. This
was first proposed in 1989 through a technique called Optimal Brain Damage [140]. The idea was
to compute the impact of each weight on the training loss (discussed in Section II-C), referred
to as the weight saliency. The low-saliency weights were removed and the remaining weights were
fine-tuned; this process was repeated until the desired weight reduction and accuracy were
reached.
In 2015, a similar idea was applied to modern DNNs in [141]. Rather than using the saliency as a
metric, which is too difficult to compute for large-scale DNNs, the pruning was simply based on
the magnitude of the weights. Small weights were pruned and the model was fine-tuned to restore
the accuracy. Without fine-tuning the weights, about 50% of the weights could be pruned. With
fine-tuning, over 80% of the weights were pruned. Overall this approach can reduce the number of
weights in AlexNet by 9× and the number of MACs by 3×. Most of the weight reduction comes from
the fully-connected layers (9.9× for fully-connected layers versus 2.7× for convolutional
layers).
However, the number of weights alone is not a good metric for energy. For instance, in AlexNet,
the number of weights in the fully-connected layers is much larger than in the convolutional
layers; however, the energy of the convolutional layers is much higher than the fully-connected
layers as shown in Fig. 35 [80]. Rather than using the number of weights and MAC operations as
proxies for energy, the pruning of the weights can be directly driven by energy itself [142]. An
energy evaluation method can be used to estimate the DNN energy that accounts for the data
movement from different levels of the memory hierarchy, the number of MACs, and the data
sparsity as shown in Fig. 42; this energy estimation tool is available at [143]. The resulting
energy values for popular DNN models are shown in Fig. 43(a). Energy-aware pruning

¹¹ This simple run length compression is within 5-10% of the theoretical entropy limit.
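The run length coding of sparse activations mentioned above can be sketched as follows. Each
code word pairs a run of zeros (at most 31, so it fits in 5 bits) with one 16-bit value. The
exact packing used in [61] may differ; the handling of runs longer than 31 and of trailing
zeros (emitting a pair whose value is zero) is an assumption made here so that the example is
lossless. All names and the example data are hypothetical.

```python
MAX_RUN = 31  # largest run of zeros that fits in a 5-bit field

def rlc_encode(values):
    """Encode a list of integers as (zero_run, value) pairs."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < MAX_RUN:
            run += 1
        else:
            # Either a non-zero value, or a zero emitted because the run saturated.
            pairs.append((run, v))
            run = 0
    if run > 0:
        # Trailing zeros: (run - 1) zeros followed by an explicit zero value.
        pairs.append((run - 1, 0))
    return pairs

def rlc_decode(pairs):
    """Invert rlc_encode."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out

# Hypothetical sparse activations, e.g. after a ReLU.
data = [0, 0, 0, 5, 0, 7, 0, 0, 0, 0, 9, 0, 0]
pairs = rlc_encode(data)
encoded_bits = len(pairs) * (5 + 16)  # 5-bit run plus 16-bit value per pair
raw_bits = len(data) * 16
```

In this small example the encoded stream needs fewer bits than the raw 16-bit activations; the
achievable ratio depends on the sparsity of the data.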
<<FIGURE>>
Fig. 42. Energy estimation methodology from [142], which estimates the
energy based on data movement from different levels of the memory hierarchy,
<<FIGURE>>
Fig. 43. Energy values estimated with methodology in [142].
can then be used to prune weights based on energy to reduce the overall energy across all layers
by 3.7× for AlexNet, which is 1.74× more efficient than magnitude-based approaches [141] as
shown in Fig. 43(b). As mentioned previously, it is well known that AlexNet is
over-parameterized. The energy-aware pruning can also be applied to GoogLeNet, which is already
a small DNN model, for a 1.6× energy reduction.
Recent works have examined how to efficiently support processing of sparse weights in hardware.
One area of interest is how to best store the sparse weights after pruning. Similar to
compressing the sparse activations discussed in Section VII-B1, the sparse weights can be
compressed to reduce memory access bandwidth by 20 to 30% [118].
When DNN processing is performed as a matrix-vector

a time [144]. The CSC format will provide an overall lower memory bandwidth than CSR if the
output is smaller than the input, or in the case of DNN, if the number of filters is not
significantly larger than the number of weights in the filter (<<FORMULA>> from Fig. 9(b)).
Since this is often true, CSC can be an effective format for sparse DNN processing.
Custom hardware has been explored to efficiently support pruned DNN models. Many works aim to
perform the processing without decompressing the weights or activations. EIE [145] performs the
sparse matrix-vector multiplication specifically for fully connected layers. It stores the
weights in a CSC format along with the start location of each column, which needs to be stored
since the compressed weights have variable length. When the input is not zero, the compressed
weight column is read and the output is updated. To handle the sparsity, additional logic is
used to keep track of the location of the output that should be updated. SCNN [146] supports
processing of convolutional
layers in a compressed format. It uses an input stationary dataflow to deliver the compressed
weights and activations to a multiplier array followed by a scatter network to add the scattered
partial sums.
Recent works have also explored the use of structured pruning to avoid the need for custom
hardware [147,148]. Rather than pruning individual weights (also referred to as fine-grained
pruning), structured pruning involves pruning groups of weights (also referred to as
coarse-grained pruning). The benefits of structured pruning are (1) the resulting weights can
better align with the data-parallel architecture (e.g., SIMD) found in existing general purpose
hardware, which results in more efficient processing [149]; (2) it amortizes the overhead cost
required to signal the location of the non-zero weights across a group of weights, which
improves compression and thus reduces storage cost. These groups of weights can include a pair
of neighboring weights, an entire row or column of a filter, an entire channel of a filter or
the entire filter itself; using larger groups tends to result in higher loss in accuracy [150].
3) Compact Network Architectures: The number of weights and operations can also be reduced by
improving the network architecture itself. The trend is to replace a large filter with a series
of smaller filters, which have fewer weights in total; when the filters are applied
sequentially, they achieve the same overall effective receptive field (i.e., the region the
filter uses from input image to compute an output). This approach can be applied during the
network architecture design (before training) or by decomposing the filters of a trained network
(after training). The latter avoids the hassle of training networks from scratch. However, it is
less flexible than the former. For example, existing methods can only decompose a filter in a
trained network into a series of filters without non-linearity between them.
a) Before Training: In recent DNN models, filters with a smaller width and height are used more
frequently because concatenating several of them can emulate a larger filter as shown in
Fig. 13. For example, one 5x5 convolution can be replaced with two 3x3 convolutions.
Alternatively, one NxN convolution can be decomposed into two 1-D convolutions, one 1xN and one
Nx1 convolution [53]; this basically imposes a restriction that the 2-D filter must be
separable, which is a common constraint in image processing [151]. Similarly, a 3-D convolution
can be replaced by a set of 2-D convolutions (i.e., applied only on one of the input channels)
followed by 1x1 3-D convolutions as demonstrated in Xception [152] and MobileNets [153]. The
order of the 2-D convolutions and 1x1 3-D convolutions can be switched.
1x1 convolutional layers can also be used to reduce the number of channels in the output feature
map for a given layer, which reduces the number of filter channels and thus computation cost for
the filters in the next layer as demonstrated in [15,51,52]; this is often referred to as a
bottleneck as discussed in Section III-B. For this purpose, the number of 1x1 filters has to be
less than the number of channels in the 1x1 filter. For example, 32 filters of 1x1x64 can
transform an input with 64 channels to an output of 32 channels and reduce the number of filter
channels in the next layer to 32. SqueezeNet uses many 1x1 filters to aggressively reduce the
number of weights [154]. It proposes a fire module that first squeezes the network with 1x1
convolution filters and then expands it with multiple 1x1 and 3x3 convolution filters. It
achieves an overall 50× reduction in number of weights compared to AlexNet, while maintaining
the same accuracy. It should be noted, however, that reducing the number of weights does not
necessarily reduce energy; for instance, SqueezeNet consumes more energy than AlexNet, as shown
in Fig. 43(a).
b) After Training: Tensor decomposition can be used to decompose filters in a trained network
without impacting the accuracy. It treats weights in a layer as a 4-D tensor and breaks it into
a combination of smaller tensors (i.e., several layers). Low-rank approximation can then be
applied to further increase the compression rate at the cost of accuracy degradation, which can
be restored by fine-tuning the weights.
This approach is demonstrated using Canonical Polyadic (CP) decomposition, a high-order
extension of singular value decomposition that can be solved by various methods, such as a
greedy algorithm [155] or a non-linear least-square method [156]. Combining CP-decomposition
with low-rank approximation achieves a 4.5× speed-up on CPUs [156]. However, CP-decomposition
cannot be computed in a numerically stable way when the dimension of the tensor, which
represents the weights, is larger than two [156]. To alleviate this problem, Tucker
decomposition is adopted instead in [157].
4) Knowledge Distillation: Using a deep network or averaging the predictions of different models
(i.e., ensemble) gives a better accuracy than using a single shallower network. However, the
computational complexity is also higher. To get the best of both worlds, knowledge distillation
transfers the knowledge learned by the complex model (teacher) to the simpler model (student).
The student network can therefore achieve an accuracy that would be unachievable if it was
directly trained with the same dataset [158,159]. For example, [160] shows how using knowledge
distillation can improve the speech recognition accuracy of a student net by 2%, which is
similar to the accuracy of a teacher net that is composed of an ensemble of 10 networks.
Fig. 45 shows the simplest knowledge distillation method [158]. The softmax layer is commonly
used as the output layer in the image classification networks to generate the class
probabilities from the class scores¹²; it squashes the class scores into values between 0 and 1
that sum up to 1. For this knowledge distillation method, soft targets (values between 0 and 1)
such as the class scores of the teacher DNN (or an ensemble of teacher DNNs) are used instead of
the hard targets (values of either 0 or 1) such as the labels in the dataset. The objective is
to minimize the squared difference between the soft targets and the class scores of the student
DNN. Class scores are used as the soft targets instead of the class probabilities because small
values in the class scores contain important information that may be eliminated by the softmax.
Alternatively, class probabilities after the softmax layer can be used as soft targets if the
softmax is configured to generate softer class probabilities where the smaller values retain
more information [160]. Finally, the intermediate representations of

¹² Also commonly referred to as logits.
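The two kinds of soft targets described for knowledge distillation can be sketched in a few
lines. The first function computes the squared difference between teacher and student class
scores; the second shows one common way to obtain softer class probabilities, namely dividing
the class scores by a temperature before the softmax. The temperature parameter, the function
names and the example scores are assumptions for illustration and are not taken from [158] or
[160].

```python
import math

def softmax(scores, temperature=1.0):
    """Class probabilities from class scores; a higher temperature gives softer outputs."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                           # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def squared_difference(teacher_scores, student_scores):
    """Distillation objective on class scores: sum of squared differences."""
    return sum((t - s) ** 2 for t, s in zip(teacher_scores, student_scores))

# Hypothetical class scores for three classes.
teacher_scores = [6.0, 2.0, -1.0]
student_scores = [5.0, 2.5, -0.5]

loss = squared_difference(teacher_scores, student_scores)
hard_like = softmax(teacher_scores)                   # close to a one-hot target
soft = softmax(teacher_scores, temperature=4.0)       # smaller classes keep more mass
```

With the higher temperature, the probability of the top class decreases and the remaining
classes retain more of the information that a standard softmax would nearly eliminate.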
Fig. 45. Knowledge distillation matches the class scores of a small DNN to an ensemble of large
DNNs.
the teacher DNN can also be incorporated as the extra hints to train the student DNN [161].

VIII. BENCHMARKING METRICS FOR DNN EVALUATION AND COMPARISON
As we have seen in this article, there has been a significant amount of research on efficient
processing of DNNs. We should consider several key metrics to compare the various strengths and
weaknesses of different designs and proposed techniques. These metrics should cover important
attributes such as accuracy/robustness, power/energy consumption, throughput/latency and cost.
Reporting all these metrics is important in order to provide a complete picture of the
trade-offs made by a proposed design or technique. We have prepared a website to collect these
metrics from various publications [162].
In terms of accuracy and robustness, it is important that the accuracy be reported on
widely-accepted datasets as discussed in Section IV. The difficulty of the dataset and/or task
should be considered when measuring the accuracy. For instance, the MNIST dataset for digit
recognition is significantly easier than the ImageNet dataset. As a result, a DNN that performs
well on MNIST may not necessarily perform well on ImageNet. Thus it is important that the same
dataset and task is used when comparing the accuracy of different DNN models; currently ImageNet
is preferred since it presents a challenge for DNNs, as opposed to MNIST, which can also be
addressed with simple non-DNN techniques. To demonstrate primarily hardware innovations, it
would be desirable to report results for widely-used DNN models (e.g., AlexNet, GoogLeNet) whose
accuracy and robustness have been well studied and tested.
Energy and power are important when processing DNNs at the edge in embedded devices with limited
battery capacity (e.g., smart phones, smart sensors, UAVs, and wearables), or in the cloud in
data centers with stringent power ceilings due to cooling costs, respectively. Edge processing
is preferred over the cloud for certain applications due to latency, privacy or communication
bandwidth limitations. When evaluating the power and energy consumption, it is important to
account for all aspects of the system including the chip and external memory accesses.
High throughput is necessary to deliver real-time performance for interactive applications such
as navigation and robotics. For data analytics, high throughput means that more data can be
analyzed in a given amount of time. As the amount of visual data is growing exponentially,
high-throughput big data analytics becomes important, particularly if an action needs to be
taken based on the analysis (e.g., security or terrorist prevention; medical diagnosis).
Low latency is necessary for real-time interactive applications. Latency measures the time
between when the pixel arrives to a system and when the result is generated. Latency is measured
in terms of seconds, while throughput is measured in operations/second. Often high throughput is
obtained by batching multiple images/frames together for processing; this results in multiple
frame latency (e.g., at 30 frames per second, a batch of 100 frames results in a 3 second
delay). This delay is not acceptable for real-time applications, such as high-speed navigation
where it would reduce the time available for course correction. Thus achieving low latency and
high throughput simultaneously can be a challenge.
Hardware cost is in large part dictated by the amount of on-chip storage and the number of
cores. Typical embedded processors have limited on-chip storage on the order of a few hundred
kilobytes. Since there is a trade-off between the amount of on-chip memory and the external
memory bandwidth, both metrics should be reported. Similarly, there is a correlation between the
number of cores and the throughput. In addition, while many cores can be built on a chip, the
number of cores that can actually be used at a given time should be reported. It is often
unrealistic to assume peak utilization and performance due to limitations of mapping and memory
bandwidth. Accordingly, the power and throughput should be reported for running actual DNNs as
opposed to only reporting theoretical limits.

A. Metrics for DNN Models
To evaluate the properties of a given DNN model, we should consider the following metrics:
• The accuracy of the model in terms of the top-5 error on datasets such as ImageNet. Also, the
  type of data augmentation used (e.g., multiple crops, ensemble models) should be reported.
• The network architecture of the model should be reported, including number of layers, filter
  sizes, number of filters and number of channels.
• The number of weights impacts the storage requirement of the model and should be reported. If
  possible, the number of non-zero weights should be reported since this reflects the
  theoretical minimum storage requirements.
• The number of MACs that needs to be performed should be reported as it is somewhat indicative
  of the number of operations and potential throughput of the given DNN. If possible, the number
  of non-zero MACs should also be reported since this reflects the theoretical minimum compute
  requirements.
Table IV shows how these metrics are reported for various well known DNNs. The accuracy is
reported for the case where only a single crop for a single model is used for classification,
such that the number of weights and MACs in the table are
consistent. Accounting for non-zero (NZ) operations significantly reduces the number of MACs and
weights. Since the number of NZ MACs depends on the input data, we propose using the publicly
available 50,000 validation images from ImageNet for the computation. Finally, there are various
methods to reduce the weights in a DNN (e.g., network pruning in Section VII-B2). Table IV shows
another example of these DNN model metrics, by comparing sparse DNNs pruned using [142] to dense
DNNs.
<<TABLE>>
TABLE IV
METRICS FOR POPULAR DNN MODELS. SPARSITY IS ACCOUNTED FOR BY
REPORTING NON-ZERO (NZ) WEIGHTS AND MACS.

B. Metrics for DNN Hardware
To measure the efficiency of the DNN hardware, we should consider the following additional
metrics:
• The power and energy consumption of the design should be reported for various DNN models; the
  DNN model specifications should be provided including which layers and bit precision are
  supported by the hardware during measurement. In addition, the amount of off-chip accesses
  (e.g., DRAM accesses) should be included since it accounts for a significant portion of the
  system power; it can be reported in terms of the total amount of data that is read and written
  off-chip per inference.
• The latency and throughput should be reported in terms of the batch size and the actual run
  time for various DNN models, which accounts for mapping and memory bandwidth effects. This
  provides a more useful and informative metric than peak throughput.
• The cost of the chip depends on the area efficiency, which accounts for the size and type of
  memory (e.g., registers or SRAM) and the amount of control logic. It should be reported in
  terms of the core area in squared millimeters per multiplier along with process technology.
In terms of cost, different platforms will have different implementation-specific metrics. For
instance, for an FPGA, the specific device should be reported, along with the utilization of
resources such as DSP, BRAM, LUT and FF; performance density such as GOPs/slice can also be
reported.
Each processor should report various specifications for each metric as shown in Table V, using
the Eyeriss chip as an example. It is important that all metrics and specifications are
accounted for in order to fairly evaluate all the design trade-offs. For instance, without the
accuracy given for a specific dataset and task, one could run a simple DNN and easily claim low
power, high throughput, and low cost; however, the processor might not be usable for a
meaningful task. Alternatively, without reporting the off-chip bandwidth, one could build a
processor with only multipliers and easily claim low cost, high throughput, high accuracy, and
low chip power; however, when evaluating system power, the off-chip memory access would be
substantial. Finally, the test setup should also be reported, including whether the results are
measured or obtained from simulation.
In summary, the evaluation process for whether a DNN system is a viable solution for a given
application might go as follows: (1) the accuracy determines if it can perform the given task;
(2) the latency and throughput determine if it can run fast enough and in real-time; (3) the
energy and power consumption will primarily dictate the form factor of the device where the
processing can operate; (4) the cost, which is primarily dictated by the chip area, determines
how much one would pay for this solution.

IX. SUMMARY
The use of deep neural networks (DNNs) has seen explosive growth in the past few years. They are
currently widely used for many artificial intelligence (AI) applications including computer
vision, speech recognition and robotics and are often delivering better than human accuracy.
However, while DNNs can deliver this outstanding accuracy, it comes at the cost of high
computational complexity. Consequently, techniques that enable efficient processing of deep
neural networks to improve energy-efficiency and throughput without sacrificing accuracy with
cost-effective hardware are critical to expanding the deployment of DNNs in both existing and
new domains.
Creating a system for efficient DNN processing should begin with understanding the current and
future applications and the specific computations required both now and the potential evolution
of those computations. This article surveys a number of the current applications, focusing on
computer vision applications, the associated algorithms, and the data being used to drive the
algorithms. These applications, algorithms and input data are experiencing rapid change. So
extrapolating these trends to determine the degree of flexibility desired to handle next
generation computations becomes an important ingredient of any design project.

¹³ Data augmentation is often used to increase accuracy. This includes using multiple crops of
an image to account for misalignment; in addition, an ensemble of multiple models can be used
where each model has different weights due to different training settings, such as using
different initializations or datasets, or even different network architectures. If multiple
crops and models are used, then the number of MACs and weights required would increase.
<<TABLE>>
TABLE V
EXAMPLE BENCHMARK METRICS FOR EYERISS [94].
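To make the latency and throughput metrics discussed for DNN hardware concrete, the following
sketch reproduces the batching example from the text (a batch of 100 frames at 30 frames per
second) and derives throughput from a batch size and a run time. The function names and the
run time of 0.5 seconds are hypothetical and are not measurements of any chip.

```python
def batching_delay(batch_size, frame_rate):
    """Seconds needed to collect batch_size frames arriving at frame_rate (frames/s)."""
    return batch_size / frame_rate

def throughput_and_latency(batch_size, run_time_s):
    """Return (inferences per second, latency in seconds) for one processed batch."""
    return batch_size / run_time_s, run_time_s

delay = batching_delay(100, 30)                        # roughly 3.3 seconds
throughput, latency = throughput_and_latency(16, 0.5)  # hypothetical run time
```

Reporting the batch size together with the run time, as suggested above, lets a reader recover
both numbers rather than only a peak throughput.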
During the design-space exploration process, it is critical to understand and balance the
important system metrics. For DNN computation these include the accuracy, energy, throughput and
hardware cost. Evaluating these metrics is, of course, key, so this article surveys the
important components of a DNN workload. Specifically, a DNN workload has two major components.
First, the workload is the form of each DNN network including the shape of each layer and the
interconnections between layers. These can vary both within and between applications. Second,
the workload consists of the specific data input to the DNN. This data will vary with the input
set used for training or the data input during operation for inference.
This article also surveys a number of avenues that prior work has taken to optimize DNN
processing. Since data movement dominates energy consumption, a primary focus of some recent
research has been to reduce data movement while maintaining accuracy, throughput and cost. This
means selecting architectures with favorable memory hierarchies like a spatial array, and
developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. We
have included a taxonomy of dataflows and an analysis of their characteristics. Other work is
presented that aims to save space and energy by changing the representation of data values in
the DNN. Still other work saves energy and sometimes increases throughput by exploiting the
sparsity of weights and/or activations.
The DNN domain also affords an excellent opportunity for joint hardware/software co-design. For
example, various efforts have noted that efficiency can be improved by increasing sparsity
(increasing the number of zero values) or optimizing the representation of data by reducing the
precision of values or using more complex mappings of the stored value to the actual value used
for computation. However, to avoid losing accuracy it is often useful to modify the network or
fine-tune the network's weights to accommodate these changes. Thus, this article both reviews a
variety of these techniques and discusses the frameworks that are available for describing,
running and training networks.
Finally, DNNs afford the opportunity to use mixed-signal circuit design and advanced
technologies to improve efficiency. These include using memristors for analog computation and
3-D stacked memory. Advanced technologies can also facilitate moving computation closer to the
source by embedding computation near or within the sensor and the memories. Of course, all of
these techniques should also be considered in combination, while being careful to understand
their interactions and looking for opportunities for joint hardware/algorithm co-optimization.
In conclusion, although much work has been done, deep neural networks remain an important area
of research with many promising applications and opportunities for innovation at various levels
of hardware design.

ACKNOWLEDGMENTS
Funding provided by DARPA YFA, MIT CICS, and gifts from Nvidia and Intel. The authors thank the
anonymous reviewers as well as James Noraky, Mehul Tikekar and Zhengdong Zhang for providing
valuable feedback on this paper.

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553,
pp. 436–444, May 2015.
[2] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He,
J. Williams et al., “Recent advances in deep learning for speech research at Microsoft,” in
ICASSP, 2013.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” in NIPS, 2012.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct
perception in autonomous driving,” in ICCV, 2015.
[5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun,
“Dermatologist-level classification of skin
cancer with deep neural networks,”Nature, vol. 542, no. 7639, [25]J. Zhou and O. G. Troyanskaya, “Predicting effects of noncod-
pp. 115118, 2017. ing variants with deep learning-based sequence model,”Nature
[6]D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, methods, vol. 12, no. 10, pp. 931934, 2015.
G. van den Driessche, J. Schrittwieser, I. Antonoglou, [26]B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey,
V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, “Predicting the sequence specificities of dna-and rna-binding
J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, proteins by deep learning,”Nature biotechnology, vol. 33, no. 8,
K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the pp. 831838, 2015.
game of Go with deep neural networks and tree search,”Nature, [27]H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, “Convolu-
vol. 529, no. 7587, pp. 484489, Jan. 2016. tional neural network architectures for predicting dnaprotein
[7]F.-F. Li, A. Karpathy, and J. Johnson, “Stanford CS class binding,”Bioinformatics, vol. 32, no. 12, pp. i121i127, 2016.
CS231n: Convolutional Neural Networks for Visual Recogni- [28]M. Jermyn, J. Desroches, J. Mercier, M.-A. Tremblay, K. St-
tion,” http://cs231n.stanford.edu/. Arnaud, M.-C. Guiot, K. Petrecca, and F. Leblond, “Neural net-
[8]P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, works improve brain cancer detection with raman spectroscopy
J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, in the presence of operating room light artifacts,”Journal of
Y. Nakamuraet al., “A million spiking-neuron integrated circuit Biomedical Optics, vol. 21, no. 9, pp. 094002094002, 2016.
with a scalable communication network and interface,”Science, [29]D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck,
vol. 345, no. 6197, pp. 668673, 2014. “Deep learning for identifying metastatic breast cancer,”arXiv
[9]S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, preprint arXiv:1606.05718, 2016.
R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, [30]L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Rein-
T. Melano, D. R. Barchet al., “Convolutional networks for forcement learning: A survey,”Journal of artificial intelligence
fast, energy-efficient neuromorphic computing,”Proceedings research, vol. 4, pp. 237285, 1996.
of the National Academy of Sciences, 2016. [31]V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
[10]M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of D. Wierstra, and M. Riedmiller, “Playing Atari with Deep
convolutional networks through FFTs,” inICLR, 2014. Reinforcement Learning,” inNIPS Deep Learning Workshop,
[11]Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, 2013.
I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard, [32]S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end
“Handwritten digit recognition: applications of neural network training of deep visuomotor policies,”Journal of Machine
chips and automatic learning,”IEEE Commun. Mag., vol. 27, Learning Research, vol. 17, no. 39, pp. 140, 2016.
no. 11, pp. 4146, Nov 1989. [33]M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena,
[12]B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in “From Perception to Decision: A Data-driven Approach to End-
1960 IRE WESCON Convention Record, 1960. to-end Motion Planning for Autonomous Ground Robots,” in
[13]B. Widrow, “Thinking about thinking: the discovery of the ICRA, 2017.
LMS algorithm,”IEEE Signal Process. Mag., 2005. [34]S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik,
[14]O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, “Cognitive mapping and planning for visual navigation,” in
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, CVPR, 2017.
and L. Fei-Fei, “ImageNet Large Scale Visual Recognition [35]T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep
Challenge,”International Journal of Computer Vision (IJCV), control policies for autonomous aerial vehicles with mpc-guided
vol. 115, no. 3, pp. 211252, 2015. policy search,” inICRA, 2016.
[15]K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning [36]S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-
for Image Recognition,” inCVPR, 2016. agent, reinforcement learning for autonomous driving,” inNIPS
[16]“Complete Visual Networking Index (VNI) Forecast,” Cisco, Workshop on Learning, Inference and Control of Multi-Agent
June 2016. Systems, 2016.
[17]J. Woodhouse, “Big, big, big data: higher and higher resolution [37]N. Hemsoth, “The Next Wave of Deep Learning Applications,”
video surveillance,” technology.ihs.com, January 2016. Next Platform, September 2016.
[18]R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich [38]S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Feature Hierarchies for Accurate Object Detection and Semantic Neural computation, vol. 9, no. 8, pp. 17351780, 1997.
Segmentation,” inCVPR, 2014. [39]T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramab-
[19]J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional hadran, “Deep convolutional neural networks for LVCSR,” in
Networks for Semantic Segmentation,” inCVPR, 2015. ICASSP, 2013.
[20]K. Simonyan and A. Zisserman, “Two-stream convolutional [40]V. Nair and G. E. Hinton, “Rectified Linear Units Improve
networks for action recognition in videos,” inNIPS, 2014. Restricted Boltzmann Machines,” inICML, 2010.
[21]G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, [41]A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlin-
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainathet al., “Deep earities improve neural network acoustic models,” inICML,
neural networks for acoustic modeling in speech recognition: 2013.
The shared views of four research groups,”IEEE Signal Process. [42]K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into
Mag., vol. 29, no. 6, pp. 8297, 2012. rectifiers: Surpassing human-level performance on imagenet
[22]R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, classification,” inICCV, 2015.
and P. Kuksa, “Natural language processing (almost) from [43]D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and
scratch,”Journal of Machine Learning Research, vol. 12, no. Accurate Deep Network Learning by Exponential Linear Units
Aug, pp. 24932537, 2011. (ELUs),”ICLR, 2016.
[23]A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, [44]X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving
O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and deep neural network acoustic models using generalized maxout
K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” networks,” inICASSP, 2014.
CoRR abs/1609.03499, 2016. [45]Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, , C. Laurent,
[24]H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, Y. Bengio, and A. Courville, “Towards End-to-End Speech
D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, Recognition with Deep Convolutional Neural Networks,” in
T. R. Hugheset al., “The human splicing code reveals new Interspeech, 2016.
insights into the genetic determinants of disease,”Science, vol. [46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
347, no. 6218, p. 1254806, 2015. shick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional
architecture for fast feature embedding,” inACM International [75]J. Cong and B. Xiao, “Minimizing computation in convolutional
Conference on Multimedia, 2014. neural networks,” inICANN, 2014.
[47]S. Ioffe and C. Szegedy, “Batch normalization: Accelerating [76]A. Lavin and S. Gray, “Fast algorithms for convolutional neural
deep network training by reducing internal covariate shift,” in networks,” inCVPR, 2016.
ICML, 2015. [77]“Intel Math Kernel Library,” https://software.intel.com/en-us/
[48]Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient- mkl.
based learning applied to document recognition,”Proc. IEEE, [78]S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran,
vol. 86, no. 11, pp. 22782324, Nov 1998. B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient Primitives
[49]P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and for Deep Learning,”arXiv preprint arXiv:1410.0759, 2014.
Y. LeCun, “OverFeat: Integrated Recognition, Localization and [79]M. Horowitz, “Computings energy problem (and what we can
Detection using Convolutional Networks,” inICLR, 2014. do about it),” inISSCC, 2014.
[50]K. Simonyan and A. Zisserman, “Very Deep Convolutional [80]Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Archi-
Networks for Large-Scale Image Recognition,” inICLR, 2015. tecture for Energy-Efficient Dataflow for Convolutional Neural
[51]C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, Networks,” inISCA, 2016.
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper [81]——, “Using Dataflow to Optimize Energy Efficiency of Deep
With Convolutions,” inCVPR, 2015. Neural Network Accelerators,”IEEE Micros Top Picks from the
[52]M. Lin, Q. Chen, and S. Yan, “Network in Network,” inICLR, Computer Architecture Conferences, vol. 37, no. 3, May-June
2014. 2017.
[53]C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, [82]M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Dur-
“Rethinking the inception architecture for computer vision,” in danovic, E. Cosatto, and H. P. Graf, “A Massively Parallel
CVPR, 2016. Coprocessor for Convolutional Neural Networks,” inASAP,
[54]C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception- 2009.
v4, Inception-ResNet and the Impact of Residual Connections [83]V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, “Towards an
on Learning,” inAAAI, 2017. embedded biologically-inspired machine vision processor,” in
[55]G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, FPT, 2010.
R. Caruana, A. Mohamed, M. Philipose, and M. Richardson, [84]S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi,
“Do Deep Convolutional Nets Really Need to be Deep and “A Dynamically Configurable Coprocessor for Convolutional
Convolutional?”ICLR, 2017. Neural Networks,” inISCA, 2010.
[56]“Caffe LeNet MNIST,” http://caffe.berkeleyvision.org/gathered/ [85]V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello,
examples/mnist.html. “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks,”
[57]“Caffe Model Zoo,” http://caffe.berkeleyvision.org/modelzoo. inCVPR Workshop, 2014.
html. [86]S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, “A
[58]“Matconvnet Pretrained Models,” http://www.vlfeat.org/ 1.93TOPS/W scalable deep learning/inference processor with
matconvnet/pretrained/. tetra-parallel MIMD architecture for big-data applications,” in
[59]“TensorFlow-Slim image classification library,” https://github. ISSCC, 2015.
com/tensorflow/models/tree/master/slim. [87]L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and
[60]“Deep Learning Frameworks,” https://developer.nvidia.com/ L. Benini, “Origami: A Convolutional Network Accelerator,”
deep-learning-frameworks. inGLVLSI, 2015.
[61]Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An [88]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan,
Energy-Efficient Reconfigurable Accelerator for Deep Convolu- “Deep Learning with Limited Numerical Precision,” inICML,
tional Neural Networks,”IEEE J. Solid-State Circuits, vol. 51, 2015.
no. 1, 2017. [89]Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo,
[62]C. J. B. Yann LeCun, Corinna Cortes, “THE MNIST X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting
DATABASE of handwritten digits,” http://yann.lecun.com/exdb/ Vision Processing Closer to the Sensor,” inISCA, 2015.
mnist/. [90]M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal,
[63]L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Memory-centric accelerator design for Convolutional Neural
“Regularization of neural networks using dropconnect,” inICML, Networks,” inICCD, 2013.
2013. [91]C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Opti-
[64]A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” mizing FPGA-based Accelerator Design for Deep Convolutional
https://www.cs.toronto.edu/ kriz/cifar.html. Neural Networks,” inFPGA, 2015.
[65]A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny [92]T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and
images: A large data set for nonparametric object and scene O. Temam, “DianNao: A Small-footprint High-throughput
recognition,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, Accelerator for Ubiquitous Machine-learning,” inASPLOS,
no. 11, pp. 19581970, 2008. 2014.
[66]A. Krizhevsky and G. Hinton, “Convolutional deep belief [93]Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li,
networks on cifar-10,”Unpublished manuscript, vol. 40, 2010. T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A
[67]B. Graham, “Fractional max-pooling,” arXiv preprint Machine-Learning Supercomputer,” inMICRO, 2014.
arXiv:1412.6071, 2014. [94]Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An
[68]“Pascal VOC data sets,” http://host.robots.ox.ac.uk/pascal/ Energy-Efficient Reconfigurable Accelerator for Deep Convo-
VOC/. lutional Neural Networks,” inISSCC, 2016.
[69]“Microsoft Common Objects in Context (COCO) dataset,” http: [95]V. Sze, M. Budagavi, and G. J. Sullivan, “High Efficiency Video
//mscoco.org/. Coding (HEVC): Algorithms and Architectures,” inIntegrated
[70]“Google Open Images,” https://github.com/openimages/dataset. Circuit and Systems. Springer, 2014, pp. 1375.
[71]“YouTube-8M,” https://research.google.com/youtube8m/. [96]M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer
[72]“AudioSet,” https://research.google.com/audioset/index.html. CNN accelerators,” inMICRO, 2016.
[73]S. Condon, “Facebook unveils Big Basin, new server geared [97]D. Keitel-Schulz and N. Wehn, “Embedded DRAM develop-
for deep learning,” ZDNet, March 2017. ment: Technology, physical design, and application issues,”
[74] C. Dubout and F. Fleuret, “Exact acceleration of linear object IEEE Des. Test. Comput., vol. 18, no. 3, pp. 715, 2001.
detectors,” inECCV, 2012. [98]J. Jeddeloh and B. Keeth, “Hybrid memory cube new DRAM
architecture increases density and performance,” inSymp. on and modularized RTL compilation of Convolutional Neural
VLSI, 2012. Networks onto FPGA,” inFPL, 2016.
[99]J. Standard, “High bandwidth memory (HBM) DRAM,” [122]P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented
JESD235, 2013. Approximation of Convolutional Neural Networks,” inICLR,
[100]D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopad- 2016.
hyay, “Neurocube: A programmable digital neuromorphic [123]S. Higginbotham, “Google Takes Unconventional Route with
architecture with high-density 3D memory,” inISCA, 2016. Homegrown Machine Learning Chips,” Next Platform, May
[101]M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, 2016.
“TETRIS: Scalable and Efficient Neural Network Acceleration [124]T. P. Morgan, “Nvidia Pushes Deep Learning Inference With
with 3D Memory,” inASPLOS, 2017. New Pascal GPUs,” Next Platform, September 2016.
[102]J. Zhang, Z. Wang, and N. Verma, “A machine-learning [125]P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and
classifier implemented in a standard 6T SRAM array,” inSymp. A. Moshovos, “Stripes: Bit-serial deep neural network comput-
on VLSI, 2016. ing,” inMICRO, 2016.
[103]Z. Wang, R. Schapire, and N. Verma, “Error-adaptive classifier [126]B. Moons and M. Verhelst, “A 0.32.6 TOPS/W precision-
boosting (EACB): Exploiting data-driven training for highly scalable processor for real-time large-scale ConvNets,” inSymp.
fault-tolerant hardware,” inICASSP, 2014. on VLSI, 2016.
[104]A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, [127]M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect:
J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: Training deep neural networks with binary weights during
A Convolutional Neural Network Accelerator with In-Situ propagations,” inNIPS, 2015.
Analog Arithmetic in Crossbars,” inISCA, 2016. [128]M. Courbariaux and Y. Bengio, “Binarynet: Training deep
[105]L. Chua, “Memristor-the missing circuit element,”IEEE Trans. neural networks with weights and activations constrained to+
Circuit Theory, vol. 18, no. 5, pp. 507519, 1971. 1 or-1,”arXiv preprint arXiv:1602.02830, 2016.
[106]L. Wilson, “International technology roadmap for semiconduc- [129]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-
tors (ITRS),”Semiconductor Industry Association, 2013. Net: ImageNet Classification Using Binary Convolutional
[107]Lu, Darsen, “Tutorial on Emerging Memory Devices,” 2016. Neural Networks,” inECCV, 2016.
[108]S. B. Eryilmaz, S. Joshi, E. Neftci, W. Wan, G. Cauwenberghs, [130]Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with
and H.-S. P. Wong, “Neuromorphic architectures with electronic low precision by half-wave gaussian quantization,” inCVPR,
synapses,” inISQED, 2016. 2017.
[109]P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, [131]F. Li and B. Liu, “Ternary weight networks,” inNIPS Workshop
Y. Wang, and Y. Xie, “PRIME: A Novel Processing-In-Memory on Efficient Methods for Deep Neural Networks, 2016.
Architecture for Neural Network Computation in ReRAM-based [132]C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained Ternary
Main Memory,” inISCA, 2016. Quantization,”ICLR, 2017.
[110]M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. [133]R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An
Likharev, and D. B. Strukov, “Training and operation of Ultra-Low Power Convolutional Neural Network Accelerator
an integrated neuromorphic network based on metal-oxide Based on Binary Weights,” inISVLSI, 2016.
memristors,”Nature, vol. 521, no. 7550, pp. 6164, 2015. [134]K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato,
[111]J. Zhang, Z. Wang, and N. Verma, “A matrix-multiplying ADC H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, and
implementing a machine-learning classifier directly with data M. Kuroda, T.and Motomura, “BRein Memory: A 13-Layer
conversion,” inISSCC, 2015. 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable
[112]E. H. Lee and S. S. Wong, “A 2.5 GHz 7.7 TOPS/W switched- In-Memory Deep Neural Network Accelerator in 65nm CMOS,”
capacitor matrix multiplier with co-designed local memory in inSymp. on VLSI, 2017.
40nm,” inISSCC, 2016. [135]D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional
[113]R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, Neural Networks using Logarithmic Data Representation,”
“RedEye: analog ConvNet image sensor architecture for contin- arXiv preprint arXiv:1603.01025, 2016.
uous mobile vision,” inISCA, 2016. [136]A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental
[114]A. Wang, S. Sivaramakrishnan, and A. Molnar, “A 180nm Network Quantization: Towards Lossless CNNs with Low-
CMOS image sensor with on-chip optoelectronic image com- precision Weights,” inICLR, 2017.
pression,” inCICC, 2012. [137]W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen,
[115]H. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrish- “Compressing Neural Networks with the Hashing Trick,” in
nan, A. Veeraraghavan, and A. Molnar, “ASP Vision: Optically ICML, 2015.
Computing the First Layer of Convolutional Neural Networks [138]J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger,
using Angle Sensitive Pixels,” inCVPR, 2016. and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep
[116]A. Suleiman and V. Sze, “Energy-efficient HOG-based object neural network computing,” inISCA, 2016.
detection at 1080HD 60 fps with multi-scale support,” inSiPS, [139]B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K.
2014. Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks,´
[117]E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “Minerva: Enabling low-power, highly-accurate deep neural
“Lognet: Energy-Efficient Neural Networks Using Logrithmic network accelerators,” inISCA, 2016.
Computations,” inICASSP, 2017. [140]Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain
[118]S. Han, H. Mao, and W. J. Dally, “Deep Compression: Damage,” inNIPS, 1990.
Compressing Deep Neural Networks with Pruning, Trained [141]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights
Quantization and Huffman Coding,” inICLR, 2016. and connections for efficient neural networks,” inNIPS, 2015.
[119] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Ben- [142]T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efficient
gio, “Quantized neural networks: Training neural networks Convolutional Neural Networks using Energy-Aware Pruning,”
with low precision weights and activations,”arXiv preprint inCVPR, 2017.
arXiv:1609.07061, 2016. [143]“DNN Energy Estimation,” http://eyeriss.mit.edu/energy.html.
[120]S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa- [144]R. Dorrance, F. Ren, and D. Markovic, “A scalable sparse´
Net: Training low bitwidth convolutional neural networks with matrix-vector multiplication kernel for energy-efficient sparse-
low bitwidth gradients,”arXiv preprint arXiv:1606.06160, 2016. blas on FPGAs,” inISFPGA, 2014.
[121]Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, “Scalable [145]S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz,
and W. J. Dally, “EIE: efficient inference engine on compressed
deep neural network,” inISCA, 2016.
[146]A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn:
An accelerator for compressed-sparse convolutional neural
networks,” inISCA, 2017.
[147]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning
structured sparsity in deep neural networks,” inNIPS, 2016.
[148]S. Anwar, K. Hwang, and W. Sung, “Structured pruning of
deep convolutional neural networks,”ACM Journal of Emerging
Technologies in Computing Systems, vol. 13, no. 3, p. 32, 2017.
[149]J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and
S. Mahlke, “Scalpel: Customizing dnn pruning to the underlying
hardware parallelism,” inISCA, 2017.
[150]H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally,
“Exploring the regularity of sparse structure in convolutional
neural networks,” inCVPR Workshop on Tensor Methods In
Computer Vision, 2017.
[151]J. S. Lim, “Two-dimensional signal and image processing,”
Englewood Cliffs, NJ, Prentice Hall, 1990, 710 p., 1990.
[152]F. Chollet, “Xception: Deep Learning With Depthwise Separa-
ble Convolutions,”CVPR, 2017.
[153]A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient
convolutional neural networks for mobile vision applications,”
arXiv preprint arXiv:1704.04861, 2017.
[154]F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J.
Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy
with 50x fewer parameters and<1MB model size,”ICLR,
2017.
[155]E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus,
“Exploiting Linear Structure Within Convolutional Networks
for Efficient Evaluation,” inNIPS, 2014.
[156]V. Lebedev, Y. Ganin, M. Rakhuba1, I. Oseledets, and V. Lem-
pitsky, “Speeding-Up Convolutional Neural Networks Using
Fine-tuned CP-Decomposition,”ICLR, 2015.
[157]Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin,
“Compression of Deep Convolutional Neural Networks for Fast
and Low Power Mobile Applications,” inICLR, 2016.
[158]C. Bucilu, R. Caruana, and A. Niculescu-Mizil, “Model
Compression,” inSIGKDD, 2006.
[159]L. Ba and R. Caurana, “Do Deep Nets Really Need to be
Deep?”NIPS, 2014.
[160]G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge
in a Neural Network,” inNIPS Deep Learning Workshop, 2014.
[161]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
Y. Bengio, “Fitnets: Hints for Thin Deep Nets,”ICLR, 2015.
[162]“Benchmarking DNN Processors,” http://eyeriss.mit.edu/benchmarking.html.
<<END> <<END>> <<END>>
<<START>> <<START>> <<START>>
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Abstract
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
1. Introduction
Scaling up ConvNets is widely used to achieve better accuracy. For example, ResNet (He et al., 2016) can be scaled up from ResNet-18 to ResNet-200 by using more layers; recently, GPipe (Huang et al., 2018) achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline model four times larger. However, the process of scaling up ConvNets has never been well understood and there are currently many ways to do it. The most common way is to scale up ConvNets by their depth (He et al., 2016) or width (Zagoruyko & Komodakis, 2016). Another less common, but increasingly popular, method is to scale up models by image resolution (Huang et al., 2018). In previous work, it is common to scale only one of the three dimensions: depth, width, and image size. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling requires tedious manual tuning and still often yields sub-optimal accuracy and efficiency.
<<FIGURE>>
Figure 1. Model Size vs. ImageNet Accuracy. All numbers are for single-crop, single-model. Our EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 accuracy while being 8.4x smaller and 6.1x faster than GPipe. EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152. Details are in Tables 2 and 4.
<<FIGURE>>
Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.
In this paper, we want to study and rethink the process of scaling up ConvNets. In particular, we investigate the central question: is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency? Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with a constant ratio. Based on this observation, we propose a simple yet effective compound scaling method. Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha^N$, width by $\beta^N$, and image size by $\gamma^N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. Figure 2 illustrates the difference between our scaling method and conventional methods.
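The compound scaling rule (multiplying depth, width, and image size by fixed per-dimension constants raised to a common exponent N) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `compound_scale`, the baseline numbers, and the coefficient values in the usage line are hypothetical placeholders; in the paper the coefficients come from a small grid search on the baseline model.

```python
import math


def compound_scale(base_depth, base_width, base_resolution, alpha, beta, gamma, n):
    """Scale depth, width and resolution by alpha**n, beta**n and gamma**n.

    alpha, beta, gamma are the per-dimension constants (assumed to be given,
    e.g. found by a grid search); n is the user-chosen compound exponent.
    """
    depth = math.ceil(base_depth * alpha ** n)            # number of layers
    width = math.ceil(base_width * beta ** n)             # number of channels
    resolution = math.ceil(base_resolution * gamma ** n)  # input image side
    return depth, width, resolution


# Hypothetical baseline (18 layers, 64 channels, 224-pixel input) and
# illustrative coefficients; none of these values are taken from the text above.
scaled = compound_scale(18, 64, 224, alpha=1.2, beta=1.1, gamma=1.15, n=2)
print(scaled)
```

Rounding up with `math.ceil` is one possible design choice; a real implementation would likely also round channel counts to hardware-friendly multiples.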
Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image. In fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018) and empirical results (Zagoruyko & Komodakis, 2016) both show that there exists a certain relationship between network width and depth, but to the best of our knowledge, we are the first to empirically quantify the relationship among all three dimensions of network width, depth, and resolution.
We demonstrate that our scaling method works well on existing MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Notably, the effectiveness of model scaling heavily depends on the baseline network; to go even further, we use neural architecture search (Zoph & Le, 2017; Tan et al., 2019) to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets. Figure 1 summarizes the ImageNet performance, where our EfficientNets significantly outperform other ConvNets. In particular, our EfficientNet-B7 surpasses the best existing GPipe accuracy (Huang et al., 2018), while using 8.4x fewer parameters and running 6.1x faster on inference. Compared to the widely used ResNet-50 (He et al., 2016), our EfficientNet-B4 improves the top-1 accuracy from 76.3% to 83.0% (+6.7%) with similar FLOPS. Besides ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on 5 out of 8 widely used datasets, while reducing parameters by up to 21x compared to existing ConvNets.
2. Related Work
ConvNet Accuracy: Since AlexNet (Krizhevsky et al., 2012) won the 2012 ImageNet competition, ConvNets have become increasingly more accurate by going bigger: while the 2014 ImageNet winner GoogleNet (Szegedy et al., 2015) achieves 74.8% top-1 accuracy with about 6.8M parameters, the 2017 ImageNet winner SENet (Hu et al., 2018) achieves 82.7% top-1 accuracy with 145M parameters. Recently, GPipe (Huang et al., 2018) further pushes the state-of-the-art ImageNet top-1 validation accuracy to 84.3% using 557M parameters: it is so big that it can only be trained with a specialized pipeline parallelism library by partitioning the network and spreading each part to a different accelerator. While these models are mainly designed for ImageNet, recent studies have shown better ImageNet models also per.form better across a variety of transfer learning datasets (Kornblith et al., 2019), and other computer vision tasks such as object detection (He et al., 2016; Tan et al., 2019). Although higher accuracy is critical for many applications, we have already hit the hardware memory limit, and thus further accuracy gain needs better efficiency.
ConvNet efficiency: Deep ConvNets are often over-parameterized. Model compression (Han et al., 2016; He et al., 2018; Yang et al., 2018) is a common way to re.duce model size by trading accuracy for efficiency. As mo.bile phones become ubiquitous, it is also common to hand.craft efficient mobile-size ConvNets, such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018), and ShuffleNets (Zhang et al., 2018; Ma et al., 2018). Recently, neural architecture search becomes increasingly popular in designing efficient mobile-size ConvNets (Tan et al., 2019; Cai et al., 2019), and achieves even better efficiency than hand-crafted mobile ConvNets by extensively tuning the network width, depth, convolution kernel types and sizes. However, it is unclear how to apply these techniques for larger models that have much larger design space and much more expensive tuning cost. In this paper, we aim to study model efficiency for super large ConvNets that surpass state-of-the-art accuracy. To achieve this goal, we resort to model scaling.
Model Scaling: There are many ways to scale a ConvNet for different resource constraints: ResNet (He et al., 2016) can be scaled down (e.g., ResNet-18) or up (e.g., ResNet-200) by adjusting network depth (#layers), while WideResNet (Zagoruyko & Komodakis, 2016) and MobileNets (Howard et al., 2017) can be scaled by network width (#channels). It is also well recognized that a bigger input image size helps accuracy at the cost of more FLOPS. Although prior studies (Raghu et al., 2017; Lin & Jegelka, 2018; Sharir & Shashua, 2018; Lu et al., 2018) have shown that network depth and width are both important for ConvNets' expressive power, it remains an open question how to effectively scale a ConvNet to achieve better efficiency and accuracy. Our work systematically and empirically studies ConvNet scaling for all three dimensions of network width, depth, and resolution.
3. Compound Model Scaling
In this section, we will formulate the scaling problem, study different approaches, and propose our new scaling method.
3.1. Problem Formulation
A ConvNet layer i can be defined as a function: <<FORMULA>>, where F_i is the operator, Y_i is the output tensor, and X_i is the input tensor with tensor shape <<FORMULA>>, where H_i and W_i are the spatial dimensions and C_i is the channel dimension.¹ A ConvNet N can be represented by a list of composed layers:
<<FORMULA>>
In practice, ConvNet layers are often partitioned into multiple stages, and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type, except that the first layer performs down-sampling. Therefore, we can define a ConvNet as:
<<FORMULA>>
<<FORMULA>> where <<FORMULA>> denotes that layer F_i is repeated L_i times in stage i, and <<FORMULA>> denotes the shape of the input tensor X of layer i. Figure 2(a) illustrates a representative ConvNet, where the spatial dimension is gradually shrunk but the channel dimension is expanded over layers, for example, from initial input shape ⟨224, 224, 3⟩ to final output shape ⟨7, 7, 512⟩.
1 For the sake of simplicity, we omit the batch dimension.
Unlike regular ConvNet designs that mostly focus on finding the best layer architecture F_i, model scaling tries to expand the network length (L_i), width (C_i), and/or resolution (H_i, W_i) without changing the F_i predefined in the baseline network. By fixing F_i, model scaling simplifies the design problem for new resource constraints, but there still remains a large design space to explore different <<FORMULA>> for each layer. In order to further reduce the design space, we require that all layers be scaled uniformly with a constant ratio. Our target is to maximize the model accuracy for any given resource constraints, which can be formulated as an optimization problem:
<<FORMULA>> (2)
where <<FORMULA>> are coefficients for scaling network width, depth, and resolution; <<FORMULA>> are predefined parameters in baseline network (see Table 1 as an example).
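As a rough illustration of what applying the coefficients d, w, r to the baseline parameters means, the sketch below scales one stage description. The `Stage` class, the `scale_stage` helper, the rounding choices, and the example numbers are assumptions made for illustration; they are not taken from Table 1 or from the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Baseline description of one stage (hypothetical container)."""
    layers: int      # L_i: number of repeated layers in the stage
    in_height: int   # H_i: input height
    in_width: int    # W_i: input width
    channels: int    # C_i: channel dimension


def scale_stage(stage: Stage, d: float, w: float, r: float) -> Stage:
    """Scale depth by d, width (channels) by w, and resolution by r.

    The layer operator F_i is left untouched; only the counts and
    shapes change. Rounding up the layer count is an assumption.
    """
    return Stage(
        layers=math.ceil(d * stage.layers),
        in_height=round(r * stage.in_height),
        in_width=round(r * stage.in_width),
        channels=round(w * stage.channels),
    )


# Hypothetical baseline stage, scaled with d=2.0, w=1.5, r=1.25.
baseline = Stage(layers=2, in_height=112, in_width=112, channels=24)
scaled = scale_stage(baseline, d=2.0, w=1.5, r=1.25)
print(scaled)
```

Because every stage is scaled with the same three ratios, the search space collapses from per-layer choices to just (d, w, r).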
3.2. Scaling Dimensions
The main difficulty of problem 2 is that the optimal d, w, r depend on each other and the values change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in one of these dimensions:
Depth (d): Scaling network depth is the most common way used by many ConvNets (He et al., 2016; Huang et al., 2017; Szegedy et al., 2015; 2016). The intuition is that a deeper ConvNet can capture richer and more complex features, and generalize well on new tasks. However, deeper networks are also more difficult to train due to the vanishing gradient problem (Zagoruyko & Komodakis, 2016). Although several techniques, such as skip connections (He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015), alleviate the training problem, the accuracy gain of very deep networks diminishes: for example, ResNet-1000 has similar accuracy to ResNet-101 even though it has many more layers. Figure 3 (middle) shows our empirical study on scaling a baseline model with different depth coefficients d, further suggesting the diminishing accuracy return for very deep ConvNets.
Width (w): Scaling network width is commonly used for small size models (Howard et al., 2017; Sandler et al., 2018; Tan et al., 2019)². As discussed in (Zagoruyko & Komodakis, 2016), wider networks tend to be able to capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have difficulties in capturing higher level features. Our empirical results in Figure 3 (left) show that the accuracy quickly saturates when networks become much wider with larger w.
<<FIGURE>>
Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients. Bigger networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching 80%, demonstrating the limitation of single dimension scaling. The baseline network is described in Table 1.
Resolution (r): With higher resolution input images, ConvNets can potentially capture more fine-grained patterns. Starting from 224x224 in early ConvNets, modern ConvNets tend to use 299x299 (Szegedy et al., 2016) or 331x331 (Zoph et al., 2018) for better accuracy. Recently, GPipe (Huang et al., 2018) achieves state-of-the-art ImageNet accuracy with 480x480 resolution. Higher resolutions, such as 600x600, are also widely used in object detection ConvNets (He et al., 2017; Lin et al., 2017). Figure 3 (right) shows the results of scaling network resolutions, where indeed higher resolutions improve accuracy, but the accuracy gain diminishes for very high resolutions (r=1.0 denotes resolution 224x224 and r=2.5 denotes resolution 560x560).
The above analyses lead us to the first observation:
Observation 1 – Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.
3.3. Compound Scaling
We empirically observe that different scaling dimensions are not independent. Intuitively, for higher resolution images, we should increase network depth, such that the larger receptive fields can help capture similar features that include more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, in order to capture more fine-grained patterns with more pixels in high resolution images. These intuitions suggest that we need to coordinate and balance different scaling dimensions rather than rely on conventional single-dimension scaling.
2 In some literature, scaling the number of channels is called depth multiplier, which means the same as our width coefficient w.
<<FIGURE>>
Figure 4. Scaling Network Width for Different Baseline Networks. Each dot in a line denotes a model with a different width coefficient (w). All baseline networks are from Table 1. The first baseline network <<FORMULA>> has 18 convolutional layers with resolution 224x224, while the last baseline <<FORMULA>> has 36 layers with resolution 299x299.
To validate our intuitions, we compare width scaling under different network depths and resolutions, as shown in Figure 4. If we only scale network width w without changing depth (d=1.0) and resolution (r=1.0), the accuracy saturates quickly. With deeper (d=2.0) and higher resolution (r=2.0) networks, width scaling achieves much better accuracy under the same FLOPS cost. These results lead us to the second observation:
Observation 2 – In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.
In fact, a few prior works (Zoph et al., 2018; Real et al., 2019) have already tried to arbitrarily balance network width and depth, but they all require tedious manual tuning.
In this paper, we propose a new compound scaling method, which uses a compound coefficient φ to uniformly scale network width, depth, and resolution in a principled way:
<<FORMULA>> (3)
where <<FORMULA>> are constants that can be determined by a small grid search. Intuitively, φ is a user-specified coefficient that controls how many more resources are available for model scaling, while <<FORMULA>> specify how to assign these extra resources to network width, depth, and resolution respectively. Notably, the FLOPS of a regular convolution op is proportional to <<FORMULA>> i.e., doubling network depth will double FLOPS, but doubling network width or resolution will increase FLOPS by four times. Since convolution ops usually dominate the computation cost in ConvNets, scaling a ConvNet with equation 3 will approximately increase total FLOPS by <<FORMULA>> In this paper, we constrain <<FORMULA>> such that for any new <<FORMULA>>, the total FLOPS will approximately³ increase by 2^φ.
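To make the compound scaling rule concrete, the following minimal sketch computes the depth, width, and resolution multipliers for a given compound coefficient and estimates the resulting FLOPS growth using the proportionality described above (linear in depth, quadratic in width and resolution). The function names are illustrative. The default constants (1.2, 1.1, 1.15) are the values reported in the published EfficientNet paper for the B0 baseline; they are elided in this extract, so treat them as an assumption of the sketch.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for coefficient phi.

    Assumes depth = alpha**phi, width = beta**phi, resolution = gamma**phi,
    with each constant at least 1 so that scaling never shrinks the network.
    """
    if min(alpha, beta, gamma) < 1.0:
        raise ValueError("alpha, beta and gamma must each be >= 1")
    return alpha ** phi, beta ** phi, gamma ** phi


def flops_multiplier(d, w, r):
    """Approximate FLOPS growth: linear in depth, quadratic in width and resolution."""
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: d={d:.3f} w={w:.3f} r={r:.3f} "
          f"FLOPS x{flops_multiplier(d, w, r):.2f} (2**phi = {2 ** phi})")
```

With these constants the product alpha * beta**2 * gamma**2 is close to, but not exactly, 2, so the FLOPS multiplier tracks 2**phi only approximately.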
4. EfficientNet Architecture
Since model scaling does not change layer operators F_i in baseline network, having a good baseline network is also critical. We will evaluate our scaling method using existing ConvNets, but in order to better demonstrate the effectiveness of our scaling method, we have also developed a new mobile-size baseline, called EfficientNet.
Inspired by (Tan et al., 2019), we develop our baseline network by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same search space as (Tan et al., 2019), and use <<FORMULA>> as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w=-0.07 is a hyperparameter for controlling the trade-off between accuracy and FLOPS. Unlike (Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not targeting any specific hardware device. Our search produces an efficient network, which we name EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is similar to MnasNet, except our EfficientNet-B0 is slightly bigger due to the larger FLOPS target (our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building block is mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we also add squeeze-and-excitation optimization (Hu et al., 2018).
3 FLOPS may differ from theoretical value due to rounding.
Table 1. EfficientNet-B0 baseline network <<FORMULA>> Each row describes a stage i with L_i layers, with input resolution <<FORMULA>> and output channels C_i. Notations are adopted from equation 2.
<<FORMULA>>
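The search objective described above can be sketched in a few lines. The multiplicative form ACC(m) × [FLOPS(m)/T]^w is the one given in the published paper; since the formula is elided in this extract, it is stated here as an assumption, and the function name and example numbers are purely illustrative.

```python
def search_objective(acc, flops, target_flops=400e6, w=-0.07):
    """Accuracy weighted by a soft FLOPS penalty.

    With a negative exponent w, models above the FLOPS target are
    penalised and models below it are mildly rewarded.
    """
    return acc * (flops / target_flops) ** w


# Hypothetical candidates with equal accuracy but different cost.
on_target = search_objective(acc=0.76, flops=400e6)
too_big = search_objective(acc=0.76, flops=800e6)
print(on_target, too_big)
```

Because the exponent is small in magnitude, doubling FLOPS lowers the objective by only a few percent, so a sufficiently more accurate model can still win despite exceeding the target.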
Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up with two steps:
STEP 1: we first <<FORMULA>> assuming twice more resources available, and do a small grid search of <<FORMULA>> based on Equation 2 and 3. In particular, we find the best values for EfficientNet-B0 are <<FORMULA>>, under constraint of <<FORMULA>>.
STEP 2: we then <<FORMULA>> as constants and scale up baseline network with different φ using Equation 3, to obtain EfficientNet-B1 to B7 (Details in Table 2).
Notably, it is possible to achieve even better performance by searching for <<FORMULA>> directly around a large model, but the search cost becomes prohibitively expensive on larger models. Our method solves this issue by doing the search only once on the small baseline network (step 1), and then using the same scaling coefficients for all other models (step 2).
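The two-step procedure can be sketched as a constrained grid search. The sketch below enumerates candidate triples whose FLOPS product stays near 2 and then picks the best one according to a caller-supplied scoring function. The grid step, the tolerance, and the `evaluate` callback are hypothetical; in practice `evaluate` would train and validate the scaled baseline, which is not shown here.

```python
import itertools


def candidate_coefficients(step=0.05, upper=2.0, target=2.0, tol=0.1):
    """List (alpha, beta, gamma) triples with alpha * beta^2 * gamma^2 near target."""
    n = int(round((upper - 1.0) / step)) + 1
    grid = [round(1.0 + i * step, 10) for i in range(n)]
    return [
        (a, b, g)
        for a, b, g in itertools.product(grid, repeat=3)
        if abs(a * b ** 2 * g ** 2 - target) <= tol
    ]


def grid_search(evaluate, candidates):
    """Step 1: return the candidate with the highest score from evaluate."""
    return max(candidates, key=evaluate)


candidates = candidate_coefficients()
print(len(candidates), "candidates satisfy the FLOPS constraint")

# Stand-in scorer for demonstration only; a real one would return
# validation accuracy of the baseline scaled with the given triple.
best = grid_search(lambda c: -abs(c[0] - c[1]), candidates)
print("selected:", best)
```

Step 2 then reuses the selected triple unchanged for every larger compound coefficient, which is what keeps the search cost confined to the small baseline.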
5. Experiments
In this section, we will first evaluate our scaling method on existing ConvNets and the new proposed EfficientNets.
5.1. Scaling Up MobileNets and ResNets
As a proof of concept, we first apply our scaling method to the widely-used MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Table 3 shows the ImageNet results of scaling them in different ways. Compared to other single-dimension scaling methods, our compound scaling method improves the accuracy on all these models, suggesting the effectiveness of our proposed scaling method for general existing ConvNets.
Table 2. EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our baseline EfficientNet-B0 using different compound coefficients φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped together for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x parameter reduction and up to 16x FLOPS reduction) compared with existing ConvNets.
<<TABLE>>
We omit ensemble and multi-crop models (Hu et al., 2018), or models pretrained on 3.5B Instagram images (Mahajan et al., 2018).
Table 3. Scaling Up MobileNets and ResNet.
<<TABLE>>
Table 5. EfficientNet Performance Results on Transfer Learning Datasets. Our scaled EfficientNet models achieve new state-of-the-art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average.
Table 5 column groups: comparison to best publicly available results; comparison to best reported results.
<<TABLE>>
Figure 6. Model Parameters vs. Transfer Learning Accuracy.
<<FIGURE>>
Training uses weight decay 1e-5 and an initial learning rate of 0.256 that decays by 0.97 every 2.4 epochs. We also use swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), a fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with survival probability 0.8. As it is commonly known that bigger models need more regularization, we linearly increase the dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.
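The learning-rate decay and dropout settings quoted in this section can be written out as small helper functions. This is a sketch under two assumptions that the text does not settle: the decay is applied as a staircase (one multiplication by 0.97 per completed 2.4-epoch interval), and the dropout ratio is interpolated evenly across the eight models B0 to B7. The values actually used in any released implementation may differ.

```python
import math


def learning_rate(epoch, initial_lr=0.256, decay=0.97, decay_epochs=2.4):
    """Staircase exponential decay (assumed): multiply by decay every decay_epochs."""
    return initial_lr * decay ** math.floor(epoch / decay_epochs)


def dropout_rate(model_index, first=0.2, last=0.5, num_models=8):
    """Linear interpolation of dropout from B0 (index 0) to B7 (index 7)."""
    return first + (last - first) * model_index / (num_models - 1)


for epoch in (0, 2.4, 24, 240):
    print(f"epoch {epoch}: lr = {learning_rate(epoch):.6f}")
for i in range(8):
    print(f"EfficientNet-B{i}: dropout = {dropout_rate(i):.3f}")
```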
Table 2 shows the performance of all EfficientNet models that are scaled from the same baseline EfficientNet-B0. Our EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, our EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4x smaller than the previous best GPipe (Huang et al., 2018).
All models are pretrained on ImageNet and fine-tuned on new datasets. Figure 1 and Figure 5 illustrate the parameters-accuracy and FLOPS-accuracy curves for representative ConvNets, where our scaled EfficientNet models achieve better accuracy with far fewer parameters and FLOPS than other ConvNets. Notably, our EfficientNet models are not only small, but also computationally cheaper. For example, our EfficientNet-B3 achieves higher accuracy than ResNeXt-101 (Xie et al., 2017) using 18x fewer FLOPS.
To validate the computational cost, we have also measured the inference latency for a few representative ConvNets on a real CPU as shown in Table 4, where we report average latency over 20 runs. Our EfficientNet-B1 runs 5.7x faster than the widely used ResNet-152 (He et al., 2016), while EfficientNet-B7 runs about 6.1x faster than GPipe (Huang et al., 2018), suggesting our EfficientNets are indeed fast on real hardware.
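A latency figure averaged over 20 runs can be obtained with a simple timing loop such as the one below. This is a generic sketch, not the authors' benchmarking code: the `average_latency` helper, the warm-up calls, and the stand-in workload are assumptions, and a real measurement would call a model's inference function instead.

```python
import statistics
import time


def average_latency(fn, runs=20, warmup=3):
    """Mean wall-clock time of fn() over `runs` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def stand_in_workload():
    # Placeholder for a single inference call.
    return sum(i * i for i in range(10_000))


latency = average_latency(stand_in_workload)
print(f"average latency over 20 runs: {latency * 1e3:.3f} ms")
```

A relative speedup between two models is then the ratio of their average latencies measured on the same machine.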
Figure 7. Class Activation Map (CAM) (Zhou et al., 2016) for Models with Different Scaling Methods. Our compound scaling method allows the scaled model (last column) to focus on more relevant regions with more object details. Model details are in Table 7.
<<FIGURE>>
Table 6. Transfer Learning Datasets.
<<TABLE>>
5.3. Transfer Learning Results for EfficientNet
We have also evaluated our EfficientNet on a list of commonly used transfer learning datasets, as shown in Table 6. We borrow the same training settings from (Kornblith et al., 2019) and (Huang et al., 2018), which take ImageNet pretrained checkpoints and fine-tune on new datasets.
Table 5 shows the transfer learning performance: (1) Compared to publicly available models, such as NASNet-A (Zoph et al., 2018) and Inception-v4 (Szegedy et al., 2017), our EfficientNet models achieve better accuracy with 4.7x average (up to 21x) parameter reduction. (2) Compared to state-of-the-art models, including DAT (Ngiam et al., 2018) that dynamically synthesizes training data and GPipe (Huang et al., 2018) that is trained with specialized pipeline parallelism, our EfficientNet models still surpass their accuracy in 5 out of 8 datasets, while using 9.6x fewer parameters.
Figure 6 compares the accuracy-parameters curve for a variety of models. In general, our EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet (He et al., 2016), DenseNet (Huang et al., 2017), Inception (Szegedy et al., 2017), and NASNet (Zoph et al., 2018).
6. Discussion
Figure 8. Scaling Up EfficientNet-B0 with Different Methods.
Table 7. Scaled Models Used in Figure 7.
<<FIGURE>>
To disentangle the contribution of our proposed scaling method from the EfficientNet architecture, Figure 8 compares the ImageNet performance of different scaling methods for the same EfficientNet-B0 baseline network. In general, all scaling methods improve accuracy at the cost of more FLOPS, but our compound scaling method improves accuracy further, by up to 2.5%, over single-dimension scaling methods, suggesting the importance of our proposed compound scaling.
In order to further understand why our compound scaling method is better than others, Figure 7 compares the class activation map (Zhou et al., 2016) for a few representative models with different scaling methods. All these models are scaled from the same baseline, and their statistics are shown in Table 7. Images are randomly picked from the ImageNet validation set. As shown in the figure, the model with compound scaling tends to focus on more relevant regions with more object details, while other models either lack object details or are unable to capture all objects in the images.
7. Conclusion
In this paper, we systematically study ConvNet scaling and identify that carefully balancing network width, depth, and resolution is an important but missing piece, preventing us from better accuracy and efficiency. To address this issue, we propose a simple and highly effective compound scaling method, which enables us to easily scale up a baseline ConvNet to any target resource constraints in a more principled way, while maintaining model efficiency. Powered by this compound scaling method, we demonstrate that a mobile-size EfficientNet model can be scaled up very effectively, surpassing state-of-the-art accuracy with an order of magnitude fewer parameters and FLOPS, on both ImageNet and five commonly used transfer learning datasets.
Acknowledgements
We thank Ruoming Pang, Vijay Vasudevan, Alok Aggarwal, Barret Zoph, Hongkun Yu, Xiaodan Song, Samy Bengio, Jeff Dean, and Google Brain team for their help.
References
Berg, T., Liu, J., Woo Lee, S., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. CVPR, pp. 2011–2018, 2014.
Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. ECCV, pp. 446–461, 2014.
Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. ICLR, 2019.
Chollet, F. Xception: Deep learning with depthwise separable convolutions. CVPR, pp. 1610–02357, 2017.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. CVPR, 2019.
Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., and Keutzer, K. Squeezenext: Hardware-aware neural network design. ECV Workshop at CVPR'18, 2018.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, pp. 770–778, 2016.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn. ICCV, pp. 2980–2988, 2017.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. Amc: Automl for model compression and acceleration on mobile devices. ECCV, 2018.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. CVPR, 2018.
Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. ECCV, pp. 646–661, 2016.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. CVPR, 2017.
Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1808.07233, 2018.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, pp. 448–456, 2015.
Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? CVPR, 2019.
Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorization, 2013.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172–6181, 2018.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. CVPR, 2017.
Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. ECCV, 2018.
Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. NeurIPS, 2018.
Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. ECCV, 2018.
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018.
Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. ICVGIP, pp. 722–729, 2008.
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. CVPR, pp. 3498–3505, 2012.
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. ICML, 2017.
Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018.
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. AAAI, 2019.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. CVPR, 2018.
Sharir, O. and Shashua, A. On the expressive power of overlapping architectures of deep learning. ICLR, 2018.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CVPR, pp. 1–9, 2015.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CVPR, pp. 2818–2826, 2016.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 4:12, 2017.
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. CVPR, pp. 5987–5995, 2017.
Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. Netadapt: Platform-aware neural network adaptation for mobile applications. ECCV, 2018.
Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016.
Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit of structural diversity in very deep networks. CVPR, pp. 3900–3908, 2017.
Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CVPR, 2018.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. CVPR, pp. 2921–2929, 2016.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell Ananya Ganesh Andrew McCallum College of Information and Computer Sciences University of Massachusetts Amherst
{strubell, aganesh,
Abstract
Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result, these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.
1 Introduction
Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.
<<TABLE>>
Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.
Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited to the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.
1 Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.
2 Methods
To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.
We measure energy use as follows. We train the models described in Section 2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.
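The GPU sampling step can be approximated with a small polling loop around the `nvidia-smi` command-line tool. The sketch below is not the authors' measurement script: the sampling interval, the helper names, and the choice of query flags are assumptions, and it requires an NVIDIA driver with `nvidia-smi` on the path. CPU sampling through RAPL is not covered.

```python
import subprocess
import time

# Query current power draw in watts, one line per GPU, without header or units.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=power.draw",
    "--format=csv,noheader,nounits",
]


def parse_power_draw(output):
    """Turn nvidia-smi output (one wattage per line) into a list of floats."""
    return [float(line) for line in output.splitlines() if line.strip()]


def sample_gpu_power(num_samples, interval_s=1.0):
    """Poll nvidia-smi num_samples times; returns a list of per-GPU readings."""
    samples = []
    for _ in range(num_samples):
        result = subprocess.run(
            NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
        )
        samples.append(parse_power_draw(result.stdout))
        time.sleep(interval_s)
    return samples


def average_power_per_gpu(samples):
    """Average each GPU's readings over all samples."""
    return [sum(readings) / len(readings) for readings in zip(*samples)]
```

On a machine with GPUs, `average_power_per_gpu(sample_gpu_power(60))` would give a one-minute average per device; GPUs that report a non-numeric value would need extra handling in `parse_power_draw`.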
<<TABLE>>
Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,4 China5 and Germany (Burger, 2019).
We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total power p_t required at a given instance during training is given by:
<<FORMULA>> (1)
The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:
<<FORMULA>> (2)
This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt hour of compute energy used.
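The energy and emissions estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training-time argument `hours`, the division by 1000 to convert watt-hours to kWh, and the example wattages are introduced here and are not taken from the text. The emission factor is passed in as `lbs_per_kwh` rather than hard-coded, and should be set from the EPA figure cited above.

```python
PUE = 1.58  # 2018 global data-center average quoted in the text (Ascierto, 2018)


def total_energy_kwh(hours, p_c, p_r, p_g, g, pue=PUE):
    """Estimate training energy in kWh.

    p_c, p_r and p_g are average power draws in watts for the CPU sockets,
    DRAM and a single GPU; g is the number of GPUs used to train.
    """
    watts = p_c + p_r + g * p_g          # combined CPU + DRAM + GPU draw
    return pue * hours * watts / 1000.0  # watt-hours -> kWh, scaled by PUE


def co2_lbs(energy_kwh, lbs_per_kwh):
    """Convert energy to estimated CO2 using a pounds-per-kWh factor."""
    return lbs_per_kwh * energy_kwh


# Hypothetical measurements, NOT values reported in the paper.
example_kwh = total_energy_kwh(hours=24, p_c=100.0, p_r=25.0, p_g=250.0, g=1)
print(f"{example_kwh:.2f} kWh")
```

Keeping the emission factor as an argument makes it straightforward to substitute a regional value when the energy mix differs from the U.S. average, as Table 2 suggests it can.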
2.1 Models
We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.
Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in Section 4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.
ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pretrained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).
BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).
GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.6
3 Related work
There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.
Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.
6Via the authors on Reddit.
7GPU lower bound computed using pre-emptible <<P100/V100>> U.S. resources priced at <<FORMULA>>, upper bound uses on-demand U.S. resources priced at <<FORMULA>>. We similarly use pre-emptible (<<FORMULA>>) and on-demand (<<FORMULA>>) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.
<<TABLE>>
Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).7 Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware.
4 Experimental results
4.1 Cost of training
Table 3 lists CO2 emissions and estimated cost of training the models described in Section 2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.
4.2 Cost of development: Case study
To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP.
Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs.8
The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model.9 We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.
<<TABLE>>
Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model (2) a single tune and (3) all models trained during R&D.
5 Conclusions
Authors should report training time and sensitivity to hyperparameters.
Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as retraining on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.
8We approximate cloud compute cost using P100 pricing.
9Based on average U.S. cost of electricity of $0.12/kWh.
Academic researchers need equitable access to computation resources.
Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute.
Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, and are instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.
While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for nonprofit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.
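The hardware estimate above follows from simple arithmetic on the figures already reported for the case study. The short Python check below reproduces it; pro-rating a fractional server (7.25 servers) is an assumption made here so that the result matches the quoted $145,000, and electricity is left out as in the text.

```python
gpu_days = 9998            # total GPU time reported for the case study
project_days = 172         # duration of the project in days
gpus_per_server = 8        # GPUs in the off-the-shelf server
server_cost_usd = 20_000   # approximate price of one server

# Average number of GPUs kept busy for the whole project (about 58).
avg_gpus = gpu_days / project_days

# Assumption: cost is pro-rated per GPU, so fractional servers are allowed.
servers_needed = 58 / gpus_per_server
hardware_cost_usd = servers_needed * server_cost_usd

print(f"average GPUs in use: {avg_gpus:.1f}")
print(f"hardware cost: ${hardware_cost_usd:,.0f}")
```

Rounding up to 8 whole servers instead would give $160,000, which does not change the comparison with on-demand cloud pricing in any material way.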
Researchers should prioritize computationally efficient hardware and algorithms.
We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,10 they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP.
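To make the contrast between grid search and random search concrete, the following standard-library Python sketch runs both on a toy objective. Everything here is illustrative: `validation_loss` is a hypothetical stand-in for training a model and measuring development-set loss, and the hyperparameter ranges are arbitrary. It does not use any particular tuning library.

```python
import math
import random


def validation_loss(lr, dropout):
    """Hypothetical stand-in for training a model and returning dev loss."""
    return (math.log10(lr) + 3.0) ** 2 + (dropout - 0.3) ** 2


def grid_search(lrs, dropouts):
    """Evaluate every combination: len(lrs) * len(dropouts) training jobs."""
    trials = [(validation_loss(lr, d), lr, d) for lr in lrs for d in dropouts]
    return min(trials), len(trials)


def random_search(n_trials, seed=0):
    """Sample each hyperparameter independently for a fixed job budget."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)   # log-uniform learning rate
        dropout = rng.uniform(0.0, 0.6)  # uniform dropout rate
        trials.append((validation_loss(lr, dropout), lr, dropout))
    return min(trials), len(trials)


best_grid, n_grid = grid_search([1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
                                [0.0, 0.15, 0.3, 0.45, 0.6])
best_rand, n_rand = random_search(n_trials=10)
print(f"grid:   {n_grid} jobs, best loss {best_grid[0]:.4f}")
print(f"random: {n_rand} jobs, best loss {best_rand[0]:.4f}")
```

The point of the sketch is the job count: the grid costs one training run per combination, while random search fixes the budget up front, which is what makes it attractive when each job consumes GPU-hours.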
Acknowledgements
We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan-Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
10For example, the Hyperopt Python library.
References
Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554.
Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.
Gary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.
Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.
Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 477–484.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Finite-Element Neural Networks for Solving Differential Equations
Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE
Abstract
The solution of partial differential equations (PDE) arises in a wide variety of engineering problems. Solutions to most practical problems use numerical analysis techniques such as finite-element or finite-difference methods. The drawbacks of these approaches include computational costs associated with the modeling of complex geometries. This paper proposes a finite-element neural network (FENN) obtained by embedding a finite-element model in a neural network architecture that enables fast and accurate solution of the forward problem. Results of applying the FENN to several simple electromagnetic forward and inverse problems are presented. Initial results indicate that the FENN performance as a forward model is comparable to that of the conventional finite-element method (FEM). The FENN can also be used in an iterative approach to solve inverse problems associated with the PDE. Results showing the ability of the FENN to solve the inverse problem given the measured signal are also presented. The parallel nature of the FENN also makes it an attractive solution for parallel implementation in hardware and software.
I. INTRODUCTION
Solutions of differential equations arise in a wide variety of engineering applications in electromagnetics, signal processing, computational fluid dynamics, etc. These equations are typically solved using either analytical or numerical methods. Analytical solution methods are however feasible only for simple geometries, which limits their applicability. In most practical problems with complex boundary conditions, numerical analysis methods are required in order to obtain a reasonable solution. An example is the solution of Maxwell's equations in electromagnetics. Solutions to Maxwell's equations are used in a variety of applications for calculating the interaction of electromagnetic (EM) fields with different types of media.
Very often, the solution to differential equations is necessary for solving the corresponding inverse problems. Inverse problems in general are ill-posed, lacking continuous dependence of the measurements on the input. This has resulted in the development of a variety of solution techniques ranging from simple calibration procedures to other direct (analytical) and iterative approaches [1]. Iterative methods typically employ a forward model that simulates the underlying physical process (Fig. 1) [2]. An initial estimate of the solution of the inverse problem (represented by <<FORMULA>> in Fig. 1) is applied to the forward model, resulting in the corresponding solution to the forward problem
<<ALGORITHM>>
<<FIGURE>>
Fig. 1. Iterative inversion method for solving inverse problems.
Manuscript received January 17, 2004; revised April 2, 2005.
Although finite-element methods (FEMs) [3], [4] are extremely popular for solving differential equations, their major drawback is computational complexity. This problem becomes more acute when three-dimensional (3-D) finite-element models are used in an iterative algorithm for solving the inverse problem. Recently, several authors have suggested the use of neural networks (MLP or RBF networks [5]) for solving differential equations [6]–[9].
In these techniques, a neural network is trained using a large database containing the input data and the solution of the differential equation. The neural network during generalization learns the mapping corresponding to the PDE. Alternatively, in [10], the solution to a differential equation is written as a constant term, and an adjustable term with parameters that need to be determined. A neural network is used to determine the optimal values of the parameters. This approach is applicable only to problems with regular boundaries. An extension of the approach to problems with irregular boundaries is given in [11]. Other neural network based differential equation solvers use multilayer perceptron networks or variations on the MLP to approximate the unknown function in a PDE [12]–[14]. A combination of the PDE and boundary conditions is used to construct an objective function that is minimized during the training process.
A major limitation of these approaches is that the network architecture is selected somewhat arbitrarily. A second drawback is that the performance of the neural networks depends on the data used in training and testing. As long as the test data is similar to the training data, the network can interpolate between the training data points to obtain a reasonable prediction. However, when the test signal is no longer similar to the training data, the network is forced to extrapolate and the performance degrades. One way around this difficulty is to ensure that the training database has a diverse set of signals. However, this is difficult to ensure in practice. Alternatively, we have to design neural networks that are capable of extrapolation. Extrapolation methods are discussed extensively in the literature [15]–[18], but the design of an extrapolation neural network involves several issues, particularly in ensuring that the error in the network prediction stays within reasonable bounds during the extrapolation procedure.
An ideal solution to this problem would be to combine the power of numerical models with the computational speed of neural networks, i.e., to embed a numerical model in a neural network structure. One such finite-element neural network (FENN) formulation has been reported by Takeuchi and Kosugi [19]. This approach, based on error minimization, derives the neural network using the energy functional resulting from the finite-element formulation. Other reports of FENN combinations are either similar to the Takeuchi method [20], [21] or use Hopfield neural networks to solve the forward problem [22], [23]. Kalkkuhl et al. [24] provide a description of a FEM-based approach to NARX modeling that may be interpreted both as a local model network, as well as a single layer feedforward network. A slightly different approach to merging numerical methods and neural networks is given in [25], where the finite-difference time domain (FDTD) method is cast in a neural network framework for the purpose of solving electromagnetic forward problems. The related problem of mesh generation in finite-element models has also been tackled using neural networks (for instance, [26]). Generally, these networks are designed to solve the forward problem, and must be modified to solve inverse problems.
This paper proposes a new approach that embeds a finite-element model commonly used in the solution of differential equations in a neural network. The network, called the FENN, can solve the forward problem and can also be used in an iterative algorithm to solve inverse problems. The primary advantage of this approach is that the FEM is represented in a parallel form. Thus, it has the potential to alleviate the computational cost associated with using the FEM in an iterative algorithm for solving inverse problems. More importantly, the FENN does not need any training, and the computation of the weights is a one-time process. The proposed approach is also different in that the neural network architecture developed can be used to solve the forward and inverse problems. The structure of the neural network is also simpler than those reported in the literature, making it easier to implement in parallel in both hardware and software.
The rest of this paper is organized as follows. Section II briefly describes the FEM, and derives the proposed FENN. In this paper, we focus on the problem of solving typical equations encountered in electromagnetic nondestructive evaluation (NDE). However, the same concepts can be easily applied to solve differential equations encountered in other fields. Sections III, IV and V present the application of the FENN to solving forward and inverse problems, along with initial results. A discussion of the advantages and disadvantages of the proposed FENN architecture is given in Section IV. Finally, Section V draws conclusions from the results and presents ideas for future work.
II. THE FENN
This section briefly describes the FEM and proposes its reformulation into a parallel neural network structure. Details about the FEM can be found in [3] and [4].
A. The FEM
Consider a typical boundary value problem with the governing differential equation
<<FORMULA>> (1)
where <<FORMULA>> is a differential operator, <<FORMULA>> is the applied source or forcing function, and <<FORMULA>> is the unknown quantity. This differential equation can be solved in conjunction with boundary conditions on the boundary <<FORMULA>> enclosing the domain <<FORMULA>>.
The variational formulation used in finite-element analysis determines the unknown <<FORMULA>> by minimizing the functional [3], [4]
<<FORMULA>> (2)
with respect to the trial function <<FORMULA>>.
The minimization procedure starts by dividing <<FORMULA>> into small subdomains called elements (Fig. 2) and representing <<FORMULA>> in each element by means of basis functions defined over the element
<<FORMULA>> (3)
where <<FORMULA>> is the unknown solution in element <<FORMULA>>, <<FORMULA>> is the basis function associated with node <<FORMULA>> in element <<FORMULA>>, <<FORMULA>> is the value of the unknown quantity at node <<FORMULA>> and <<FORMULA>> is the total number of nodes associated with element <<FORMULA>>. In general, the basis functions (also referred to as interpolation functions or shape functions) can be linear, quadratic, or of higher order. Typically, finite-element models use either linear or polynomial spline basis functions.
The functional within an element is expressed as
<<FORMULA>> (4)
By substituting (3) in (4), we obtain the discrete version of the functional within each element
<<FORMULA>> (5)
where <<FORMULA>> is the transpose of a matrix, <<FORMULA>> is the elemental matrix with elements
<<FORMULA>> (6)
and <<FORMULA>> is an <<FORMULA>> vector with elements
<<FORMULA>> (7)
Combining the values in (5) for each of the elements
<<FORMULA>> (8)
where <<FORMULA>> is the global matrix derived from the terms of the elemental matrices for different elements, and <<FORMULA>> is the total number of nodes. The global matrix, also called the stiffness matrix, is a sparse, banded matrix. Equation (8) is the discrete version of the functional and can be minimized with respect to the nodal parameters <<FORMULA>> by taking the derivative of <<FORMULA>> with respect to <<FORMULA>> and setting it equal to zero, which results in the matrix equation
<<FORMULA>> (9)
Boundary conditions for these problems are usually of two types: natural boundary conditions and essential boundary conditions. Essential boundary conditions (also referred to as Dirichlet boundary conditions) impose constraints on the value of the unknown <<FORMULA>> at several nodes. Natural boundary conditions (of which Neumann boundary conditions are a special case) impose constraints on the change in <<FORMULA>> across a boundary. Dirichlet boundary conditions are imposed on the functional minimization (9), by deleting the rows and columns of the matrix corresponding to the nodes on the Dirichlet boundary and modifying <<FORMULA>> in (9).
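The assembly of elemental matrices into a global system and the row and column deletion used for Dirichlet conditions can be illustrated with a small, self-contained Python sketch. The specific problem is an assumption chosen for brevity and is not taken from the paper: the 1-D equation -u'' = f on [0, 1] with u(0) = u(1) = 0, linear basis functions, and a uniform mesh. All function names are hypothetical.

```python
def assemble(n_elems, length=1.0, f=1.0):
    """Assemble the global matrix K and source vector b for -u'' = f."""
    n_nodes = n_elems + 1
    h = length / n_elems
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    b = [0.0] * n_nodes
    ke = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]  # elemental matrix
    be = [f * h / 2.0, f * h / 2.0]                  # elemental source vector
    for e in range(n_elems):
        nodes = (e, e + 1)                           # node-element connectivity
        for a in range(2):
            b[nodes[a]] += be[a]
            for c in range(2):
                K[nodes[a]][nodes[c]] += ke[a][c]
    return K, b


def apply_dirichlet(K, b, fixed):
    """Delete rows/columns of fixed nodes and fold their values into b."""
    free = [i for i in range(len(b)) if i not in fixed]
    K_red = [[K[i][j] for j in free] for i in free]
    b_red = [b[i] - sum(K[i][j] * v for j, v in fixed.items()) for i in free]
    return K_red, b_red, free


def solve(A, rhs):
    """Dense Gaussian elimination with partial pivoting (small systems only)."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fem_1d(n_elems=4):
    """Solve the model problem and return the nodal values."""
    K, b = assemble(n_elems)
    fixed = {0: 0.0, n_elems: 0.0}                   # essential boundary values
    K_red, b_red, free = apply_dirichlet(K, b, fixed)
    u = [0.0] * (n_elems + 1)
    for i, val in zip(free, solve(K_red, b_red)):
        u[i] = val
    return u
```

For this model problem the exact solution is u(x) = x(1 - x)/2, so `fem_1d(4)` should return 0.125 at the midpoint. A practical implementation would store the sparse, banded global matrix in a compressed format rather than as dense lists.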
Natural boundary conditions are applied in the FEM by adding an additional term to the functional. These boundary conditions are then incorporated into the functional and are satisfied automatically during the solution procedure. As an example, consider the natural boundary condition represented by the following equation [3] on
<<FORMULA>> (10)
where <<FORMULA>> represents the Neumann boundary, <<FORMULA>> is its outward normal unit vector, <<FORMULA>> is some constant, and <<FORMULA>>, <<FORMULA>>, and <<FORMULA>> are known parameters associated with the boundary. Assuming that the boundary <<FORMULA>> is made up of <<FORMULA>> segments, we can define boundary matrices <<FORMULA>> and <<FORMULA>> with elements
<<FORMULA>> (11)
where <<FORMULA>> are basis functions defined over segment <<FORMULA>> and <<FORMULA>> is the length of the segment. The elements of <<FORMULA>> are added to the elements of <<FORMULA>> that correspond to the nodes on the boundary. Similarly, the elements of <<FORMULA>> are added to the corresponding elements of <<FORMULA>>. The global matrix equation (9) is thus modified as follows before solving for <<FORMULA>>
<<FORMULA>> (12)
<<FIGURE>>
Fig. 3. FEM domain discretization using two elements and four nodes.
This process ensures that natural boundary conditions are implicitly and automatically satisfied during the FEM solution procedure.
B. The FENN
This section describes how the finite-element model can be converted into a parallel network form. We focus on solving typical inverse problems arising in electromagnetic NDE, but the basic idea is applicable to other areas as well. NDE inverse problems can be formulated as the problem of finding the material properties (such as the conductivity or the permeability) within the domain of the problem. Since the domain is discretized in the FEM by a large number of elements, the problem can be posed as one of finding the material properties in each of these elements. These properties are usually embedded in the differential operator <<FORMULA>> or equivalently, in the global matrix <<FORMULA>>. Thus, in order to be able to iteratively estimate these properties from the measurements, the material properties need to be separated out from <<FORMULA>>. This separation is easier to achieve at the element matrix level. For nodes <<FORMULA>> and <<FORMULA>> in element <<FORMULA>>
<<FORMULA>> (13)
where <<FORMULA>> is the parameter representing the material property in element <<FORMULA>> and <<FORMULA>> represents the differential operator at the element level without <<FORMULA>> embedded in it. Substituting (13) into the functional, we get
<<FORMULA>> (14)
<<FIGURE>>
Fig. 4. FENN.
If we define
<<FORMULA>> (15)
where
<<FORMULA>> (16)
<<FORMULA>> (17)
Equation (17) expresses the functional explicitly in terms of <<FORMULA>>. The assumption that <<FORMULA>> is constant within each element is implicit in this expression. This assumption is usually satisfied in problems in NDE where each element in the FEM mesh is defined within the confines of a domain, and at no time does a single element cross domain boundaries. Furthermore, each element is small enough that minor variations in <<FORMULA>> within an element may be ignored. Equation (17) can be easily converted into a parallel network form. The neural network comprises an input, output and hidden layer. In the general case with <<FORMULA>> elements and <<FORMULA>> nodes in the FEM mesh, the input layer with <<FORMULA>> network inputs takes the <<FORMULA>> values in each element as input. The hidden layer has <<FORMULA>> neurons arranged in <<FORMULA>> groups of <<FORMULA>> neurons, corresponding to the members of the global <<FORMULA>> matrix <<FORMULA>>. The output of each group of hidden layer neurons is the corresponding row vector of <<FORMULA>>. The weights from the input to the hidden layer are set to the appropriate values of <<FORMULA>>. Each neuron in the hidden layer acts as a summation unit (equivalent to a summation followed by a linear activation function [5]). The outputs of the hidden layer neurons are the elements of the global matrix <<FORMULA>> as given in (15). Each group of hidden neurons is connected to one output neuron (giving a total of <<FORMULA>> output neurons) by a set of weights <<FORMULA>>, with each element of <<FORMULA>> representing the nodal values. Note that the set of weights <<FORMULA>> between the first group of hidden neurons and the first output neuron are the same as the set of weights between the second group of hidden neurons and the second output neuron (as well as between successive groups of hidden neurons and the corresponding output neuron). Each output neuron is also a summation unit followed by a linear activation function, and the output of each neuron is equal to
<<FORMULA>> (18)
where the second part of (18) is obtained by using (15). As an example, the FENN architecture for a two-element, four-node FEM mesh (Fig. 3) is shown in Fig. 4. In this case, the FENN has two input neurons, 16 hidden layer neurons and four output neurons. The figure illustrates the grouping of the hidden layer neurons, as well as the similarity inherent in the weights that connect each group of hidden layer neurons to the corresponding output neuron. To simplify the figure, the weights between the network input and hidden layer neurons are depicted by means of vectors
(for , 2, 3, 4 and , 2), where the individual weight values <<FORMULA>> are defined as in (16).
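As an illustrative sketch of the mapping described above (not the authors' implementation), the following Python fragment evaluates a FENN forward pass for a hypothetical 1-D mesh with two elements and three nodes. The names `fenn_forward` and `linear_element_weights` are invented for this example, and the input-to-hidden weights are assumed to be the standard linear-element stiffness entries (1/l)[[1, -1], [-1, 1]] rather than the values defined in (16).

```python
# Hypothetical FENN forward pass: inputs are per-element material values,
# hidden neurons hold the entries of the global matrix K, and each output
# neuron sums one group of hidden neurons weighted by the nodal values phi.

def fenn_forward(alpha, w, phi):
    """Return (K, b): hidden-layer outputs K[i][j] and output-layer values b[i]."""
    n = len(phi)
    # Hidden layer: each neuron is a summation unit with linear activation.
    K = [[sum(alpha[e] * w[e][i][j] for e in range(len(alpha)))
          for j in range(n)] for i in range(n)]
    # Output layer: neuron i sums hidden group i weighted by the nodal values.
    b = [sum(K[i][j] * phi[j] for j in range(n)) for i in range(n)]
    return K, b


def linear_element_weights(node_x):
    """Assumed weights: 1-D linear-element stiffness (1/l) * [[1, -1], [-1, 1]]."""
    n = len(node_x)
    w = []
    for e in range(n - 1):
        length = node_x[e + 1] - node_x[e]
        we = [[0.0] * n for _ in range(n)]   # zero where a node is not in element e
        we[e][e] = we[e + 1][e + 1] = 1.0 / length
        we[e][e + 1] = we[e + 1][e] = -1.0 / length
        w.append(we)
    return w


x = [0.0, 0.5, 1.0]
w = linear_element_weights(x)
K, b = fenn_forward([1.0, 1.0], w, phi=x)   # phi = x satisfies Laplace's equation
```

With three nodes the hidden layer has 3 x 3 = 9 neurons, mirroring the 16 hidden neurons of the four-node example. Most of the weights in `w` are zero, which is the sparsity that the text later proposes to exploit.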
1) Boundary Conditions in the FENN: Note that the elements of <<FORMULA>> and in (11) do not depend on the material properties <<FORMULA>> and need to be added appropriately to the global matrix
and the source vector as shown in (12).
<<FIGURE>>
Fig. 5. Geometry of mesh for 1-D FEM.
<<FIGURE>>
Fig. 6. Flowchart (with example) for designing the FENN for a general PDE.
Equation (12) thus implies that natural boundary conditions can be applied in the FENN as bias inputs to the hidden layer neurons that are a part of the boundary, and the corresponding output neurons. Dirichlet boundary conditions are applied by clamping the corresponding weights between the hidden layer and output layer neurons. These weights will be referred to as the clamped weights, while the remaining weights will be referred to as the free weights. An example of these weights is presented later.
The FENN architecture was derived without consideration of the dimensionality of the problem at hand, and thus can be used for 1-, 2-, 3-, or higher dimensional problems. The number of nodes and elements in the FEM mesh dictates the number of neurons in the different layers. The weights between the input and hidden layer change depending on node-element connectivity information.
The major drawback of the FENN is the number of neurons and weights necessary. However, the memory requirements can be reduced considerably, since most of the weights between the input and hidden layer are zero. These weights, and the corresponding connections, can be discarded. Similarly, most of the elements of the
matrix are also zero (is a banded matrix). The corresponding neurons in the hidden layer can also be discarded, reducing memory and computation requirements considerably. Furthermore, the weights between each group of hidden layer neurons and the output layer are the same
. Weight-sharing approaches can be used here to further reduce the storage requirements.
C. A 1-D Example
Consider the 1-D equation
<<FORMULA>> (19)
on the boundary <<FORMULA>> defined by <<FORMULA>> and
are constants depending on the material and
is the applied source. Laplace's equation and Poisson's equation are special cases of this equation. The FENN formulation for this problem starts by discretizing the domain of interest with <<FORMULA>> elements and
nodes. In one dimension, each element is defined by two nodes (Fig. 5). Define basis functions <<FORMULA>> and <<FORMULA>> over each element <<FORMULA>> and let
is the value of <<FORMULA>> on node <<FORMULA>> in element <<FORMULA>> An example of the basis functions is shown in Fig. 5. For these basis functions, i.e.,
<<FORMULA>> (20)
the element matrices are given by [3]
<<FORMULA>> (21)
<<FORMULA>> (22)
Here, <<FORMULA>> is the length of element <<FORMULA>> The global matrix
is then constructed by selectively adding the element matrices based on the nodes that form an element. Specifically,
is a sparse tridiagonal matrix, and its nonzero elements are given by
<<FORMULA>> (23)
Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b) Problem description using symmetry considerations.
The network implementation of (23) can be derived as follows. If <<FORMULA>> and <<FORMULA>> values at each element are the inputs to the network,
<<FORMULA>> and <<FORMULA>> form the weights between the input and hidden layers. The network thus uses input neurons and
hidden neurons. The values of <<FORMULA>> at each of the nodes are assigned as weights between the hidden and output layers, and the source
is the desired output of this network (corresponding to the output neurons). Dirichlet boundary conditions on
are applied as explained earlier.
D. General Case
Fig. 6 shows a flowchart of the general scheme for converting a differential equation into the FENN structure. An example in two dimensions is also provided next to the flowchart. We start with the differential equation and the boundary conditions and formulate the FEM using the variational method. This involves discretizing the domain of interest with
elements and
nodes, selecting basis functions, writing the functional for each element and obtaining the element matrices and the source vector. The example presented uses the FEM mesh shown in Fig. 3, with
elements, and <<FORMULA>> nodes, and linear basis functions. The unknown solution to the differential equation
is represented by its values at each of the nodes in the finite-element mesh <<FORMULA>> The element matrices
are then separated into two parts, with one part dependent on the material properties <<FORMULA>> and
while the other is independent of them. The FENN is then designed to have input neurons, hidden neurons, and output neurons, where <<FORMULA>> is the number of material property parameters. In the example under consideration, <<FORMULA>>, since we have two
material property parameters ( and ). The first group of input neurons takes in the values while the second group takes in the
values in each element. The weights from the input to the hidden layer are set to the appropriate values of
<<FORMULA>> In the example, since nodes 1, 2, and 3 are part of element 1 (see Fig. 3), the weights from the first input node
to the first group of four neurons in the hidden layer are given by
<<FORMULA>> (24)
The last weight is zero since node 4 is not a part of element 1. Each group of hidden neurons is connected to one output neuron (giving a total of
output neurons) by a set of weights <<FORMULA>> with each element of representing the nodal values. The output of each neuron in the output layer is equal to
<<FIGURE>>
Fig. 8. Forward problem solutions for shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error between (a) and (b). The x-and y-axes show the nodes in the FEM discretization of the domain, and the z-axis in (c) shows the error at each of these nodes in volts.
III. FORWARD AND INVERSE PROBLEM FORMULATION USING FENN.
The FENN architecture and algorithm lend themselves to solving both the forward and inverse problems. The forward problem involves determining the weights given the material parameters and the applied source, while the inverse problem involves determining and . A gradient-based approach can be used to solve both these problems. Suppose we define the error at the output of the FENN as
where is the output of the FENN. Then, for a gradient-based approach, the gradients of the error with respect to the free hidden layer weights are given by
<<FORMULA>> (27)
Equation (27) can be used to solve the forward problem. Similarly, to solve the inverse problem, the gradients of the error with respect to and (input of the FENN) are necessary.
<<TABLE>>
TABLE I SUMMARY OF PERFORMANCE OF THE FENN ALGORITHM FOR VARIOUS PDES
For the forward problem, such an approach is equivalent to the iterative approaches used to solve for the unknown nodal values in the FEM [4].
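The iterative forward solve can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the error is taken as half the sum of squared differences between desired and actual outputs at the free nodes only, the global matrix is assembled from standard 1-D linear elements, and the names `assemble_k` and `solve_forward`, the step size and the iteration count are all hypothetical choices.

```python
def assemble_k(node_x, alpha):
    """Global matrix for a 1-D problem with linear elements (assumed form)."""
    n = len(node_x)
    K = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):
        g = alpha[e] / (node_x[e + 1] - node_x[e])
        K[e][e] += g
        K[e + 1][e + 1] += g
        K[e][e + 1] -= g
        K[e + 1][e] -= g
    return K


def solve_forward(K, b, phi, free, eta=0.005, iters=2000):
    """Gradient descent on E = 0.5 * sum_i (b_i - sum_j K_ij phi_j)^2 over free outputs.

    Entries of phi outside `free` are the clamped (Dirichlet) weights and never change.
    """
    phi = list(phi)
    n = len(phi)
    for _ in range(iters):
        # Residual between desired and actual output at each free output neuron.
        r = {i: b[i] - sum(K[i][j] * phi[j] for j in range(n)) for i in free}
        # Gradient of E with respect to each free hidden-to-output weight.
        grad = {j: -sum(r[i] * K[i][j] for i in free) for j in free}
        for j in free:
            phi[j] -= eta * grad[j]
    return phi


x = [0.0, 0.25, 0.5, 0.75, 1.0]
K = assemble_k(x, alpha=[1.0] * 4)
phi0 = [0.0, 0.0, 0.0, 0.0, 1.0]          # ends clamped to phi(0) = 0, phi(1) = 1
phi = solve_forward(K, b=[0.0] * 5, phi=phi0, free=[1, 2, 3])
```

For this source-free case the free weights converge to the linear profile phi = x. No matrix inversion is performed, only repeated evaluations of the network output and its gradient.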
IV. RESULTS
A. Forward Model Results
The FENN was tested using both 1- and 2-D versions of Poisson's equation
<<FORMULA>> (30)
where represents the material property, and is the applied source. For instance, in electromagnetics may represent the permittivity while represents the charge density.
As the first example, consider the following 2-D equation
<<FORMULA>> (31)
with boundary conditions and <<FORMULA>> on <<FORMULA>> (32)
on <<FORMULA>> (33)
This is the governing equation for the shielded microstrip transmission line problem shown in Fig. 7. The forward problem computes the electric potential due to the shielded microstrip shown in Fig. 7(a). The potentials are zero on the shielding conductor. Since the geometry is symmetric, we can solve the equivalent problem shown in Fig. 7(b), by applying the homogeneous Neumann condition on the plane of symmetry. The inner conductor (microstrip) is held at a constant potential of volts. Finally, we also assume that the material inside the shielding conductor has a permittivity , where K is a constant. The permittivity in this case corresponds to the material property . Specifically, and . The homogeneous Neumann boundary condition is equivalent to setting . The microstrip and the shielding conductor correspond to the Dirichlet boundary, with <<FORMULA>> on the microstrip and
on the outer boundary [Fig. 7(b)]. Finally, there is no source term in this example (the source term would correspond to a charge distribution in the domain of interest), i.e., <<FORMULA>> In this example, we assume that volts. Further, we assume that the domain of interest is
The solution to the forward problem is presented in Fig. 8, with the FEM solution using 11 nodes in each direction shown in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b). These figures show contours of constant potential. The error between the FEM and FENN solutions is presented in Fig. 8(c). As seen from the figure, the FENN matches the FEM solution accurately, with the peak error at any node on the order of
Several other examples were also used to test the FENN and the results are summarized in Table I. Column 1 shows the PDE used to evaluate the FENN performance, while column 2 shows the boundary conditions used. The analytic solution to the problem is indicated in Column 3. The FENN structure and the number of iterations for convergence using a gradient descent approach are indicated in Columns 4 and 5, respectively. The FENN structure, as explained earlier, has an
are the number of elements and nodes in the FEM mesh, respectively, and
is the number of hidden neurons, and corresponds to the number of nonzero elements in the FEM global matrix
Finally, Columns 6 and 7 present the sum-squared error (SSE) and the maximum error in the solution, respectively, where the errors are computed with respect to the analytical solution. These results indicate that the FENN is capable of accurately determining the potential
One advantage of the FENN approach is that the computation of the input-hidden layer weights is a one-time process, as long as the differential equation does not change. The only changes necessary to solve the different problems are changes in the input
and the desired output.
B. Inverse Model Results
The FENN was also used to solve several simple inverse problems based on (30). In all cases, the objective was to determine
<<FIGURE>>
Fig. 9. FENN inversion results for Poisson's equation with initial solutions (a)
the value of <<FORMULA>> and <<FORMULA>> for given values of <<FORMULA>> and
The <<FORMULA>> first example is a 1-D problem that involves determining
given and <<FORMULA>>
for the differential equation
<<FORMULA>> (34)
with boundary conditions <<FORMULA>> and <<FORMULA>>. The analytical solution to this inverse problem is
<<FORMULA>> and
<<FORMULA>> (35)
As seen from (35), the problem has an infinite number of solutions and we expect the solution procedure to converge to one of these solutions depending on the initial value.
Fig. 9(a) and (b) shows two solutions to this inverse problem for two different initializations (shown using triangles). In both cases, the FENN solution (in stars) is seen to match the analytical solution (squares). The SSE in both cases was on the order of
<<FORMULA>>
In order to obtain a unique solution, we need to constrain the value of at the boundary as well. Consider the same differential equation as (34), but with and specified as follows:
and
(36)
The analytical solution for this equation is . To solve this problem, we set and clamp the value of at and as follows: , . The results of the constrained inversion obtained using 11 nodes and 10 elements in the corresponding finite-element mesh are shown in Fig. 10. Fig. 10(a) shows the comparison between the analytical solution (solid line with squares) and the FENN result (solid line with stars). The initial value of is shown in the figure as a dashed line. Fig. 10(b) shows the comparison between the actual and desired forcing function at the FENN
output. This result indicates that the SSE in the forcing function, as well as the SSE in the inversion result, is fairly large (0.0148 and 0.0197, respectively). The reason for this was traced back to the mesh discretization. Fig. 11 shows the SSE in the output of the FENN and the SSE in the inverse problem solution as a function of FEM discretization. It is seen that increasing the discretization significantly improves the solution. Similar results were observed for other problems.
V. DISCUSSION AND CONCLUSION
The FENN is closely related to the finite-element model used to solve differential equations. The FENN architecture has a weight structure that allows both the forward and inverse problems to be solved using simple gradient-based algorithms. Initial results indicate that the proposed FENN algorithm is capable of accurately solving both the forward and inverse problems. In addition, the forward problem solution from the FENN is seen to exactly match the FEM solution, indicating that the FENN represents the finite-element model exactly in a parallel configuration.
The major advantage of the FENN is that it represents the finite-element model in a parallel form, enabling parallel implementation in either hardware or software. Further, computing gradients in the FENN is very simple. This is an advantage in solving both forward and inverse problems using gradient-based methods. The gradients can also be computed in parallel and the lack of nonlinearities in the neuron activation functions makes the computation of gradients simpler. A major advantage of this approach for solving inverse problems is that it avoids inverting the global matrix in each iteration. The FENN also does not require any training, since most of its weights can be computed in advance and stored. The weights depend on the governing differential equation and its associated boundary conditions, and as long as these two factors do not change, the weights do not change. This is especially an advantage in solving inverse problems in electromagnetic NDE. This approach also reduces the computational effort associated with the network.
Future work will concentrate on applying the FENN to 3-D electromagnetic NDE problems. The robustness of the approach will also be tested, since the ability of these approaches to invert practical noisy measurements is important. Furthermore, the use of better optimization algorithms, like conjugate gradient methods, is expected to improve the solution speed. In addition, parallel implementation of the FENN in both hardware and software is under investigation. The approach described in this paper is very general in that it can be applied to a variety of inverse problems in fields other than electromagnetic NDE. Some of these other applications will also be investigated to show the general nature of the proposed method.
REFERENCES
[1] L. Udpa and S. S. Udpa, "Application of signal processing and pattern recognition techniques to inverse problems in NDE," Int. J. Appl. Electromagn. Mechan., vol. 8, pp. 99–117, 1997.
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. Sacks, "Iterative algorithms for electromagnetic NDE signal inversion," in ENDE '97, Reggio Calabria, Italy, Sep. 14–16, 1997.
[3] J. Jin, The Finite Element Method in Electromagnetics. New York: Wiley, 1993.
[4] P. Zhou, Numerical Analysis of Electromagnetic Fields. Berlin, Germany: Springer-Verlag, 1993.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1994.
[6] C. A. Jensen et al., "Inversion of feedforward neural networks: algorithms and applications," Proc. IEEE, vol. 87, no. 9, pp. 1536–1549, 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa, "Neural network algorithm for electromagnetic NDE signal inversion," in ENDE 2000, Budapest, Hungary, Jun. 2000.
[8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr., and A. P. Ewing, "Automation of SQUID nondestructive evaluation of steel plates by neural networks," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3475–3478, 1999.
[9] W. Qing, S. Xueqin, Y. Qingxin, and Y. Weili, "Using wavelet neural networks for the optimal design of electromagnetic devices," IEEE Trans. Magn., vol. 33, no. 2, pp. 1928–1930, 1997.
[10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, "Artificial neural networks for solving ordinary and partial differential equations," IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 987–1000, 1998.
[11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, "Neural-network methods for boundary value problems with irregular boundaries," IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1041–1049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez, "Neural network differential equation and plasma equilibrium solver," Phys. Rev. Lett., vol. 75, no. 20, pp. 3594–3597, 1995.
[13] M. W. M. G. Dissanayake and N. Phan-Thien, "Neural-network-based approximations for solving partial differential equations," Commun. Numer. Meth. Eng., vol. 10, pp. 195–201, 1994.
[14] R. Masuoka, "Neural networks learning differential data," IEICE Trans. Inform. Syst., vol. E83-D, no. 6, pp. 1291–1300, 2000.
[15] D. C. Youla, "Generalized image restoration by the method of alternating orthogonal projections," IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 694–702, 1978.
[16] D. C. Youla and H. Webb, "Image restoration by the method of convex projections: part I - theory," IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 81–94, 1982.
[17] A. Lent and H. Tuy, "An iterative method for the extrapolation of band-limited functions," J. Math. Analysis and Applicat., vol. 83, pp. 554–565, 1981.
[18] W. Chen, "A new extrapolation algorithm for band-limited signals using the regularization method," IEEE Trans. Signal Process., vol. 41, no. 3, pp. 1048–1060, 1993.
[19] J. Takeuchi and Y. Kosugi, "Neural network representation of the finite element method," Neural Netw., vol. 7, no. 2, pp. 389–395, 1994.
[20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady, "Artificial neural network application for material evaluation by electromagnetic methods," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4027–4032.
[21] G. Xu, G. Littlefair, R. Penson, and R. Callan, "Application of FE-based neural networks to dynamic problems," in Proc. Int. Conf. Neural Information Processing, vol. 3, 1999, pp. 1039–1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu, "Finite element analysis-based Hopfield neural network model for solving nonlinear electromagnetic field problems," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4399–4403.
[23] H. Lee and I. S. Kang, "Neural algorithm for solving differential equations," J. Computat. Phys., vol. 91, pp. 110–131, 1990.
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, "FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control," IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885–897, 1999.
[25] R. K. Mishra and P. S. Hall, "NFDTD concept," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484–490, 2005.
[26] D. G. Triantafyllidis and D. P. Labridis, "A finite-element mesh generator based on growing neural networks," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482–1496, 2002.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Floating Point Operations in Matrix-Vector Calculus
(Version 1.3)
Raphael Hunger
Technical Report 2007
Technische Universität München, Associate Institute for Signal Processing
Univ.-Prof. Dr.-Ing. Wolfgang Utschick
History
Version 1.00: October 2005
- Initial version
Version 1.01: 2006
- Rewrite of sesquilinear form with a reduced amount of FLOPs
- Several typos fixed concerning the number of FLOPs required for the Cholesky decomposition
Version 1.2: November 2006
- Conditions for the existence of the standard <<FORMULA>> Cholesky decomposition specified (positive definiteness)
- Outer product version of <<FORMULA>> Cholesky decomposition removed
- FLOPs required in Gaxpy version of <<FORMULA>> Cholesky decomposition updated
- <<FORMULA>> Cholesky decomposition added
- Matrix-matrix product LC added with L triangular
- Matrix-matrix product <<FORMULA>>C added with L triangular and <<FORMULA>> not known a priori
- Inverse L_1^{-1} of a lower triangular matrix with ones on the main diagonal added
Version 1.3: September 2007
- First globally accessible document version
ToDo: (unknown when)
- QR-Decomposition
- LR-Decomposition
Please report any bug and suggestion to hunger@tum.de
Contents
1. Introduction
2. Flop Counting
2.1 Matrix Products
2.1.1 Scalar-Vector Multiplication αa
2.1.2 Scalar-Matrix Multiplication αA
2.1.3 Inner Product a^H b of Two Vectors
2.1.4 Outer Product ac^H of Two Vectors
2.1.5 Matrix-Vector Product Ab
2.1.6 Matrix-Matrix Product AC
2.1.7 Matrix Diagonal Matrix Product AD
2.1.8 Matrix-Matrix Product LD
2.1.9 Matrix-Matrix Product L_1 D
2.1.10 Matrix-Matrix Product LC with L Lower Triangular
2.1.11 Gram A^H A of A
2.1.12 Squared Frobenius Norm ||A||_F^2 = tr(A^H A)
2.1.13 Sesquilinear Form c^H A b
2.1.14 Hermitian Form a^H R a
2.1.15 Gram L^H L of a Lower Triangular Matrix L
2.2 Decompositions
2.2.1 Cholesky Decomposition R = <<FORMULA>> (Gaxpy Version)
2.2.2 Cholesky Decomposition R = L_1 D L_1^H
2.3 Inverses of Matrices
2.3.1 Inverse <<FORMULA>> of a Lower Triangular Matrix L
2.3.2 Inverse L_1^{-1} of a Lower Triangular Matrix L_1 with Ones on the Main Diagonal
2.3.3 Inverse R^{-1} of a Positive Definite Matrix R
2.4 Solving Systems of Equations
2.4.1 Product <<FORMULA>>C with <<FORMULA>> not known a priori
3. Overview
Appendix
Bibliography
1. Introduction
For the design of efficient and low-complexity algorithms in many signal-processing tasks, a detailed analysis of the required number of floating-point operations (FLOPs) is often inevitable. Most frequently, matrix operations are involved, such as matrix-matrix products and inverses of matrices. Structures like Hermiteness or triangularity, for example, can be exploited to reduce the number of needed FLOPs and will be discussed here. In this technical report, we derive expressions for the number of multiplications and summations that a majority of signal processing algorithms in mobile communications bring with them.
Acknowledgments:
The author would like to thank Dipl.-Ing. David A. Schmidt and Dipl.-Ing. Guido Dietl for the fruitful discussions on this topic.
2. Flop Counting
In this chapter, we offer expressions for the number of complex multiplications and summations required for several matrix-vector operations. A floating-point operation (FLOP) is assumed to be either a complex multiplication or a complex summation here, despite the fact that a complex multiplication requires 4 real multiplications and 2 real summations whereas a complex summation consists of only 2 real summations, making a multiplication more expensive than a summation. However, we count each operation as one FLOP.
Throughout this report, we assume <<FORMULA>> to be a scalar, the vectors <<FORMULA>>, and <<FORMULA>> to have dimension N, N, and M, respectively. The matrices <<FORMULA>>, and <<FORMULA>> are assumed to have no special structure, whereas <<FORMULA>> is Hermitian and <<FORMULA>> is diagonal. L is a lower triangular <<FORMULA>> matrix, and e_n denotes the unit vector with a 1 in the n-th row and zeros elsewhere. Its dimensionality is chosen such that the respective matrix-vector product exists. Finally, [A]_{a,b} denotes the element in the a-th row and b-th column of A, <<FORMULA>> selects the submatrix of A consisting of rows a to b and columns c to d. 0_{a×b} is the a × b zero matrix. Transposition, Hermitian transposition, conjugate, and real-part operator are denoted by <<FORMULA>>, and <<FORMULA>>, respectively, and require no FLOP.
2.1 Matrix Products
Frequently arising matrix products and the amount of FLOPs required for their computation will be discussed in this section.
2.1.1 Scalar-Vector Multiplication <<FORMULA>>
A simple multiplication αa of a vector a with a scalar <<FORMULA>> requires N multiplications and no summation.
2.1.2 Scalar-Matrix Multiplication <<FORMULA>>
Extending the result from Subsection 2.1.1 to a scalar matrix multiplication <<FORMULA>> requires NM multiplications and again no summation.
2.1.3 Inner Product a^H b of Two Vectors
An inner product a^H b requires N multiplications and <<FORMULA>> summations, i.e., <<FORMULA>> FLOPs.
2.1.4 Outer Product <<FORMULA>> of Two Vectors
An outer product ac^H requires NM multiplications and no summation.
2.1.5 Matrix-Vector Product <<FORMULA>>
Computing Ab corresponds to applying the inner product rule <<FORMULA>> from Subsection 2.1.3 M times. Obviously, <<FORMULA>> and <<FORMULA>> represents the i-th row of A. Hence, its computation costs MN multiplications and <<FORMULA>> summations, i.e., <<FORMULA>> FLOPs.
2.1.6 Matrix-Matrix Product <<FORMULA>>
Repeated application of the matrix-vector rule Ac_i from Subsection 2.1.5 with c_i being the i-th column of C yields the overall matrix-matrix product AC. Since <<FORMULA>>, the matrix-matrix product has the L-fold complexity of the matrix-vector product. Thus, it needs MNL multiplications and <<FORMULA>> summations, altogether <<FORMULA>> FLOPs.
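The chain of rules in Subsections 2.1.3 to 2.1.6 can be checked numerically. The sketch below is an illustration only, with hypothetical function names, and it assumes that A has M rows and N columns and C has N rows and L columns. It multiplies two small matrices with explicit loops, counts every multiplication and summation, and compares the result with the counts obtained by repeating the inner-product rule.

```python
def inner_product_flops(n):
    """N multiplications and N - 1 summations for an inner product of length N."""
    return n, n - 1


def matvec_flops(m, n):
    """A (M x N) times b: M inner products of length N."""
    mults, sums = inner_product_flops(n)
    return m * mults, m * sums


def matmat_flops(m, n, l):
    """A (M x N) times C (N x L): L matrix-vector products."""
    mults, sums = matvec_flops(m, n)
    return l * mults, l * sums


def counted_matmat(A, C):
    """Multiply two nested-list matrices while counting every operation."""
    m, n, l = len(A), len(C), len(C[0])
    mults = sums = 0
    P = [[0] * l for _ in range(m)]
    for i in range(m):
        for j in range(l):
            acc = A[i][0] * C[0][j]          # first term: one product, no sum
            mults += 1
            for k in range(1, n):
                acc += A[i][k] * C[k][j]     # each further term: one product, one sum
                mults += 1
                sums += 1
            P[i][j] = acc
    return P, mults, sums


A = [[1, 2, 3], [4, 5, 6]]                        # M = 2, N = 3
C = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]    # N = 3, L = 4
P, mults, sums = counted_matmat(A, C)
```

For this example the loop performs 24 multiplications and 16 summations, which agrees with MNL and ML(N - 1).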
2.1.7 Matrix Diagonal Matrix Product AD
If the right hand side matrix D of the matrix product AD is diagonal, the computational load reduces to M multiplications for each of the N columns of A, since the n-th column of A is scaled by the n-th main diagonal element of D. Thus, MN multiplications in total are required for the computation of AD, no summations are needed.
2.1.8 Matrix-Matrix Product LD
When multiplying a lower triangular matrix L by a diagonal matrix D, column n of the matrix product requires <<FORMULA>> multiplications and no summations. With n = 1, ..., N, we get
<<FORMULA>> multiplications.
2.1.9 Matrix-Matrix Product L_1 D
When multiplying a lower triangular matrix L_1 with ones on the main diagonal by a diagonal matrix D, column n of the matrix product requires <<FORMULA>> multiplications and no summations. With n = 1, ..., N, we get
<<FORMULA>> multiplications.
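The two column-wise counts for LD and L_1 D can be reproduced by a direct tally. The sketch below is an illustration with hypothetical function names. It assumes that column n of a lower triangular N x N matrix holds N - n + 1 possibly nonzero entries, and that for a unit diagonal the diagonal entry of the product is simply the n-th diagonal element of D and costs no multiplication. The closed forms N(N + 1)/2 and N(N - 1)/2 in the comments are derived here from that assumption.

```python
def ld_mults(n):
    """Multiplications for L*D: column k (1-based) scales its N - k + 1 entries."""
    return sum(n - k + 1 for k in range(1, n + 1))


def l1d_mults(n):
    """Multiplications for L1*D: the unit diagonal entry of column k needs no product."""
    return sum(n - k for k in range(1, n + 1))


# Tallies for a few sizes; they equal N(N + 1)/2 and N(N - 1)/2 respectively.
table = [(n, ld_mults(n), l1d_mults(n)) for n in range(1, 6)]
```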
2.1.10 Matrix-Matrix Product LC with L Lower Triangular
Computing the product of a lower triangular matrix <<FORMULA>> and <<FORMULA>> is done column-wise. The n-th element in each column of LC requires n multiplications and <<FORMULA>> summations,
so the complete column needs <<FORMULA>> multiplications and <<FORMULA>> summations. The complete matrix-matrix product is obtained from computing L columns. We have
<<FORMULA>> multiplications and <<FORMULA>> summations, yielding a total amount of <<FORMULA>> FLOPs.
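A small counting experiment illustrates the argument above. The function name and the test matrices are hypothetical, and L is assumed to be N x N with C having N rows and a number of columns written `l` in the code. Only the entries on and below the diagonal of L enter the loop, so the n-th element of a column costs n multiplications and n - 1 summations.

```python
def counted_lower_times(Lmat, C):
    """Compute L*C for lower triangular L, skipping the structural zeros."""
    n, l = len(Lmat), len(C[0])
    mults = sums = 0
    P = [[0] * l for _ in range(n)]
    for col in range(l):
        for row in range(n):
            acc = Lmat[row][0] * C[0][col]
            mults += 1
            for k in range(1, row + 1):      # stop at the diagonal of L
                acc += Lmat[row][k] * C[k][col]
                mults += 1
                sums += 1
            P[row][col] = acc
    return P, mults, sums


Lmat = [[1, 0, 0], [2, 3, 0], [4, 5, 6]]
C = [[1, 2], [3, 4], [5, 6]]
P, mults, sums = counted_lower_times(Lmat, C)
```

For N = 3 and two columns the tally is 12 multiplications and 6 summations, 18 FLOPs in total, which equals N^2 times the number of columns.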
2.1.11 Gram <<FORMULA>> of A
In contrast to the general matrix product from Subsection 2.1.6, we can make use of the Hermitian structure of the product <<FORMULA>>. Hence, the strictly lower triangular part of <<FORMULA>> need not be computed, since it corresponds to the Hermitian of the strictly upper triangular part. For this reason, we have to compute only the N main diagonal entries of A^H A and the (N^2 - N)/2 upper off-diagonal elements, so only <<FORMULA>> different entries have to be evaluated. Each element requires an inner product step from Subsection 2.1.3 costing M multiplications and <<FORMULA>> summations. Therefore,
<<FORMULA>> multiplications and <<FORMULA>> summations are needed, making up a total amount of <<FORMULA>> FLOPs.
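The saving from the Hermitian structure can be made concrete with a short sketch. It is an illustration with a hypothetical function name, and it assumes A has M rows and N columns. Only entries on or above the main diagonal are computed by inner products, and the lower part is filled in by conjugation, which the report does not count as a FLOP.

```python
def counted_gram(A):
    """Compute G = A^H A, evaluating only the upper triangle by inner products."""
    m, n = len(A), len(A[0])
    mults = sums = 0
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            acc = A[0][i].conjugate() * A[0][j]
            mults += 1
            for k in range(1, m):
                acc += A[k][i].conjugate() * A[k][j]
                mults += 1
                sums += 1
            G[i][j] = acc
            if j > i:
                G[j][i] = acc.conjugate()    # mirrored entry, no FLOP counted
    return G, mults, sums


A = [[1, 1j], [2, 3]]                        # M = 2, N = 2
G, mults, sums = counted_gram(A)
```

Here N(N + 1)/2 = 3 entries are evaluated, each with M = 2 multiplications and M - 1 = 1 summation, giving 6 multiplications and 3 summations.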
2.1.12 Squared Frobenius Norm <<FORMULA>>
The squared Hilbert-Schmidt norm <<FORMULA>> follows from summing up the MN squared entries from A. We therefore have MN multiplications and <<FORMULA>> summations, yielding a total of <<FORMULA>> FLOPs.
2.1.13 Sesquilinear Form <<FORMULA>>
The sesquilinear form c^H A b should be evaluated by computing the matrix-vector product Ab in a first step and then multiplying with the row vector c^H from the left hand side. The matrix-vector product requires MN multiplications and <<FORMULA>> summations, whereas the inner product needs M multiplications and <<FORMULA>> summations. Altogether, <<FORMULA>> multiplications and <<FORMULA>> summations have to be computed for the sesquilinear form <<FORMULA>>, yielding a total number of <<FORMULA>> FLOPs.
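The two-step evaluation can be tallied in a few lines. The function name is hypothetical and A is assumed to be M x N. The totals in the comments follow from adding the matrix-vector count of Subsection 2.1.5 to the inner-product count of Subsection 2.1.3, so they are a derivation made here rather than a quotation of the elided expressions.

```python
def sesquilinear_flops(m, n):
    """FLOPs of c^H (A b) with A of size M x N, following the two steps in the text."""
    matvec = (m * n, m * (n - 1))    # A b: M inner products of length N
    inner = (m, m - 1)               # c^H (A b): one inner product of length M
    mults = matvec[0] + inner[0]     # M(N + 1)
    sums = matvec[1] + inner[1]      # MN - 1
    return mults, sums, mults + sums


counts = sesquilinear_flops(3, 4)
```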
2.1.14 Hermitian Form a^H R a
With the Hermitian matrix <<FORMULA>>, the product <<FORMULA>> can be expressed as
<<FORMULA>> (2.1)
with <<FORMULA>>, and <<FORMULA>>. The first sum accumulates the weighted main diagonal entries and requires 2N multiplications and <<FORMULA>> summations. The second part of (2.1) accumulates all weighted off-diagonal entries from A. The last two summations sum up 2 terms2. Consequently, the second part of (2.1) requires <<FORMULA>> summations and <<FORMULA>> products. Finally, the two parts have to be added accounting for an additional summation and yielding an overall amount of <<FORMULA>> products and
<<FORMULA>> summations, corresponding to <<FORMULA>> FLOPs.
Footnotes to Subsection 2.1.14:
We made use of (A1) in the Appendix for the computation of the last sum accumulating subsequent integers.
We do not exploit the fact that only real-valued summands are accumulated as we only account for complex FLOPs.
The scaling with the factor 2 does not require a FLOP, as it can be implemented by a simple bit shift.
Clearly, if <<FORMULA>>, we have to subtract one summation from the calculation since no off-diagonal entries exist.
2.1.15 Gram <<FORMULA>> of a Lower Triangular Matrix L
During the computation of the inverse of a positive definite matrix, the Gram matrix of a lower triangular matrix occurs when Cholesky decomposition is applied. Again, we make use of the Hermitian structure of the Gram <<FORMULA>>, so only the main diagonal entries and the upper right off-diagonal entries of the product have to be evaluated. The a-th main-diagonal entry can be expressed as
<<FORMULA>> (2.2)
with <<FORMULA>>, requiring <<FORMULA>> multiplications and <<FORMULA>> summations. Hence, all main diagonal elements need <<FORMULA>> multiplications and
<<FORMULA>> summations. The upper right off-diagonal entry <<FORMULA>> in row a and column b with <<FORMULA>> reads as
<<FORMULA>>, (2.3)
again accounting for <<FORMULA>> multiplications and <<FORMULA>> summations. These two expressions have to be summed up over all <<FORMULA>> and <<FORMULA>>, and for the number of multiplications, we find
<<FORMULA>> (2.4)
Again, we made use of (A1) for the sum of subsequent integers and (A2) for the sum of subsequent squared integers. For the number of summations, we evaluate
<<FORMULA>>
Computing all necessary elements of the Gram L^H L thereby requires <<FORMULA>> multiplications and <<FORMULA>> summations. Altogether, <<FORMULA>> FLOPs result. The same result of course holds for the Gram of two upper triangular matrices.
2.2 Decompositions
2.2.1 Cholesky Decomposition <<FORMULA>> (Gaxpy Version)
Instead of computing the inverse of a positive definite matrix R directly, it is more efficient to start with the Cholesky decomposition <<FORMULA>> and then invert the lower triangular matrix L and compute its Gram. In this section, we count the number of FLOPs necessary for the Cholesky decomposition.
The implementation of the Generalized Ax plus y (Gaxpy) version of the Cholesky decomposition, which overwrites the lower triangular part of the positive definite matrix R, is listed in Algorithm 2.1, see [1]. Note that R needs to be positive definite for the <<FORMULA>> decomposition!
Algorithm 2.1 Algorithm for the Gaxpy version of the Cholesky decomposition.
<<ALGORITHM>>
The computation of the first column of L in Line 1 of Algorithm 2.1 requires <<FORMULA>> multiplications, a single square-root operation, and no summations. Column <<FORMULA>> takes a matrix-vector product of dimension <<FORMULA>> which is subtracted from another <<FORMULA>> dimensional vector involving <<FORMULA>> summations, see Line 3. Finally, <<FORMULA>> multiplications and a single square-root operation are necessary in Line 4. In short, row n with <<FORMULA>> needs <<FORMULA>> multiplications, <<FORMULA>> summations (see Subsection 2.1.5), and one square root operation, which we classify as an additional FLOP. Summing up the multiplications for rows <<FORMULA>>, we obtain
<<FORMULA>>
The number of summations for rows <<FORMULA>> reads as
<<FORMULA>> (2.6)
<<FORMULA>> (2.7)
The first element need not be computed twice, since the result of the division is the square root of the denominator.
Again, the first element need not be computed twice, since the result of the division is the square root of the denominator.
Algorithm 2.2 Algorithm for the Cholesky decomposition <<FORMULA>>
<<ALGORITHM>>
and finally, <<FORMULA>> square-root operations are needed for the <<FORMULA>> rows. Including the <<FORMULA>> multiplications for column <<FORMULA>> and the additional square root operation, <<FORMULA>> multiplications, <<FORMULA>> summations, and N square-root operations occur,
<<FORMULA>> FLOPs in total.
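As a cross-check of the counting argument, the following sketch instruments a column-wise (Gaxpy-style) Cholesky factorization for a real symmetric positive definite input and tallies multiplications, summations, and square roots. Divisions are counted as multiplications, as in the text. The function name `cholesky_gaxpy_counted` and the loop layout are assumptions made for illustration; they are not a transcription of Algorithm 2.1.

```python
import math


def cholesky_gaxpy_counted(R):
    """Return (L, counts) with R = L L^T for a real symmetric positive definite R."""
    N = len(R)
    L = [[0.0] * N for _ in range(N)]
    counts = {"mul": 0, "add": 0, "sqrt": 0}
    for n in range(N):
        # v = R[n:, n] - L[n:, :n] @ L[n, :n]: matrix-vector product, then subtraction.
        v = []
        for i in range(n, N):
            acc = R[i][n]
            for k in range(n):
                acc -= L[i][k] * L[n][k]
                counts["mul"] += 1
                counts["add"] += 1
            v.append(acc)
        # Diagonal entry: one square root; it is not divided by itself afterwards.
        d = math.sqrt(v[0])
        counts["sqrt"] += 1
        L[n][n] = d
        # Remaining entries of the column are divided by the new diagonal entry.
        for i in range(n + 1, N):
            L[i][n] = v[i - n] / d
            counts["mul"] += 1
    return L, counts


R = [[4, 2, 2], [2, 5, 3], [2, 3, 6]]
L, counts = cholesky_gaxpy_counted(R)
print(L, counts)
```

For this 3 x 3 example the tally is 7 multiplications, 4 summations, and 3 square roots, i.e. 14 operations when each square root is counted as one FLOP.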
2.2.2 Cholesky Decomposition <<FORMULA>>
The main advantage of the <<FORMULA>> decomposition compared to the standard <<FORMULA>> decomposition is that no square root operations are needed, which may require more than one FLOP depending on the given hardware platform. Another benefit of the <<FORMULA>> decomposition is that it does not require a positive definite matrix R; the only two conditions for the unique existence are that R is Hermitian and that all but the last principal minor (i.e., the determinant) of R are different from zero [2]. Hence, R may also be rank deficient to a certain degree. If R is not positive semidefinite, then D may contain negative main diagonal entries.
The outcome of the decomposition is a lower triangular matrix L1 with ones on the main diagonal and a diagonal matrix D.
Algorithm 2.2 overwrites the strictly lower left part of the matrix R with the strictly lower part of L1 (i.e., without the ones on the main diagonal) and overwrites the main diagonal of R with the main diagonal of D. It is taken from [1] and slightly modified, such that it is also applicable to complex matrices (see the conjugate in Line 4) and no existing scalar is re-computed (see the case distinction in Line 4 for i = 1).
Line 1 needs <<FORMULA>> multiplications. Lines 3 to 5 require <<FORMULA>> multiplications and are executed for <<FORMULA>>, yielding <<FORMULA>> multiplications. Line 6 takes <<FORMULA>>
multiplications and <<FORMULA>> summations, again with n = 2, ..., N, yielding <<FORMULA>> multiplications and the same amount of summations. Line 7 does not require any FLOP. In Line 8, the matrix-vector product needs <<FORMULA>> multiplications, and additional <<FORMULA>> multiplications arise when the complete numerator is divided by the denominator. Hence, we have <<FORMULA>> multiplications. For <<FORMULA>> we get <<FORMULA>> multiplications.
The number of summations in Line 8 is <<FORMULA>> for the matrix vector product and <<FORMULA>> for the subtraction in the numerator. Together, we have <<FORMULA>> summations. With
<<FORMULA>> summations. Summing up, this algorithm requires <<FORMULA>> multiplications, and <<FORMULA>> summations, yielding a total amount of <<FORMULA>> FLOPs. (Note that this formula is also valid for N =1.)
2.3 Inverses of Matrices
2.3.1 Inverse <<FORMULA>> of a Lower Triangular Matrix L
Let <<FORMULA>> denote the inverse of a lower triangular matrix L. Then, X is again lower triangular which means that <<FORMULA>> for <<FORMULA>>. The following equation holds:
<<FORMULA>>. (2.8)
Via forward substitution, the above system can easily be solved. Row <<FORMULA>> from (2.8) can be expressed as
<<FORMULA>>, (2.9)
with <<FORMULA>> denoting the Kronecker delta which vanishes for <<FORMULA>>, and <<FORMULA>>. Starting from <<FORMULA>>, the x_{b,n} are computed successively, and we find
<<FORMULA>> (2.10)
with all <<FORMULA>> having been computed in previous steps. Hence, if <<FORMULA>> and a single multiplication is required, no summations are needed. For <<FORMULA>> multiplications and <<FORMULA>> summations are required, as the Kronecker-delta vanishes. All main diagonal entries can be computed by means of N multiplications. The lower left off-diagonal entries
Actually, it is a division rather than a multiplication.
require
<<FORMULA>> (2.11)
multiplications, and
<<FORMULA>> (2.12)
summations. Including the N multiplications for the main-diagonal entries, <<FORMULA>> multiplications and <<FORMULA>> summations have to be implemented, yielding a total amount
<<FORMULA>> FLOPs.
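The forward-substitution scheme described above can be sketched as follows for a real lower triangular matrix. This is a minimal illustration under stated assumptions: the function name `invert_lower_triangular` is hypothetical, indices are zero-based, divisions are counted as multiplications, and the sign flip in the off-diagonal entries is not counted as a FLOP.

```python
def invert_lower_triangular(L):
    """Invert a real lower triangular matrix with nonzero diagonal, column by column."""
    N = len(L)
    X = [[0.0] * N for _ in range(N)]
    counts = {"mul": 0, "add": 0}
    for n in range(N):            # column of X
        for b in range(n, N):     # rows on and below the main diagonal
            if b == n:
                # Main-diagonal entry: a single division (counted as a multiplication).
                X[b][n] = 1.0 / L[b][b]
                counts["mul"] += 1
            else:
                # The Kronecker delta vanishes; accumulate sum_{a=n}^{b-1} l_{b,a} x_{a,n}.
                acc = L[b][n] * X[n][n]
                counts["mul"] += 1
                for a in range(n + 1, b):
                    acc += L[b][a] * X[a][n]
                    counts["mul"] += 1
                    counts["add"] += 1
                X[b][n] = -acc / L[b][b]
                counts["mul"] += 1
    return X, counts


L = [[2, 0, 0], [1, 2, 0], [1, 1, 2]]
X, counts = invert_lower_triangular(L)
print(X, counts)
```

For this 3 x 3 example the sketch uses 10 multiplications and 1 summation. The entries above the main diagonal are never touched, which is where the savings over a general inversion come from.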
2.3.2 Inverse <<FORMULA>> of a Lower Triangular Matrix L1 with Ones on the Main Diagonal
The inverse of a lower triangular matrix L1 turns out to require N^2 FLOPs fewer than the inverse of L with arbitrary nonzero diagonal elements. Let X denote the inverse of L1. Clearly, X is again a lower triangular matrix with ones on the main diagonal. We can exploit this fact in order to compute only the unknown entries.
The mth row and nth column of the system of equations <<FORMULA>> with <<FORMULA>> reads as
<<FORMULA>>
or, equivalently,
<<FORMULA>>
Hence, X is computed via forward substitution. To compute <<FORMULA>>, we need <<FORMULA>> multiplications and <<FORMULA>> summations. Remember that <<FORMULA>>. The total number of multiplications/summations is obtained from
<<FORMULA>> (2.13)
We only have to consider <<FORMULA>>, since the equations resulting from m < n + 1 are automatically fulfilled due to the structure of L1 and X.
Summing up, <<FORMULA>> FLOPs are needed.
2.3.3 Inverse R^{-1} of a Positive Definite Matrix R
The inverse of a matrix can for example be computed via Gaussian elimination [1]. However, this approach is computationally expensive and does not exploit the Hermitian structure of R. Instead, it is more efficient to start with the Cholesky decomposition of <<FORMULA>> (see Subsection 2.2.1), invert the lower triangular matrix L (see Subsection 2.3.1), and then build the Gram <<FORMULA>> of <<FORMULA>> (see Subsection 2.1.15). Summing up the respective number of operations, this procedure requires <<FORMULA>> multiplications, <<FORMULA>> summations, and N square-root operations, which yields a total amount of <<FORMULA>> FLOPs.
2.4 Solving Systems of Equations
2.4.1 Product <<FORMULA>> with <<FORMULA>> not known a priori.
A naive way of computing the solution <<FORMULA>> of the equation <<FORMULA>> is to find <<FORMULA>> first and afterwards multiply it by C. This approach needs <<FORMULA>> FLOPs as shown in Sections 2.3.1 and 2.1.10. However, doing so is very expensive since we are not interested in the inverse of L in general. Hence, there must be a computationally cheaper variant. Again, forward substitution plays a key role.
It is easy to see that X can be computed column-wise. Let <<FORMULA>> and <<FORMULA>>. Then, from <<FORMULA>>, we get for the element x_{b,a} in row b and column a of X:
<<FORMULA>>
Its computation requires b multiplications and <<FORMULA>> summations. A complete column of X can therefore be computed with <<FORMULA>> multiplications and <<FORMULA>> summations. The complete matrix X with L columns thus needs <<FORMULA>> FLOPs, so the forward substitution saves <<FORMULA>> FLOPs compared to the direct inversion of L and a subsequent matrix-matrix product. Interestingly, computing <<FORMULA>> with <<FORMULA>> unknown is as expensive as computing LC, see Section 2.1.10.
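The column-wise forward substitution can be sketched as below for real matrices. The function name `forward_substitution` and the zero-based indexing are assumptions for illustration. Each entry in (one-based) row b costs b multiplications, counting the final division as one, and b - 1 summations.

```python
def forward_substitution(L, C):
    """Solve L X = C for X, with L real lower triangular (N x N) and C of size N x M."""
    N, M = len(L), len(C[0])
    X = [[0.0] * M for _ in range(N)]
    flops = 0
    for a in range(M):        # the columns of X are independent of each other
        for b in range(N):    # rows, from top to bottom
            acc = C[b][a]
            for k in range(b):
                acc -= L[b][k] * X[k][a]
                flops += 2    # one product and one summation
            X[b][a] = acc / L[b][b]
            flops += 1        # division, counted as a multiplication
    return X, flops


L = [[2, 0, 0], [1, 2, 0], [1, 1, 2]]
C = [[2, 4], [3, 2], [5, 3]]
X, flops = forward_substitution(L, C)
print(X, flops)
```

With N = 3 and two right-hand-side columns the sketch reports 18 operations, that is N^2 per column, and the inverse of L is never formed explicitly.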
3. Overview
<<FORMULA>> and <<FORMULA>> are arbitrary matrices. <<FORMULA>> is a diagonal matrix, <<FORMULA>> is lower triangular, <<FORMULA>> is lower triangular with ones on the main diagonal, <<FORMULA>>, and <<FORMULA>> is positive definite.
<<TABLE>>
Appendix
A frequently occurring summation in FLOP counting is the sum of subsequent integers. By complete induction, we find
<<FORMULA>> (A1)
The above result can easily be verified by recognizing that the sum of the n-th and the <<FORMULA>> summand is equal to <<FORMULA>>, and we have <<FORMULA>> such pairs.
Another sum of relevance is the sum of subsequent squared integers. Again, via complete induction, we find
<<FORMULA>> (A2)
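Both sums can be checked numerically. The sketch below assumes that (A1) and (A2) are the standard closed forms N(N+1)/2 and N(N+1)(2N+1)/6; the function names are hypothetical, and the loop simply compares each closed form against a brute-force sum.

```python
def sum_of_integers(N):
    """Closed form assumed for (A1): 1 + 2 + ... + N."""
    return N * (N + 1) // 2


def sum_of_squares(N):
    """Closed form assumed for (A2): 1^2 + 2^2 + ... + N^2."""
    return N * (N + 1) * (2 * N + 1) // 6


# Compare against brute-force summation for small N.
for N in range(50):
    assert sum_of_integers(N) == sum(range(1, N + 1))
    assert sum_of_squares(N) == sum(n * n for n in range(1, N + 1))
```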
Bibliography
[1] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1991.
[2] Kh. D. Ikramov and N. V. Savel'eva, "Conditionally definite matrices," Journal of Mathematical Sciences, vol. 98, no. 1, pp. 1-50, 2000.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Green AI
Roy Schwartz Jesse Dodge Noah A. Smith Oren Etzioni
Allen Institute for AI, Seattle, Washington, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
University of Washington, Seattle, Washington, USA
Abstract
The computations required for deep learning research have been doubling every few months, resulting in an
estimated 300,000x increase from 2012 to 2018 [2]. These computations have a surprisingly large carbon footprint
[40]. Ironically, deep learning was inspired by the human brain, which is remarkably energy efficient. Moreover, the
financial cost of the computations can make it difficult for academics, students, and researchers, in particular those
from emerging economies, to engage in deep learning research.
This position paper advocates a practical solution by making efficiency an evaluation criterion for research alongside
accuracy and related measures. In addition, we propose reporting the financial cost or “price tag” of developing,
training, and running models to provide baselines for the investigation of increasingly efficient methods. Our goal is
to make AI both greener and more inclusive—enabling any inspired undergraduate with a laptop to write high-quality
research papers. Green AI is an emerging focus at the Allen Institute for AI.
1 Introduction and Motivation
Since 2012, the field of artificial intelligence has reported remarkable progress on a broad range of capabilities including
object recognition, game playing, machine translation, and more [36]. This progress has been achieved by
increasingly large and computationally-intensive deep learning models. 1 Figure 1 reproduced from [2] plots training
cost increase over time for state-of-the-art deep learning models starting with AlexNet in 2012 [20] to AlphaZero in
2017 [38]. The chart shows an overall increase of 300,000x, with training cost doubling every few months. An even
sharper trend can be observed in NLP word embedding approaches by looking at ELMo [29] followed by BERT [8],
openGPT-2 [30], and XLNet [48]. An important paper [40] has estimated the carbon footprint of several NLP models
and argued that this trend is both environmentally unfriendly (which we refer to as Red AI ) and expensive, raising
barriers to participation in NLP research.
This trend is driven by the strong focus of the AI community on obtaining “state-of-the-art” results, 2 as exemplified
by the rising popularity of leaderboards [46, 45], which typically report accuracy measures but omit any mention of
cost or efficiency (see, for example, leaderboards.allenai.org). Despite the clear benefits of improving
model accuracy in AI, the focus on this single metric ignores the economic, environmental, or social cost of reaching
the reported accuracy.
We advocate increasing research activity in Green AI —AI research that is more environmentally friendly and
inclusive. We emphasize that Red AI research has been yielding valuable contributions to the field of AI, but it has been
overly dominant. We want to shift the balance towards the Green AI option—to ensure that any inspired undergraduate
with a laptop has the opportunity to write high-quality papers that could be accepted at premier research conferences.
1 For brevity, we refer to AI throughout this paper, but our focus is on AI research that relies on deep learning methods.
2 Meaning, in practice, that a system's accuracy on some benchmark is greater than any previously reported system's accuracy.
<<FIGURE>>
Figure 1: The amount of compute used to train deep learning models has increased 300,000x in 6 years. Figure taken
from [2].
Specifically, we propose making efficiency a more common evaluation criterion for AI papers alongside accuracy and
related measures.
AI research can be computationally expensive in a number of ways, but each provides opportunities for efficient
improvements; for example, papers could be required to plot accuracy as a function of computational cost and of
training set size, providing a baseline for more data-efficient research in the future. Reporting the computational price
tag of finding, training, and running models is a key Green AI practice (see Equation 1). In addition to providing
transparency, price tags are baselines that other researchers could improve on.
Our empirical analysis in Figure 2 suggests that the AI research community has paid relatively little attention to
computational efficiency. In fact, as Figure 1 illustrates, the computational cost of research is increasing exponentially,
at a pace that far exceeds Moore's Law [28]. Red AI is on the rise despite the well-known diminishing returns of
increased cost (e.g., Figure 3). This paper identifies key factors that contribute to Red AI and advocates the introduction
of a simple, easy-to-compute efficiency metric that could help make some AI research greener, more inclusive, and
perhaps more cognitively plausible. Green AI is part of a broader, long-standing interest in environmentally-friendly
scientific research (e.g., see the journal Green Chemistry). Computer science, in particular, has a long history of
investigating sustainable and energy-efficient computing (e.g., see the journal Sustainable Computing: Informatics
and Systems).
The remainder of this paper is organized as follows. Section 2 analyzes practices that move deep-learning research
into the realm of Red AI . Section 3 discusses our proposals for Green AI. Section 4 considers related work, and we
conclude with a discussion of directions for future research.
2 Red AI
Red AI refers to AI research that seeks to obtain state-of-the-art results in accuracy (or related measures) through
the use of massive computational power—essentially “buying” stronger results. Yet the relationship between model
performance and model complexity (measured as number of parameters or inference time) has long been understood
to be at best logarithmic; for a linear gain in performance, an exponentially larger model is required [18]. Similar
trends exist with increasing the quantity of training data [41, 13] and the number of experiments [9]. In each of these
cases, diminishing returns come at increased computational cost.
This section analyzes the factors contributing to Red AI and shows how it is resulting in diminishing returns over
time (see Figure 3). We note again that Red AI work is valuable, and in fact, much of it contributes to what we know
<<FIGURE>>
Figure 2: AI papers tend to target accuracy rather than efficiency. The figure shows the proportion of papers that
target accuracy, efficiency, both or other from a sample of 60 papers from top AI conferences.
by pushing the boundaries of AI. Our exposition here is meant to highlight areas where computational expense is high,
and to present each as an opportunity for developing more efficient techniques.
To demonstrate the prevalence of Red AI , we sampled 60 papers from top AI conferences (ACL, 3 NeurIPS, 4 and
CVPR 5 ). For each paper we noted whether the authors claim their main contribution to be (a) an improvement to
accuracy or some related measure, (b) an improvement to efficiency, (c) both, or (d) other. As shown in Figure 2, in all
conferences we considered, a large majority of the papers target accuracy (90% of ACL papers, 80% of NeurIPS papers
and 75% of CVPR papers). Moreover, for both empirical AI conferences (ACL and CVPR) only a small portion (10%
and 20% respectively) argue for a new efficiency result. 6 This highlights the focus of the AI community on measures
of performance such as accuracy, at the expense of measures of efficiency such as speed or model size. In this paper
we argue that a larger weight should be given to the latter.
To better understand the different ways in which AI research can be red, consider an AI result reported in a scientific
paper. This result typically includes a model trained on a training dataset and evaluated on a test dataset. The process
of developing that model often involves multiple experiments to tune its hyperparameters. When considering the
different factors that increase the computational and environmental cost of producing such a result, three factors come
to mind: the cost of executing the model on a single (E)xample (either during training or at inference time); the size
of the training (D)ataset, which controls the number of times the model is executed during training, and the number of
(H)yperparameter experiments, which controls how many times the model is trained during model development. The
total cost of producing a (R)esult in machine learning increases linearly with each of these quantities. This cost can
be estimated as follows:
<<FORMULA>>
Equation 1: The equation of Red AI : The cost of an AI (R)esult grows linearly with the cost of processing a single
(E)xample, the size of the training (D)ataset and the number of (H)yperparameter experiments.
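As a rough illustration of Equation 1, the sketch below treats the cost of a result as the plain product of the three quantities, up to an unknown constant factor. The function name `relative_result_cost` and all numbers are hypothetical and serve only to show how the factors compound.

```python
def relative_result_cost(example_cost, dataset_size, num_experiments):
    """Cost(R) up to a constant factor: the product of E, D, and H."""
    return example_cost * dataset_size * num_experiments


# Hypothetical baseline: unit cost per example, one million examples, ten experiments.
baseline = relative_result_cost(1.0, 1_000_000, 10)

# Doubling only the cost of processing one example doubles the total cost.
bigger_model = relative_result_cost(2.0, 1_000_000, 10)

# Doubling all three quantities multiplies the total cost by eight.
more_of_everything = relative_result_cost(2.0, 2_000_000, 20)

print(bigger_model / baseline, more_of_everything / baseline)
```

Because the factors multiply, modest growth in each of them at the same time leads to a much larger growth in the overall cost.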
Equation 1 is a simplification (e.g., different hyperparameter assignments can lead to different costs for processing
a single example). It also ignores other factors such as the number of training epochs. Nonetheless, it illustrates three
quantities that are each an important factor in the total cost of generating a result. Below, we consider each quantity
separately.
6 Interestingly, many NeurIPS papers included convergence rates or regret bounds which describe performance as a function of examples or
iterations, thus targeting efficiency (55%). This indicates an increased awareness of the importance of this concept, at least in theoretical analyses.
Expensive processing of one example Our focus is on neural models, where it is common for each training step
to require inference, so we discuss training and inference cost together as “processing” an example. Some works
have used increasingly expensive models which require great amounts of resources, and as a result, in these models,
performing inference can require a lot of computation, and training even more so. For instance, Google's BERT-large
[8] contains roughly 350 million parameters. OpenAI's openGPT2-XL model [30] contains 1.5 billion parameters.
AI2, our home organization, recently released Grover [49], also containing 1.5 billion parameters. In the computer
vision community, a similar trend is observed (Figure 1).
Such large models have high costs for processing each example, which leads to large training costs. BERT-large
was trained on 64 TPU chips for 4 days. Grover was trained on 256 TPU chips for two weeks, at an estimated cost of
$25,000. XLNet had a similar architecture to BERT-large, but used a more expensive objective function (in addition
to an order of magnitude more data), and was trained on 512 TPU chips for 2.5 days. 7 It is impossible to reproduce
the best BERT-large results 8 or XLNet results 9 using a single GPU. Specialized models can have even more extreme
costs, such as AlphaGo, the best version of which required 1,920 CPUs and 280 GPUs to play a single game of Go
[37] at a cost of over $1,000 per hour. 10
When examining variants of a single model (e.g., BERT-small and BERT-large) we see that larger models can have
stronger performance, which is a valuable scientific contribution. However, this implies the financial and environmental
cost of increasingly large AI models will not decrease soon, as the pace of model growth far exceeds the resulting
increase in model performance [16]. As a result, more and more resources are going to be required to keep improving
AI models by simply making them larger.
Processing many examples Another way state-of-the-art performance has recently been progressing in AI is by
successively increasing the amount of training data models are trained on. BERT-large had top performance in 2018
across many NLP tasks after training on 3 billion word-pieces. XLNet recently outperformed BERT after training
on 32 billion word-pieces, including part of Common Crawl; openGPT-2-XL trained on 40 billion words; FAIR's
RoBERTa [23] was trained on 160GB of text, roughly 40 billion word-pieces, requiring around 25,000 GPU hours
to train. In computer vision, researchers from Facebook [25] pretrained an image classification model on 3.5 billion
images from Instagram, three orders of magnitude larger than existing labelled image datasets such as Open Images. 11
The use of massive data creates barriers for many researchers for reproducing the results of these models, or
training their own models on the same setup (especially as training for multiple epochs is standard). For example, the
June 2019 Common Crawl contains 242 TB of uncompressed data, 12 so even storing the data is expensive. Finally,
as in the case of model size, relying on more data to improve performance is notoriously expensive because of the
diminishing return of adding more data [41]. For instance, Figure 3, taken from [25], shows a logarithmic relation
between the object recognition top-1 accuracy and the number of training examples.
Massive number of experiments Some projects have poured large amounts of computation into tuning hyperparameters
or searching over neural architectures, well beyond the reach of most researchers. For instance, researchers
from Google [51] trained over 12,800 neural networks in their neural architecture search to improve performance on
object detection and language modeling. With a fixed architecture, researchers from DeepMind [26] evaluated 1,500
hyperparameter assignments to demonstrate that an LSTM language model [15] can reach state-of-the-art perplexity
results. Despite the value of this result in showing that the performance of an LSTM does not plateau after only a few
hyperparameter trials, fully exploring the potential of other competitive models for a fair comparison is prohibitively
expensive.
7 Some estimates for the cost of this process reach $250,000 (twitter.com/eturner303/status/1143174828804857856).
8 See https://github.com/google-research/bert
9 See https://github.com/zihangdai/xlnet
10 Recent versions of AlphaGo are far more efficient [39].
11 https://opensource.google.com/projects/open-images-dataset
12 http://commoncrawl.org/2019/07/
<<FIGURE>>
Figure 3: Diminishing returns of training on more data: object detection accuracy increases linearly as the number of
training examples increases exponentially [25].
The topic of massive number of experiments is not as well studied as the first two discussed above. In fact, the
number of experiments performed during model construction is often underreported. Nonetheless, evidence for a
logarithmic relation exists here as well, between the number of experiments and performance gains [9].
Discussion The benefits of pouring more resources into models are certainly of interest to the AI community. Indeed,
there is value in pushing the limits of model size, dataset size, and the hyperparameter search space. Currently, despite
the massive amount of resources put into recent AI models, such investment still pays off in terms of downstream
performance (albeit at an increasingly lower rate). Finding the point of saturation (if such exists) is an important
question for the future of AI.
Our goal in this paper is to raise awareness of the cost of Red AI , as well as encourage the AI community to
recognize the value of work by researchers who take a different path, optimizing efficiency rather than accuracy. Next
we turn to discuss concrete measures for making AI more green.
3 Green AI
The term Green AI refers to AI research that yields novel results without increasing computational cost, and ideally
reducing it. Whereas Red AI has resulted in rapidly escalating computational (and thus carbon) costs, Green AI has the
opposite effect. If measures of efficiency are widely accepted as important evaluation metrics for research alongside
accuracy, then researchers will have the option of focusing on the efficiency of their models with positive impact on
both the environment and inclusiveness. This section reviews several measures of efficiency that could be reported
and optimized, and advocates one particular measure—FPO—which we argue should be reported when AI research
findings are published.
3.1 Measures of Efficiency
To measure efficiency, we suggest reporting the amount of work required to generate a result in AI, that is, the amount
of work required to train a model, and if applicable, the total work for all hyperparameter tuning experiments. As
the cost of an experiment decomposes into the cost of processing a single example, the size of the dataset, and the
number of experiments (Equation 1), reducing the amount of work in each of these steps will result in AI that is more
green.
When reporting the amount of work done by a model, we want to measure a quantity that allows for a fair comparison
between different models. As a result, this measure should ideally be stable across different labs, at different
times, and using different hardware.
Carbon emission Carbon emission is appealing as it is a quantity we want to directly minimize. Nonetheless it
is impractical to measure the exact amount of carbon released by training or executing a model, and, accordingly, by
generating an AI result, as this amount depends highly on the local electricity infrastructure. As a result, it is not
comparable between researchers in different locations or even the same location at different times.
Electricity usage Electricity usage is correlated with carbon emission while being time- and location-agnostic.
Moreover, GPUs often report the amount of electricity each of their cores consumes at each time point, which facilitates
the estimation of the total amount of electricity consumed by generating an AI result. Nonetheless, this measure is
hardware dependent, and as a result does not allow for a fair comparison between different models.
Elapsed real time The total running time for generating an AI result is a natural measure for efficiency, as all other
things being equal, a faster model is doing less computational work. Nonetheless, this measure is highly influenced
by factors such as the underlying hardware, other jobs running on the same machine, and the number of cores used.
These factors hinder the comparison between different models, as well as the decoupling of modeling contributions
from hardware improvements.
Number of parameters Another common measure of efficiency is the number of parameters (learnable or total)
used by the model. As with run time, this measure is correlated with the amount of work. Unlike the other measures
described above, it does not depend on the underlying hardware. Moreover, this measure also highly correlates with the
amount of memory consumed by the model. Nonetheless, different algorithms make different use of their parameters,
for instance by making the model deeper vs. wider. As a result, different models with a similar number of parameters
often perform different amounts of work.
FPO As a concrete measure, we suggest reporting the total number of floating point operations (FPO) required to
generate a result. 13 FPO provides an estimate of the amount of work performed by a computational process. It is
computed analytically by assigning a cost to two base operations, ADD and MUL. Based on these operations, the FPO
cost of any machine learning abstract operation (e.g., a tanh operation, a matrix multiplication, a convolution operation,
or the BERT model) can be computed as a recursive function of these two operations. FPO has been used in the past
to quantify the energy footprint of a model [27, 43, 12, 42], but is not widely adopted in AI.
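The recursive definition can be sketched in a few lines. The example below assumes unit cost for both ADD and MUL and ignores activation functions; the function names and layer sizes are hypothetical, and real FPO counters may use different conventions (for example, counting a fused multiply-add as one operation).

```python
ADD_COST = 1  # assumed unit cost of one floating point addition
MUL_COST = 1  # assumed unit cost of one floating point multiplication


def dot_fpo(n):
    """Dot product of two length-n vectors: n multiplications and n - 1 additions."""
    return n * MUL_COST + (n - 1) * ADD_COST


def matmul_fpo(m, k, n):
    """(m x k) times (k x n): one length-k dot product per output entry."""
    return m * n * dot_fpo(k)


def linear_layer_fpo(batch, d_in, d_out):
    """Affine layer x W + b: a matrix product plus one bias addition per output."""
    return matmul_fpo(batch, d_in, d_out) + batch * d_out * ADD_COST


def mlp_fpo(batch, layer_sizes):
    """Sum of the affine layers of a multilayer perceptron (activations ignored)."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(linear_layer_fpo(batch, d_in, d_out) for d_in, d_out in pairs)


print(mlp_fpo(1, [3, 2, 1]))
```

Each higher-level operation is expressed only in terms of lower-level ones, so changing the two base costs propagates through the whole model without touching the rest of the code.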
FPO has several appealing properties. First, it directly computes the amount of work done by the running machine
when executing a specific instance of a model, and is thus tied to the amount of energy consumed. Second, FPO is
agnostic to the hardware on which the model is run. This facilitates fair comparisons between different approaches,
unlike the measures described above. Third, FPO is strongly correlated with the running time of the model [4]. Unlike
asymptotic runtime, FPO also considers the amount of work done at each time step.
Several packages exist for computing FPO in various neural network libraries, 14 though none of them contains all
the building blocks required to construct all modern AI models. We encourage the builders of neural network libraries
to implement such functionality directly.
13 Floating point operations are often referred to as FLOP(s), though this term is not uniquely defined [12]. To avoid confusion, we use the term FPO.
14 E.g., https://github.com/Swall0w/torchstat; https://github.com/Lyken17/pytorch-OpCounter
<<FIGURE>>
Figure 4: Increase in FPO results in diminishing return for object detection top-1 accuracy. Plots (bottom to top):
model parameters (in million), FPO (in billions), top-1 accuracy on ImageNet. (4a): Different models: AlexNet
[20], ResNet [14], ResNext [47], DPN107 [5], SENet154 [17]. (4b): Comparison of different sizes (measured by the
number of layers) of the ResNet model [14].
Discussion Efficient machine learning approaches have received attention in the research community, but are generally
not motivated by being green. For example, a significant amount of work in the computer vision community has
addressed efficient inference, which is necessary for real-time processing of images for applications like self-driving
cars [24, 31, 22], or for placing models on devices such as mobile phones [16, 34]. Most of these approaches target efficient
model inference [32, 50, 12], 15 and thus only minimize the cost of processing a single example, while ignoring
the other two red practices discussed in Section 2. 16
The above examples indicate that the path to making AI green depends on how it is used. When developing a new
model, much of the research process involves training many model variants on a training set and performing inference
on a small development set. In such a setting, more efficient training procedures can lead to greater savings, while in
a production setting more efficient inference can be more important. We advocate for a holistic view of computational
savings which doesn't sacrifice in some areas to make advances in others.
FPO has some limitations. First, it targets the electricity consumption of a model, while ignoring other potential
limiting factors for researchers such as the memory consumption by the model, which can often lead to additional
energy and monetary costs [24]. Second, the amount of work done by a model largely depends on the model implementation,
as two different implementations of the same model could result in very different amounts of processing
work. Due to the focus on the modeling contribution, the AI community has traditionally ignored the quality or efficiency
of models' implementation. We argue that the time to reverse this norm has come, and that exceptionally
good implementations that lead to efficient models should be credited by the AI community.
3.2 FPO Cost of Existing Models
To demonstrate the importance of reporting the amount of work, we present FPO costs for several existing models.
A few trends are observable. First, as discussed in Section 2, models get more expensive with time, but the increase
in FPO does not lead to similar performance gains. For instance, an increase of almost 35% in FPO between ResNet and
ResNext (second and third points in graph) resulted in a 0.5% top-1 accuracy improvement. Similar patterns are observed
when considering the effect of other increases in model work. Second, the number of model parameters does not tell
the whole story: AlexNet (first point in the graph) actually has more parameters than ResNet (second point), but
dramatically less FPO, and also much lower accuracy.
Figure 4b shows the same analysis for a single object recognition model, ResNet [14], while comparing different
versions of the model with different number of layers. This creates a controlled comparison between the different
models, as they are identical in architecture, except for their size (and accordingly, their FPO cost). Once again, we
notice the same trend: the large increase in FPO cost does not translate to a large increase in performance.
14 Figure 4a shows the number of parameters and FPO of several leading object recognition models, as well as their performance on the ImageNet dataset [6].
15 Some very recent work also targeted efficient training [7].
16 In fact, creating smaller models often results in longer running time, so mitigating the different trends might be at odds [44].
17 We consider this exclusive focus on the final prediction another symptom of Red AI.
18 These numbers represent FPO per inference, i.e., the work required to process a single example.
3.3 Additional Ways to Promote Green AI
In addition to reporting the FPO cost of the final reported number, we encourage researchers to report the
budget/accuracy curve observed during training. In a recent paper [9], we observed that selecting the better performing
model on a given task depends highly on the amount of compute available during model development. We introduced
a method for computing the expected best validation performance of a model as a function of the given budget. We
argue that reporting this curve will allow users to make wiser decisions about their selection of models and highlight
the stability of different approaches.
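One way to compute such a curve is sketched below. It assumes that validation scores from independent random hyperparameter draws are available, and it estimates the expected best score among n draws from their empirical distribution. The function name and the toy scores are hypothetical, and the estimator is a sketch in the spirit of [9], not a reproduction of its code.

```python
def expected_max_performance(scores, budget):
    """Expected best score among `budget` draws (with replacement) from `scores`."""
    ordered = sorted(scores)
    total = len(ordered)
    expected = 0.0
    for rank, value in enumerate(ordered, start=1):
        # Probability that the best of `budget` draws lands on this rank:
        # P(max <= rank-th value) - P(max <= previous value).
        prob = (rank / total) ** budget - ((rank - 1) / total) ** budget
        expected += value * prob
    return expected


# Hypothetical validation accuracies from five random hyperparameter draws.
toy_scores = [0.61, 0.70, 0.66, 0.74, 0.58]
curve = [expected_max_performance(toy_scores, n) for n in range(1, 6)]
```

With a budget of one draw the estimate equals the mean score, and it rises toward the best observed score as the budget grows. Plotting `curve` against the budget gives the budget/accuracy curve for one approach.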
We further advocate for making efficiency an official contribution in major AI conferences, by advising reviewers
to recognize and value contributions that do not strictly improve state of the art, but have other benefits such as
efficiency. Finally, we note that the trend of releasing pretrained models publicly is a green success, and we would like
to encourage organizations to continue to release their models in order to save others the costs of retraining them.
4 Related Work
Recent work has analyzed the carbon emissions of training deep NLP models [40] and concluded that computationally
expensive experiments can have a large environmental and economic impact. With modern experiments using such
large budgets, many researchers (especially those in academia) lack the resources to work in many high-profile areas;
increased value placed on computationally efficient approaches will allow research contributions from more diverse
groups. We emphasize that the conclusions of [40] are the result of long-term trends, and are not isolated within NLP,
but hold true across machine learning.
While some companies offset electricity usage by purchasing carbon credits, it is not clear that buying credits is
as effective as using less energy. In addition, purchasing carbon credits is voluntary; Google Cloud 20 and Microsoft
Azure 21 purchase carbon credits to offset their spent energy, but Amazon's AWS 22 (the largest cloud computing
platform 23) only covered fifty percent of its power usage with renewable energy.
The push to improve state-of-the-art performance has focused the research community's attention on reporting the
single best result after running many experiments for model development and hyperparameter tuning. Failure to fully
report these experiments prevents future researchers from understanding how much effort is required to reproduce a
result or extend it [9].
Our focus is on improving efficiency in the machine learning community, but machine learning can also be used
as a tool for work in areas like climate change. For example, machine learning has been used for reducing emissions
of cement plants [1] and tracking animal conservation outcomes [11], and is predicted to be useful for forest fire
management [33]. Undoubtedly these are important applications of machine learning; we recognize that they are
orthogonal to the content of this paper.
19 Numbers taken from https://github.com/sovrasov/flops-counter.pytorch
20 https://cloud.google.com/sustainability/
21 https://www.microsoft.com/en-us/environment/carbon
22 https://aws.amazon.com/about-aws/sustainability/
23 https://tinyurl.com/y2kob969
5 Conclusion
The vision of Green AI raises many exciting research directions that help to overcome the inclusiveness challenges of
Red AI. Progress will reduce the computational expense with a minimal reduction in performance, or even improve
performance as more efficient methods are discovered. Also, it would seem that Green AI could be moving us in a
more cognitively plausible direction as the brain is highly efficient.
It's important to reiterate that we see Green AI as a valuable option, not an exclusive mandate; of course, both
Green AI and Red AI have contributions to make. We want to increase the prevalence of Green AI by highlighting its
benefits and advocating a standard measure of efficiency. Below, we point to a few important green research directions,
and highlight a few open questions.
Research on building space or time efficient models is often motivated by fitting a model on a small device (such
as a phone) or fast enough to process examples in real time, such as image captioning for the blind (see Section 3.1).
Some modern models don't even fit on a single GPU (see Section 2). Here we argue for a far broader approach.
Data efficiency has received significant attention over the years [35, 19]. Modern research in vision and NLP often
involves first pretraining a model on large “raw” (unannotated) data then fine-tuning it to a task of interest through
supervised learning. A strong result in this area often involves achieving similar performance to a baseline with
fewer training examples or fewer gradient steps. Most recent work has addressed fine-tuning data [29], but pretraining
efficiency is also important. In either case, one simple technique to improve in this area is to simply report performance
with different amounts of training data. For example, reporting performance of contextual embedding models trained
on 10 million, 100 million, 1 billion, and 10 billion tokens would facilitate faster development of new models, as they
can first be compared at the smallest data sizes. Research here is of value not just to make training less expensive, but
because in areas such as low resource languages or historical domains it is extremely hard to generate more data, so to
progress we must make more efficient use of what is available.
Finally, the total number of experiments run to get a final result is often underreported and underdiscussed [9]. The
few instances researchers have of full reporting of the hyperparameter search, architecture evaluations, and ablations
that went into a reported experimental result have surprised the community [40]. While many hyperparameter optimization
algorithms exist which can reduce the computational expense required to reach a given level of performance
[3, 10], simple improvements here can have a large impact. For example, stopping training early for models which are
clearly underperforming can lead to great savings [21].
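The early-stopping idea can be sketched as successive halving, the resource-allocation subroutine that Hyperband [21] builds on. The `train_and_score` callback, the toy learning rates, and the budget schedule are hypothetical stand-ins; for simplicity the sketch charges the full budget of every rung, as if each surviving configuration were retrained from scratch.

```python
def successive_halving(configs, train_and_score, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations per rung, with eta times more budget each rung."""
    survivors = list(configs)
    budget = min_budget
    spent = 0
    while len(survivors) > 1:
        scored = [(train_and_score(c, budget), c) for c in survivors]
        spent += budget * len(survivors)
        # Highest score first; underperforming configurations are dropped early.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta
    return survivors[0], spent


def toy_score(learning_rate, budget):
    # Toy stand-in for validation accuracy: best near 0.1, improves with budget.
    return (1.0 - abs(learning_rate - 0.1)) * (1.0 - 1.0 / (budget + 1))


learning_rates = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
best, spent = successive_halving(learning_rates, toy_score)
```

In this toy run the eight configurations cost 24 budget units in total, versus 32 if all eight had been trained at the largest budget of 4.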
References
[1] Prabal Acharyya, Sean D Rosario, Roey Flor, Ritvik Joshi, Dian Li, Roberto Linares, and Hongbao Zhang.
Autopilot of cement plants for reduction of fuel consumption and emissions, 2019. ICML Workshop on Climate
Change.
[2] Dario Amodei and Danny Hernandez. AI and compute, 2018. Blog post.
[3] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter
optimization. In Proc. of NeurIPS, 2011.
[4] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for
practical applications. In Proc. of ISCAS, 2017.
[5] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In
Proc. of NeurIPS, 2017.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical
image database. In Proc. of CVPR, 2009.
[7] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance,
2019. arXiv:1907.04840.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proc. of NAACL, 2019.
[9] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved
reporting of experimental results. In Proc. of EMNLP, 2019.
[10] Jesse Dodge, Kevin Jamieson, and Noah A. Smith. Open loop hyperparameter optimization and determinantal
point processes. In Proc. of AutoML, 2017.
[11] Clement Duhart, Gershon Dublon, Brian Mayton, Glorianna Davenport, and Joseph A. Paradiso. Deep learning
for wildlife conservation and restoration efforts, 2019. ICML Workshop on Climate Change.
[12] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast &
simple resource-constrained structure learning of deep networks. In Proc. of CVPR, 2018.
[13] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent
Systems, 24:8–12, 2009.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proc. of CVPR, 2016.
[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780,
1997.
[16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications,
2017. arXiv:1704.04861.
[17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proc. of CVPR, 2018.
[18] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer,
Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern
convolutional object detectors. In Proc. of CVPR, 2017.
[19] Sanket Kamthe and Marc Peter Deisenroth. Data-efficient reinforcement learning with probabilistic model
predictive control. In Proc. of AISTATS, 2018.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural
networks. In Proc. of NeurIPS, 2012.
[21] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband:
Bandit-based configuration evaluation for hyperparameter optimization. In Proc. of ICLR, 2017.
[22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C.
Berg. Ssd: Single shot multibox detector. In Proc. of ECCV, 2016.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized bert pretraining approach, 2019.
arXiv:1907.11692.
[24] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient
cnn architecture design. In Proc. of ECCV, 2018.
[25] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin
Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proc. of ECCV,
2018.
[26] Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In
Proc. of EMNLP, 2018.
[27] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks
for resource efficient inference. In Proc. of ICLR, 2017.
[28] Gordon E. Moore. Cramming more components onto integrated circuits, 1965.
[29] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018.
[30] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are
unsupervised multitask learners, 2019. OpenAI Blog.
[31] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification
using binary convolutional neural networks. In Proc. of ECCV, 2016.
[32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object
detection. In Proc. of CVPR, 2016.
[33] David Rolnick, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran,
Andrew Slavin Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan
Maharaj, Evan D. Sherwin, S. Karthik Mukkavilli, Konrad P. Körding, Carla Gomes, Andrew Y. Ng, Demis
Hassabis, John C. Platt, Felix Creutzig, Jennifer Chayes, and Yoshua Bengio. Tackling climate change with machine
learning, 2019. arXiv:1905.12616.
[34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2:
Inverted residuals and linear bottlenecks. In Proc. of CVPR, 2018.
[35] Roy Schwartz, Sam Thomson, and Noah A. Smith. SoPa: Bridging CNNs, RNNs, and weighted finite-state
machines. In Proc. of ACL, 2018.
[36] Yoav Shoham, Raymond Perrault, Erik Brynjolfsson, Jack Clark, James Manyika, Juan Carlos Niebles, Terah
Lyons, John Etchemendy, and Z Bauer. The AI index 2018 annual report. AI Index Steering Committee,
Human-Centered AI Initiative, Stanford University. Available at
http://cdn.aiindex.org/2018/AI%20Index%202018%20Annual%20Report.pdf, 2018.
[37] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe,
John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore
Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature,
529(7587):484, 2016.
[38] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc
Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis
Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.
arXiv:1712.01815.
[39] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas
Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre,
George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human
knowledge. Nature, 550(7676):354, 2017.
[40] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in
NLP. In Proc. of ACL, 2019.
[41] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of
data in deep learning era. In Proc. of ICCV, 2017.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Proc. of NeurIPS, 2017.
[43] Tom Veniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super
networks. In Proc. of CVPR, 2018.
[44] Aaron Walsman, Yonatan Bisk, Saadia Gabriel, Dipendra Misra, Yoav Artzi, Yejin Choi, and Dieter Fox. Early
fusion for goal directed robotic vision. In Proc. of IROS, 2019.
[45] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and
Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems,
2019. arXiv:1905.00537.
[46] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A
multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR, 2019.
[47] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations
for deep neural networks. In Proc. of CVPR, 2017.
[48] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet:
Generalized autoregressive pretraining for language understanding, 2019. arXiv:1906.08237.
[49] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi.
Defending against neural fake news, 2019. arXiv:1905.12616.
[50] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional
neural network for mobile devices. In Proc. of CVPR, 2018.
[51] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proc. of ICLR, 2017.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication
Herbert Jaeger* and Harald Haas
We present a method for learning nonlinear systems, echo state networks (ESNs). ESNs employ artificial recurrent neural networks in a way that has recently been proposed independently as a learning mechanism in biological brains. The learning method is computationally efficient and easy to use. On a benchmark task of predicting a chaotic time series, accuracy is improved by a factor of 2400 over previous techniques. The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.
Nonlinear dynamical systems abound in the sciences and in engineering. If one wishes to simulate, predict, filter, classify, or control such a system, one needs an executable system model. However, it is often infeasible to obtain analytical models. In such cases, one has to resort to black-box models, which ignore the internal physical mechanisms and instead reproduce only the outwardly observable input-output behavior of the target system.
If the target system is linear, efficient methods for black-box modeling are available. Most technical systems, however, become nonlinear if operated at higher operational points (that is, closer to saturation). Although this might lead to cheaper and more energy-efficient designs, it is not done because the resulting nonlinearities cannot be harnessed. Many biomechanical systems use their full dynamic range (up to saturation) and thereby become lightweight, energy efficient, and thoroughly nonlinear.
Here, we present an approach to learning black-box models of nonlinear systems, echo state networks (ESNs). An ESN is an artificial recurrent neural network (RNN). RNNs are characterized by feedback (recurrent) loops in their synaptic connection pathways. They can maintain an ongoing activation even in the absence of input and thus exhibit dynamic memory. Biological neural networks are typically recurrent. Like biological neural networks, an artificial RNN can learn to mimic a target system, in principle with arbitrary accuracy (1). Several learning algorithms are known (2–4) that incrementally adapt the synaptic weights of an RNN in order to tune it toward the target system. These algorithms have not been widely employed in technical applications because of slow convergence and suboptimal solutions (5, 6). The ESN approach differs from these methods in that a large RNN is used (on the order of 50 to 1000 neurons; previous techniques typically use 5 to 30 neurons) and in that only the synaptic connections from the RNN to the output readout neurons are modified by learning; previous techniques tune all synaptic connections (Fig. 1). Because there are no cyclic dependencies between the trained readout connections, training an ESN becomes a simple linear regression task.
International University Bremen, Bremen D-28759, Germany.
We illustrate the ESN approach on a task of chaotic time series prediction (Fig. 2) (7). The Mackey-Glass system (MGS) (8) is a standard benchmark system for time series prediction studies. It generates a subtly irregular time series (Fig. 2A). The prediction task has two steps: (i) using an initial teacher sequence generated by the original MGS to learn a black-box model M of the generating system, and (ii) using M to predict the value of the sequence some steps ahead.
First, we created a random RNN with 1000 neurons (called the reservoir) and one output neuron. The output neuron was equipped with random connections that project back into the reservoir (Fig. 2B). A 3000-step teacher sequence <<FORMULA>> was generated from the MGS equation and fed into the output neuron. This excited the internal neurons through the output feedback connections. After an initial transient period, they started to exhibit systematic individual variations of the teacher sequence (Fig. 2B).
The fact that the internal neurons display systematic variants of the exciting external signal is constitutional for ESNs: The internal neurons must work as echo functions for the driving signal. Not every randomly generated RNN has this property, but it can effectively be built into a reservoir (supporting online text).
It is important that the echo signals be richly varied. This was ensured by a sparse interconnectivity of 1% within the reservoir. This condition lets the reservoir decompose into many loosely coupled subsystems, establishing a richly structured reservoir of excitable dynamics.
After time <<n=3000>>, output connection weights wi (i = 1, . . . , 1000) were computed (dashed arrows in Fig. 2B) from the last 2000 steps n=1001, . . . , 3000 of the training run such that the training error
<<FORMULA>>
was minimized [<<xi(n)>>, activation of the ith internal neuron at time n]. This is a simple linear regression.
With the new wi in place, the ESN was disconnected from the teacher after step 3000 and left running freely. A bidirectional dynamical interplay of the network-generated output signal with the internal signals <<FORMULA>> unfolded. The output signal <<FORMULA>> was created from the internal neuron activation signals <<FORMULA>> through the trained connections wi, by <<FORMULA>>. Conversely, the internal signals were echoed from that output signal through the fixed output feedback connections (supporting online text).
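The procedure described so far (teacher-forced excitation of a random sparse reservoir, linear regression for the output weights, then free running) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the reservoir size (200), connectivity (5%), spectral radius (0.8), weight ranges, washout length, the purely linear readout, and the sine-wave teacher used in place of the MGS signal are all assumptions chosen to keep the example small.

```python
import numpy as np


def train_esn(teacher, n_res=200, density=0.05, rho=0.8, washout=200, seed=0):
    """Drive a random reservoir with the teacher signal and fit the readout weights."""
    rng = np.random.default_rng(seed)
    # Sparse random reservoir, rescaled to spectral radius rho (assumed value).
    W = rng.uniform(-1.0, 1.0, (n_res, n_res))
    W *= (rng.random((n_res, n_res)) < density)
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
    # Fixed random feedback connections from the output neuron into the reservoir.
    w_fb = rng.uniform(-1.0, 1.0, n_res)

    x = np.zeros(n_res)
    states = np.zeros((len(teacher), n_res))
    for n in range(1, len(teacher)):
        # Teacher forcing: the previous teacher value enters through the feedback weights.
        x = np.tanh(W @ x + w_fb * teacher[n - 1])
        states[n] = x

    # Discard the initial transient, then solve the linear regression for w_out.
    X, d = states[washout:], teacher[washout:]
    w_out, *_ = np.linalg.lstsq(X, d, rcond=None)
    train_nrmse = np.sqrt(np.mean((X @ w_out - d) ** 2) / np.var(d))
    return W, w_fb, w_out, x, train_nrmse


def free_run(W, w_fb, w_out, x, y_last, steps):
    """Run the trained network on its own output instead of the teacher."""
    outputs = np.zeros(steps)
    y = y_last
    for k in range(steps):
        x = np.tanh(W @ x + w_fb * y)
        y = float(w_out @ x)
        outputs[k] = y
    return outputs


# Toy teacher signal (an assumption; the paper uses a Mackey-Glass sequence).
teacher = 0.5 * np.sin(np.arange(1000) / 4.0)
W, w_fb, w_out, x, train_nrmse = train_esn(teacher)
prediction = free_run(W, w_fb, w_out, x, teacher[-1], steps=50)
```

Only `w_out` is learned; `W` and `w_fb` stay fixed after random initialization, which is why training reduces to a single least-squares solve.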
<<FIGURE>>
Fig. 1. (A) Schema of previous approaches to RNN learning. (B) Schema of ESN approach. Solid arrows, fixed synaptic connections; dotted arrows, adjustable connections. Both approaches aim at minimizing the error <<FORMULA>>, where <<FORMULA>> is the network output and d(n) is the teacher time series observed from the target system.
For testing, an 84-step continuation <<d(3001), ... , d(3084)>> of the original signal was computed for reference. The network output y(3084) was compared with the correct continuation d(3084). Averaged over 100 independent trials, a normalized root mean square error
<<FORMULA>>
was obtained (<<FORMULA>> and <<FORMULA>>, teacher and network output in trial j; σ², variance of MGS signal), improving the best previous techniques (9–15), which used training sequences of length 500 to 10,000, by a factor of 700. If the prediction run was continued, deviations typically became visible after about 1300 steps (Fig. 2A). With a refined variant of the learning method (7), the improvement factor rises to 2400. Models of similar accuracy were also obtained for other chaotic systems (supporting online text).
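The error measure used for this comparison can be sketched as follows, assuming it takes the usual form: the squared deviations between teacher and network output at the test step are averaged over the trials, divided by the signal variance, and the square root is taken. The function and argument names are hypothetical.

```python
import math


def nrmse(teacher_values, network_values, signal_variance):
    """Normalized root mean square error over independent trials.

    teacher_values[j] and network_values[j] are the teacher and network
    outputs at the test step in trial j.
    """
    trials = len(teacher_values)
    squared = sum((d - y) ** 2 for d, y in zip(teacher_values, network_values))
    return math.sqrt(squared / (trials * signal_variance))
```

Dividing by the variance makes the measure independent of the signal's scale: a predictor that always outputs the signal mean would score about 1.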
The main reason for the jump in modeling accuracy is that ESNs capitalize on a massive short-term memory. We showed analytically (16) that under certain conditions an ESN of size N may be able to "remember" a number of previous inputs that is of the same order of magnitude as N. This information is more massive than the information used in other techniques (supporting online text).
Fig. 2. (A) Prediction output of the trained ESN (dotted) overlaid with the correct continuation (solid). (B) Learning the MG attractor. Three sample activation traces of internal neurons are shown. They echo the teacher signal d(n). After training, the desired output is recreated from the echo signals through output connections (dotted arrows) whose weights wi are the result of the training procedure.
We now illustrate the approach in a task of practical relevance, namely, the equalization of a wireless communication channel (7). The essentials of equalization are as follows: A sender wants to communicate a symbol sequence s(n). This sequence is first transformed into an analog envelope signal d(n), then modulated on a high-frequency carrier signal and transmitted, then received and demodulated into an analog signal u(n), which is a corrupted version of d(n). Major sources of corruption are noise (thermal or due to interfering signals), multipath propagation, which leads to a superposition of adjacent symbols (intersymbol interference), and nonlinear distortion induced by operating the sender's power amplifier in the high-gain region. To avoid the latter, the actual power amplification is run well below the maximum amplification possible, thereby incurring a substantial loss in energy efficiency, which is clearly undesirable in cell-phone and satellite communications. The corrupted signal u(n) is then passed through an equalizing filter whose output y(n) should restore u(n) as closely as possible to d(n). Finally, the equalized signal y(n) is converted back into a symbol sequence. The quality measure for the entire process is the fraction of incorrect symbols finally obtained (symbol error rate).
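The last two steps of this pipeline, converting the equalized signal back into symbols and scoring the result, can be sketched as follows. The four-level alphabet, the nearest-symbol decision rule, and the sample values are assumptions made for illustration; the text above does not specify them.

```python
def decide_symbols(equalized, alphabet):
    """Map each equalized sample y(n) to the nearest symbol of the alphabet."""
    return [min(alphabet, key=lambda s: abs(s - y)) for y in equalized]


def symbol_error_rate(sent, equalized, alphabet):
    """Fraction of symbols that differ from the sent sequence after the decision step."""
    decided = decide_symbols(equalized, alphabet)
    errors = sum(1 for s, r in zip(sent, decided) if s != r)
    return errors / len(sent)


# Hypothetical alphabet and signals, for illustration only.
alphabet = (-3, -1, 1, 3)
sent = [3, -1, 1, -3, 1]
equalized = [2.7, -0.8, 2.2, -3.4, 0.9]
ser = symbol_error_rate(sent, equalized, alphabet)
```

Here the third sample (2.2) is decided as 3 although 1 was sent, so one of the five symbols is wrong.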
To compare the performance of an ESN equalizer with standard techniques, we took a channel model for a nonlinear wireless transmission system from a study (17) that compared three customary nonlinear equalization methods: a linear decision feedback equalizer (DFE), which is actually a nonlinear method; a Volterra DFE; and a bilinear DFE. The model equation featured intersymbol interference across 10 consecutive symbols, a second-order and a third-order nonlinear distortion, and additive white Gaussian noise. All methods investigated in that study had 47 adjustable parameters and used sequences of 5000 symbols for training. To make the ESN equalizer comparable with the equalizers studied in (17), we took ESNs with a reservoir of 46 neurons (which is small for the ESN approach), which yielded 47 adjustable parameters. (The 47th comes from a direct connection from the input to the output neuron.)
<<FIGURE>>
Fig. 3. Results of using an ESN for nonlinear channel equalization. Plot shows signal error rate (SER) versus signal-to-noise ratio (SNR). (a) Linear DFE. (b) Volterra DFE. (c) Bilinear DFE. [(a) to (c) taken from (20)]. (d) Blue line represents average ESN performance with randomly generated reservoirs. Error bars, variation across networks. (e) Green line indicates performance of best network chosen from the networks averaged in (d). Error bars, variation across learning trials.
REPORTS
We carried out numerous learning trials (7) to obtain ESN equalizers, using an online learning method (a version of the recursive least squares algorithm known from linear adaptive filters) to train the output weights on 5000-step training sequences. We chose an online adaptation scheme here because the methods in (17) were online adaptive, too, and because wireless communication channels mostly are time-varying, such that an equalizer must adapt to changing system characteristics. The entire learning-testing procedure was repeated for signal-to-noise ratios ranging from 12 to 32 dB. Figure 3 compares the average symbol error rates obtained with the results reported in (17), showing an improvement of two orders of magnitude for high signal-to-noise ratios.
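The online training of the output weights can be sketched with the textbook recursive least squares (RLS) recursion. This is not the authors' exact variant: the forgetting factor, the initialization constant `delta`, and the synthetic data are assumptions.

```python
import numpy as np


def rls_train(states, targets, forgetting=1.0, delta=1e-3):
    """Update the output weights one sample at a time with recursive least squares."""
    n_weights = states.shape[1]
    w = np.zeros(n_weights)
    # P tracks the inverse of the (regularized) input correlation matrix.
    P = np.eye(n_weights) / delta
    for x, d in zip(states, targets):
        Px = P @ x
        gain = Px / (forgetting + x @ Px)
        error = d - w @ x            # a priori error with the current weights
        w = w + gain * error
        P = (P - np.outer(gain, Px)) / forgetting
    return w


# Synthetic check: recover known weights from noise-free linear targets.
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 5))
true_w = np.array([0.5, -1.0, 2.0, 0.0, 0.25])
targets = states @ true_w
w = rls_train(states, targets)
```

In the equalization setting described above, each row of `states` would hold the 46 reservoir activations plus the input u(n), giving the 47 adjustable weights. A forgetting factor below 1 discounts old samples, which is what lets the readout follow a time-varying channel.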
For tasks with multichannel input and/or output, the ESN approach can be accommodated simply by adding more input or output neurons (16, 18).
ESNs can be applied to all basic tasks of signal processing and control, including time series prediction, inverse modeling, pattern generation, event detection and classification, modeling distributions of stochastic processes, filtering, and nonlinear control (16, 18, 19, 20). Because a single learning run takes only a few seconds (or minutes, for very large data sets and networks), engineers can test out variants at a high turnover rate, a crucial factor for practical usability.
ESNs have been developed from a mathematical and engineering perspective, but exhibit typical features of biological RNNs: a large number of neurons, recurrent pathways, sparse random connectivity, and local modification of synaptic weights. The idea of using randomly connected RNNs to represent and memorize dynamic input in network states has frequently been explored in specific contexts, for instance, in artificial intelligence models of associative memory (21), models of prefrontal cortex function in sensory-motor sequencing tasks (22), models of birdsong (23), models of the cerebellum (24), and general computational models of neural oscillators (25). Many different learning mechanisms were considered, mostly within the RNN itself. The contribution of the ESN is to elucidate the mathematical properties of large RNNs such that they can be used with a linear, trainable readout mechanism for general black-box modeling. An approach essentially equivalent to ESNs, liquid state networks (26, 27), has been developed independently to model computations in cortical microcircuits. Recent findings in neurophysiology suggest that the basic ESN/liquid state network principle seems not uncommon in biological networks (28–30) and could eventually be exploited to control prosthetic devices by signals collected from a collective of neurons (31).
References and Notes
1. K.-I. Funahashi, Y. Nakamura, Neural Netw. 6, 801 (1993).
2. D. Zipser, R. J. Williams, Neural Comput. 1, 270 (1989).
3. P. J. Werbos, Proc. IEEE 78, 1550 (1990).
4. L. A. Feldkamp, D. V. Prokhorov, C. F. Eagen, F. Yuan, in Nonlinear Modeling: Advanced Black-Box Techniques, J. A. K. Suykens, J. Vandewalle, Eds. (Kluwer, Dordrecht, Netherlands, 1998), pp. 29–54.
5. K. Doya, in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. (MIT Press, Cambridge, MA, 1995), pp. 796–800.
6. H. Jaeger, “Tutorial on training recurrent neural networks” (GMD-Report 159, German National Research Institute for Computer Science, 2002); ftp://borneo.gmd.de/pub/indy/publications_herbert/CompleteTutorialTechrep.pdf.
7. Materials and methods are available as supporting material on Science Online.
8. M. C. Mackey, L. Glass, Science 197, 287 (1977).
9. J. Vesanto, in Proc. WSOM <20>97 (1997); www.cis.hut.<2E>/ projects/monitor/publications/papers/wsom97.ps.
10. L. Chudy, I. Farkas, Neural Network World 8, 481 (1998).
11. H. Bersini, M. Birattari, G. Bontempi, in Proc. IEEE World Congr. on Computational Intelligence (IJCNN <20>98) (1997), pp. 2102<30>2106; ftp://iridia.ulb.ac.be/ pub/lazy/papers/IridiaTr1997-13_2.ps.gz.
12. T. M. Martinetz, S. G. Berkovich, K. J. Schulten, IEEE Trans. Neural Netw. 4, 558 (1993).
13. X. Yao, Y. Liu, IEEE Trans. Neural Netw. 8, 694 (1997).
14. F. Gers, D. Eck, J. F. Schmidhuber, "Applying LSTM to time series predictable through time-window approaches" (IDSIA-IDSIA-22-00, 2000); www.idsia.ch/ felix/Publications.html.
15. J. McNames, J. A. K. Suykens, J. Vandewalle, Int. J. Bifurcat. Chaos 9, 1485 (1999).
16. H. Jaeger, "Short term memory in echo state networks" (GMD-Report 152, German National Research Institute for Computer Science, 2002); ftp://borneo.gmd.de/pub/indy/publications_herbert/STMEchoStatesTechRep.pdf.
17. V. J. Mathews, J. Lee, in Advanced Signal Processing: Algorithms, Architectures, and Implementations V (Proc. SPIE Vol. 2296) (SPIE, San Diego, CA, 1994), pp. 317-327.
18. J. Hertzberg, H. Jaeger, F. Schönherr, in Proc. 15th Europ. Conf. on Art. Int. (ECAI 02), F. van Harmelen, Ed. (IOS Press, Amsterdam, 2002), pp. 708-712; www.ais.fhg.de/schoenhe/papers/ECAI02.pdf.
19. H. Jaeger, "The echo state approach to analysing and training recurrent neural networks" (GMD-Report 148, German National Research Institute for Computer Science, 2001); ftp://borneo.gmd.de/pub/indy/publications_herbert/EchoStatesTechRep.pdf.
20. H. Jaeger, in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, K. Obermayer, Eds. (MIT Press, Cambridge, MA, 2003), pp. 593-600.
21. G. E. Hinton, in Parallel Models of Associative Memory, G. E. Hinton, J. A. Anderson, Eds. (Erlbaum, Hillsdale, NJ, 1981), pp. 161-187.
22. D. G. Beiser, J. C. Houk, J. Neurophysiol. 79, 3168 (1998).
23. S. Dehaene, J.-P. Changeux, J.-P. Nadal, Proc. Natl. Acad. Sci. U.S.A. 84, 2727 (1987).
24. M. Kawato, in The Handbook of Brain Theory and Neural Networks, M. Arbib, Ed. (MIT Press, Cambridge, MA, 1995), pp. 172-178.
25. K. Doya, S. Yoshizawa, Neural Netw. 2, 375 (1989).
26. W. Maass, T. Natschläger, H. Markram, Neural Comput. 14, 2531 (2002).
27. W. Maass, T. Natschläger, H. Markram, in Computational Neuroscience: A Comprehensive Approach, J. Feng, Ed. (Chapman & Hall/CRC, 2003), pp. 575-605.
28. G. B. Stanley, F. F. Li, Y. Dan, J. Neurosci. 19, 8036 (1999).
29. G. B. Stanley, Neurocomputing 38-40, 1703 (2001).
30. W. M. Kistler, Ch. I. de Zeeuw, Neural Comput. 14, 2597 (2002).
31. S. Mussa-Ivaldi, Nature 408, 361 (2000).
32. The first author thanks T. Christaller for unfaltering support and W. Maass for friendly cooperation. International patents are claimed by Fraunhofer AIS (PCT/EP01/11490).
Supporting Online Material
www.sciencemag.org/cgi/content/full/304/5667/78/DC1
Materials and Methods
SOM Text
Figs. S1 to S4
References
Ultrafast Electron Crystallography of Interfacial Water
Chong-Yu Ruan, Vladimir A. Lobastov, Franco Vigliotti, Songye Chen, Ahmed H. Zewail*
We report direct determination of the structures and dynamics of interfacial water on a hydrophilic surface with atomic-scale resolution using ultrafast electron crystallography. On the nanometer scale, we observed the coexistence of ordered surface water and crystallite-like ice structures, evident in the superposition of Bragg spots and Debye-Scherrer rings. The structures were determined to be dominantly cubic, but each undergoes different dynamics after the ultrafast substrate temperature jump. From changes in local bond distances (OH···O and O···O) with time, we elucidated the structural changes in the far-from-equilibrium regime at short times and near-equilibration at long times.
The nature of interfacial molecular assemblies of nanometer scale is of fundamental importance to chemical and biological phenomena (1-4). For water, the directional molecular features of hydrogen bonding (5, 6) and the different structures possible, from amorphous (7) to crystalline (8), make the interfacial (9) collective assembly on the mesoscopic (10) scale much less understood. Structurally, the nature of water on a substrate is determined by forces of orientation at the interface and by the net charge density, which establishes the hydrophilic or hydrophobic character of the substrate. However, the transformation from ordered to disordered structure and their coexistence critically depends on the time scales for the movements of atoms locally and at long range. Therefore, it is essential to elucidate the nature of these structures and the time scales for their equilibration.
Laboratory for Molecular Sciences, Arthur Amos Noyes Laboratory of Chemical Physics, California Institute of Technology, Pasadena, CA 91125, USA.
*To whom correspondence should be addressed. E-mail: zewail@caltech.edu
Here, we report direct determination of the structures of interfacial water with atomic-scale resolution, using diffraction and the dynamics following ultrafast infrared (IR) laser-initiated
temperature jump. Interfacial water is formed on a hydrophilic surface (silicon, chlorine-terminated) under controlled ultrahigh vacuum (UHV) conditions (Fig. 1). With these atomic-scale spatial, temporal, and energy resolutions, the evolution of nonequilibrium structures was monitored, their ordered or disordered nature was established, and the time scale for the breakage of long-range bonding and formation of new structures was determined. We identified the structured and ordered interfacial water from the Bragg diffraction and the layered crystallite structure from the Debye-Scherrer rings. The temporal evolution of interfacial water and layered ice after the temperature jump was studied with submonolayer sensitivity. We compared these results with those obtained on hydrophobic surfaces, such as hydrogen-terminated silicon or silver substrate.
Spectroscopic techniques, such as internal reflection (11) and nonlinear [second-harmonic generation (12) and sum-frequency generation
<<FIGURE>>
Fig. 1. Structured water at the hydrophilic interface. The chlorine termination on a <<FORMULA>> substrate forms a hydrophilic layer that orients the water bilayer. The closest packing distance (4.43) between oxygen atoms in the bottom layer of water is similar to the distance (4.50) between the on-top and interstitial sites of the chlorine layer, resulting in specific bilayer orientations (30) with respect to the silicon substrate. This ordered stacking persists for three to four bilayers (1 nm) before disorientation takes place and results in crystallite islands, forming the layered structure. The size of atoms is not to scale for the van der Waals radii.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Identity Mappings in Deep Residual Networks
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Microsoft Research
Abstract
Deep residual networks [1] have emerged as a family of extremely deep
architectures showing compelling accuracy and nice convergence behaviors.
In this paper, we analyze the propagation formulations behind the residual
building blocks, which suggest that the forward and backward signals can be
directly propagated from one block to any other block, when using identity
mappings as the skip connections and after-addition activation. A series of
ablation experiments support the importance of these identity mappings. This
motivates us to propose a new residual unit, which makes training easier and
improves generalization. We report improved results using a 1001-layer ResNet
on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet.
Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
1 Introduction
Deep residual networks (ResNets) [1] consist of many stacked "Residual Units".
Each unit (Fig. 1(a)) can be expressed in a general form:
<<FORMULA>>
where xl and <<FORMULA>> are input and output of the l-th unit, and F is a residual
function. In [1], <<FORMULA>> is an identity mapping and f is a ReLU [2] function.
ResNets that are over 100 layers deep have shown state-of-the-art accuracy for
several challenging recognition tasks on ImageNet [3] and MS COCO [4]
competitions. The central idea of ResNets is to learn the additive residual
function F with respect to <<FORMULA>>, with a key choice of using an identity
mapping <<FORMULA>>. This is realized by attaching an identity skip connection
("shortcut").
In this paper, we analyze deep residual networks by focusing on creating a
direct path for propagating information not only within a residual unit,
but through the entire network. Our derivations reveal that if both <<FORMULA>> and
<<FORMULA>> are identity mappings, the signal could be directly propagated from one
unit to any other units, in both forward and backward passes. Our experiments
empirically show that training in general becomes easier when the architecture
is closer to the above two conditions.
<<FIGURE>>
Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey
arrows indicate the easiest paths for the information to propagate, corresponding to
the additive term "xl" in Eqn.(4) (forward propagation) and the additive term "1" in
Eqn.(5) (backward propagation). Right: training curves on CIFAR-10 of 1001-layer
ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote
training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train.
To understand the role of skip connections, we analyze and compare various
types of <<FORMULA>>. We find that the identity mapping <<FORMULA>> chosen in [1]
achieves the fastest error reduction and lowest training loss among all variants
we investigated, whereas skip connections of scaling, gating [5,6,7], and 1x1
convolutions all lead to higher training loss and error. These experiments suggest
that keeping a "clean" information path (indicated by the grey arrows in Fig. 1, 2,
and 4) is helpful for easing optimization.
To construct an identity mapping <<FORMULA>>, we view the activation
functions (ReLU and BN [8]) as "pre-activation" of the weight layers, in contrast
to conventional wisdom of "post-activation". This point of view leads to a new
residual unit design, shown in Fig. 1(b). Based on this unit, we present
competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier
to train and generalizes better than the original ResNet in [1]. We further report
improved results on ImageNet using a 200-layer ResNet, for which the counterpart
of [1] starts to overfit. These results suggest that there is much room to
exploit the dimension of network depth, a key to the success of modern deep
learning.
2 Analysis of Deep Residual Networks
The ResNets developed in [1] are modularized architectures that stack building
blocks of the same connecting shape. In this paper we call these blocks "Residual
Units". The original Residual Unit in [1] performs the following computation:
<<FORMULA>>; (1)
<<FORMULA>>. (2)
Here xl is the input feature to the l-th Residual Unit. <<FORMULA>> is a
set of weights (and biases) associated with the l-th Residual Unit, and K is the
number of layers in a Residual Unit (K is 2 or 3 in [1]). F denotes the residual
function, e.g., a stack of two 3x3 convolutional layers in [1]. The function f is
the operation after element-wise addition, and in [1] f is ReLU. The function h
is set as an identity mapping: <<FORMULA>>. If f is also an identity mapping: <<FORMULA>>,
we can put Eqn.(2) into Eqn.(1) and obtain:
<<FORMULA>>. (3)
Recursively (<<FORMULA>>, etc.) we will have:
<<FORMULA>>; (4)
<<FORMULA>>; (4)
for any deeper unit L and any shallower unit l. Eqn.(4) exhibits some nice
properties.
(i) The feature xL of any deeper unit L can be represented as the
feature xl of any shallower unit l plus a residual function in a form of <<FORMULA>>,
indicating that the model is in a residual fashion between any units L and l.
(ii) The feature <<FORMULA>>, of any deep unit L, is the summation
of the outputs of all preceding residual functions (<<FORMULA>>). This is in contrast to
a "plain network" where a feature xL is a series of matrix-vector products, say, <<FORMULA>>
(ignoring BN and ReLU).
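The additive structure described by Eqn.(4) can be checked numerically. The following is a minimal sketch in plain Python, assuming scalar features and a hypothetical toy residual function `residual` (neither is from the paper). It unrolls the identity-shortcut update x_{l+1} = x_l + F(x_l, w_l) and confirms that a deeper feature equals a shallower feature plus the sum of the residual outputs in between.

```python
import math


def residual(x: float, w: float) -> float:
    """Hypothetical toy residual function F(x, w); any function would do."""
    return w * math.tanh(x)


def forward(x0: float, weights: list[float]) -> tuple[list[float], list[float]]:
    """Unroll x_{l+1} = x_l + F(x_l, w_l) (identity shortcut, identity f).

    Returns the features [x_0, ..., x_L] and the residual outputs
    [F(x_0, w_0), ..., F(x_{L-1}, w_{L-1})].
    """
    xs, fs = [x0], []
    for w in weights:
        f = residual(xs[-1], w)
        fs.append(f)
        xs.append(xs[-1] + f)
    return xs, fs


weights = [0.3, -0.2, 0.5, 0.1, -0.4]  # illustrative values only
xs, fs = forward(1.0, weights)

# Additive form: x_L = x_l + sum_{i=l}^{L-1} F(x_i, w_i) for any l < L.
l, L = 1, 5
lhs = xs[L]
rhs = xs[l] + sum(fs[l:L])
print(abs(lhs - rhs) < 1e-12)
```

The same check passes for any pair l < L, which is the sense in which the model is "in a residual fashion between any units".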
Eqn.(4) also leads to nice backward propagation properties. Denoting the
loss function as E, from the chain rule of backpropagation [9] we have:
<<FORMULA>> (5)
Eqn.(5) indicates that the gradient ∂E/∂xl can be decomposed into two additive
terms: a term of ∂E/∂xL that propagates information directly without concerning
any weight layers, and another term of ∂E/∂xL · (∂/∂xl) Σ F (with F summed over
units i = l, ..., L-1) that propagates through the weight layers. The additive term
of ∂E/∂xL ensures that information is directly propagated back to any shallower
unit l. Eqn.(5) also suggests that it is unlikely for the gradient ∂E/∂xl to be
canceled out for a mini-batch, because in general the term (∂/∂xl) Σ F cannot be
always -1 for all samples in a mini-batch. This implies that the gradient of a layer
does not vanish even when the weights are arbitrarily small.
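The "1 plus a weight-dependent term" structure of the backward pass can be illustrated with finite differences. This is a sketch only, assuming scalar features, a hypothetical toy residual function, and a central-difference helper `central_diff` (all names are illustrative, not from the paper). It checks that the derivative of x_L with respect to x_l equals 1 plus the derivative of the summed residual outputs, and that the derivative stays close to 1 when the weights are tiny.

```python
import math
from typing import Callable


def residual(x: float, w: float) -> float:
    """Hypothetical toy residual function F(x, w) = w * tanh(x)."""
    return w * math.tanh(x)


def x_L(x_l: float, weights: list[float]) -> float:
    """Apply x_{i+1} = x_i + F(x_i, w_i) once per weight."""
    x = x_l
    for w in weights:
        x = x + residual(x, w)
    return x


def sum_of_residuals(x_l: float, weights: list[float]) -> float:
    """Accumulate sum_i F(x_i, w_i) along the same forward pass."""
    x, total = x_l, 0.0
    for w in weights:
        f = residual(x, w)
        total += f
        x += f
    return total


def central_diff(fn: Callable[[float], float], x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


weights = [0.3, -0.2, 0.5]  # illustrative values only
x0 = 0.7

d_xL = central_diff(lambda x: x_L(x, weights), x0)
d_sum = central_diff(lambda x: sum_of_residuals(x, weights), x0)
print(abs(d_xL - (1.0 + d_sum)) < 1e-6)   # direct term "1" plus residual term

tiny = [1e-8, 1e-8, 1e-8]
d_tiny = central_diff(lambda x: x_L(x, tiny), x0)
print(abs(d_tiny - 1.0) < 1e-6)           # gradient does not vanish
```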
1 It is noteworthy that there are Residual Units for increasing dimensions and reducing
feature map sizes [1] in which h is not identity. In this case the following derivations
do not hold strictly. But as there are only a very few such units (two on CIFAR and
three on ImageNet, depending on image sizes [1]), we expect that they do not have
the exponential impact as we present in Sec.3. One may also think of our derivations
as applied to all Residual Units within the same feature map size.
Discussions
Eqn.(4) and Eqn.(5) suggest that the signal can be directly propagated from
any unit to another, both forward and backward. The foundation of Eqn.(4) is
two identity mappings: (i) the identity skip connection <<FORMULA>> , and (ii) the
condition that f is an identity mapping.
These directly propagated information flows are represented by the grey
arrows in Fig. 1, 2, and 4. The above two conditions are true when these grey
arrows cover no operations (except addition) and thus are "clean". In the
following two sections we separately investigate the impacts of the two conditions.
3 On the Importance of Identity Skip Connections
Let's consider a simple modification, <<FORMULA>>, to break the identity shortcut:
<<FORMULA>>, (6)
where λl is a modulating scalar (for simplicity we still assume f is identity).
Recursively applying this formulation we obtain an equation similar to Eqn.(4):
<<FORMULA>>, or simply:
<<FORMULA>>; (7)
where the notation F̂ absorbs the scalars into the residual functions. Similar to
Eqn.(5), we have backpropagation of the following form:
<<FORMULA>> (8)
Unlike Eqn.(5), in Eqn.(8) the first additive term is modulated by a factor <<FORMULA>>.
For an extremely deep network (L is large), if λi > 1 for all i, this factor can be
exponentially large; if λi < 1 for all i, this factor can be
exponentially small and vanish, which blocks the backpropagated signal from the
shortcut and forces it to flow through the weight layers. This results in optimization
difficulties as we show by experiments.
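To get a feel for how quickly the shortcut factor grows or vanishes, the product of the scaling scalars can be evaluated directly. This is a small sketch, assuming the same constant scale on every shortcut and using 54 units (the number of Residual Units in ResNet-110, as stated in Sec. 3.1); the function name `shortcut_factor` is illustrative.

```python
import math


def shortcut_factor(lam: float, num_units: int) -> float:
    """Product of a constant shortcut scale over num_units Residual Units."""
    return math.prod([lam] * num_units)


for lam in (0.5, 1.0, 1.5):
    # Scale below 1 shrinks the direct term, scale above 1 inflates it,
    # and only the identity shortcut (scale exactly 1) leaves it untouched.
    print(lam, shortcut_factor(lam, 54))
```

With a scale of 0.5 the direct term is damped by a factor of 2^-54 after 54 units, so essentially all gradient signal must pass through the weight layers.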
In the above analysis, the original identity skip connection in Eqn.(3) is
replaced with a simple scaling <<FORMULA>>. If the skip connection <<FORMULA>> represents
more complicated transforms (such as gating and 1x1 convolutions), in Eqn.(8) the first
term becomes <<FORMULA>> where h' is the derivative of h. This product may
also impede information propagation and hamper the training procedure
as witnessed in the following experiments.
<<FIGURE>>
Figure 2. Various types of shortcut connections used in Table 1. The grey arrows
indicate the easiest paths for the information to propagate. The shortcut connections
in (b-f) are impeded by different components. For simplifying illustrations we do not
display the BN layers, which are adopted right after the weight layers for all units here.
3.1 Experiments on Skip Connections
We experiment with the 110-layer ResNet as presented in [1] on CIFAR-10 [10].
This extremely deep ResNet-110 has 54 two-layer Residual Units (consisting of
3x3 convolutional layers) and is challenging for optimization. Our implementation
details (see appendix) are the same as [1]. Throughout this paper we report
the median accuracy of 5 runs for each architecture on CIFAR, reducing the
impacts of random variations.
Though our above analysis is driven by identity f, the experiments in this
section are all based on f = ReLU as in [1]; we address identity f in the next
section. Our baseline ResNet-110 has 6.61% error on the test set. The comparisons
of other variants (Fig. 2 and Table 1) are summarized as follows:
Constant scaling. We set <<FORMULA>> for all shortcuts (Fig. 2(b)). We further
study two cases of scaling F: (i) F is not scaled; or (ii) F is scaled by a constant
scalar of <<FORMULA>>, which is similar to the highway gating [6,7] but with frozen
gates. The former case does not converge well; the latter is able to converge,
but the test error (Table 1, 12.35%) is substantially higher than the original
ResNet-110. Fig. 3(a) shows that the training error is higher than that of the
original ResNet-110, suggesting that the optimization has difficulties when the
shortcut signal is scaled down.
Table 1. Classification error on the CIFAR-10 test set using ResNet-110 [1], with
different types of shortcut connections applied to all Residual Units. We report "fail"
when the test error is higher than 20%.
<<TABLE>>
Exclusive gating. Following the Highway Networks [6,7] that adopt a gating
mechanism [5], we consider a gating function <<FORMULA>> where a
transform is represented by weights Wg and biases <<bg>> followed by the sigmoid
function <<FORMULA>>. In a convolutional network <<g(x)>> is realized by a <<FORMULA>>
convolutional layer. The gating function modulates the signal by element-wise
multiplication.
We investigate the "exclusive" gates as used in [6,7]: the F path is scaled
by g(x) and the shortcut path is scaled by <<FORMULA>>. See Fig. 2(c). We find that the
initialization of the biases <<bg>> is critical for training gated models, and following
the guidelines in [6,7], we conduct hyper-parameter search on the initial value of
<<bg>> in the range of 0 to -10 with a decrement step of -1 on the training set by
cross-validation. The best value (-6 here) is then used for training on the training
set, leading to a test result of 8.70% (Table 1), which still lags far behind the
ResNet-110 baseline. Fig. 3(b) shows the training curves. Table 1 also reports the
results of using other initialized values, noting that the exclusive gating network
does not converge to a good solution when <<bg>> is not appropriately initialized.
The impact of the exclusive gating mechanism is two-fold. When <<FORMULA>>
approaches 1, the gated shortcut connections are closer to identity which helps
information propagation; but in this case <<g(x)>> approaches 0 and suppresses the
function F. To isolate the effects of the gating functions on the shortcut path
alone, we investigate a non-exclusive gating mechanism next.
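The two-fold effect of exclusive gating can be seen in a scalar sketch. The code below is illustrative only: it assumes a scalar gate g(x) = sigmoid(w_g * x + b_g) in place of the convolutional gate, and the names `exclusive_gate`, `w_g`, and `b_g` are hypothetical. With a strongly negative bias the shortcut scale 1 - g(x) is near 1, but the residual branch is scaled by g(x), which is then near 0.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def exclusive_gate(x: float, f_out: float, w_g: float, b_g: float) -> float:
    """Scalar sketch: residual path scaled by g(x), shortcut by 1 - g(x)."""
    g = sigmoid(w_g * x + b_g)
    return (1.0 - g) * x + g * f_out


x, f_out = 2.0, 10.0  # illustrative input and residual-branch output
for b_g in (0.0, -6.0):
    g = sigmoid(b_g)  # gate value when w_g = 0, so only the bias matters
    print(b_g, round(1.0 - g, 4), exclusive_gate(x, f_out, 0.0, b_g))
```

At a bias of 0 both paths are halved; at a bias of -6 the shortcut is almost an identity while the residual contribution is almost entirely suppressed.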
Shortcut-only gating. In this case the function F is not scaled; only the
shortcut path is gated by <<FORMULA>>. See Fig. 2(d). The initialized value of <<bg>> is still
essential in this case. When the initialized <<bg>> is 0 (so initially the expectation
of <<FORMULA>> is 0.5), the network converges to a poor result of 12.86% (Table 1).
This is also caused by higher training error (Fig. 3(c)).
<<FIGURE>>
Figure 3. Training curves on CIFAR-10 of various shortcuts. Solid lines denote test
error (y-axis on the right), and dashed lines denote training loss (y-axis on the left).
When the initialized <<bg>> is very negatively biased (e.g., -6), the value of
<<FORMULA>> is closer to 1 and the shortcut connection is nearly an identity mapping.
Therefore, the result (6.91%, Table 1) is much closer to the ResNet-110 baseline.
1x1 convolutional shortcut. Next we experiment with 1x1 convolutional
shortcut connections that replace the identity. This option has been investigated
in [1] (known as option C) on a 34-layer ResNet (16 Residual Units) and shows
good results, suggesting that 1x1 shortcut connections could be useful. But we
find that this is not the case when there are many Residual Units. The 110-layer
ResNet has a poorer result (12.22%, Table 1) when using 1x1 convolutional
shortcuts. Again, the training error becomes higher (Fig. 3(d)). When stacking
so many Residual Units (54 for ResNet-110), even the shortest path may still
impede signal propagation. We witnessed similar phenomena on ImageNet with
ResNet-101 when using 1x1 convolutional shortcuts.
Dropout shortcut. Last we experiment with dropout [11] (at a ratio of 0.5)
which we adopt on the output of the identity shortcut (Fig. 2(f)). The network
fails to converge to a good solution. Dropout statistically imposes a scale of λ
with an expectation of 0.5 on the shortcut, and similar to constant scaling by
0.5, it impedes signal propagation.
Table 2. Classification error (%) on the CIFAR-10 test set using different activation
functions.
<<TABLE>>
<<FIGURE>>
Figure 4. Various usages of activation in Table 2. All these units consist of the same
components; only the orders are different.
3.2 Discussions
As indicated by the grey arrows in Fig. 2, the shortcut connections are the
most direct paths for the information to propagate. Multiplicative manipulations
(scaling, gating, 1x1 convolutions, and dropout) on the shortcuts can hamper
information propagation and lead to optimization problems.
It is noteworthy that the gating and 1x1 convolutional shortcuts introduce
more parameters, and should have stronger representational abilities than
identity shortcuts. In fact, the shortcut-only gating and 1x1 convolution cover the
solution space of identity shortcuts (i.e., they could be optimized as identity
shortcuts). However, their training error is higher than that of identity
shortcuts, indicating that the degradation of these models is caused by optimization
issues, instead of representational abilities.
4 On the Usage of Activation Functions
Experiments in the above section support the analysis in Eqn.(5) and Eqn.(8),
both being derived under the assumption that the after-addition activation f
is the identity mapping. But in the above experiments f is ReLU as designed
in [1], so Eqn.(5) and (8) are approximate in the above experiments. Next we
investigate the impact of f.
We want to make f an identity mapping, which is done by re-arranging
the activation functions (ReLU and/or BN). The original Residual Unit in [1]
has a shape in Fig. 4(a): BN is used after each weight layer, and ReLU is
adopted after BN except that the last ReLU in a Residual Unit is after
element-wise addition (f = ReLU). Fig. 4(b-e) show the alternatives we investigated,
explained as follows.
4.1 Experiments on Activation
In this section we experiment with ResNet-110 and a 164-layer Bottleneck [1]
architecture (denoted as ResNet-164). A bottleneck Residual Unit consists of a
1x1 layer for reducing dimension, a 3x3 layer, and a 1x1 layer for restoring
dimension. As designed in [1], its computational complexity is similar to the
two-3x3 Residual Unit. More details are in the appendix. The baseline ResNet-164
has a competitive result of 5.93% on CIFAR-10 (Table 2).
BN after addition. Before turning f into an identity mapping, we go the
opposite way by adopting BN after addition (Fig. 4(b)). In this case f involves
BN and ReLU. The results become considerably worse than the baseline
(Table 2). Unlike the original design, now the BN layer alters the signal that passes
through the shortcut and impedes information propagation, as reflected by the
difficulties on reducing training loss at the beginning of training (Fig. 6 left).
ReLU before addition. A naive choice of making f into an identity mapping
is to move the ReLU before addition (Fig. 4(c)). However, this leads to a
non-negative output from the transform F, while intuitively a "residual" function
should take values in (-∞, +∞). As a result, the forward propagated signal
is monotonically increasing. This may impact the representational ability,
and the result is worse (7.84%, Table 2) than the baseline. We expect to have
a residual function taking values in (-∞, +∞). This condition is satisfied by
other Residual Units including the following ones.
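The monotonicity argument can be made concrete with a scalar sketch. The code below assumes a hypothetical list of raw branch outputs standing in for the weight layers (the values are illustrative, not measurements). Because the ReLU is applied before the addition, every residual is non-negative, so the forward signal can never decrease from one unit to the next.

```python
def relu(v: float) -> float:
    """Rectified linear unit."""
    return max(0.0, v)


def forward_relu_before_add(x0: float, raw_outputs: list[float]) -> list[float]:
    """x_{l+1} = x_l + ReLU(r_l), where r_l is the raw branch output."""
    xs = [x0]
    for r in raw_outputs:
        xs.append(xs[-1] + relu(r))
    return xs


def forward_unconstrained(x0: float, raw_outputs: list[float]) -> list[float]:
    """x_{l+1} = x_l + r_l, i.e. the residual may take either sign."""
    xs = [x0]
    for r in raw_outputs:
        xs.append(xs[-1] + r)
    return xs


raw = [0.4, -0.7, 0.1, -0.3, 0.2]  # hypothetical raw branch outputs
clipped = forward_relu_before_add(1.0, raw)
free = forward_unconstrained(1.0, raw)

print(all(b >= a for a, b in zip(clipped, clipped[1:])))  # never decreases
print(any(b < a for a, b in zip(free, free[1:])))         # can move down
```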
Post-activation or pre-activation? In the original design (Eqn.(1) and
Eqn.(2)), the activation <<FORMULA>> affects both paths in the next Residual
Unit: <<FORMULA>>. Next we develop an asymmetric form
where an activation f only affects the F path: <<FORMULA>>, for
any l (Fig. 5(a) to (b)). By renaming the notations, we have the following form:
<<FORMULA>>. (9)
It is easy to see that Eqn.(9) is similar to Eqn.(4), and can enable a backward
formulation similar to Eqn.(5). For this new Residual Unit as in Eqn.(9), the new
after-addition activation becomes an identity mapping. This design means that
if a new after-addition activation f is asymmetrically adopted, it is equivalent
to recasting f as the pre-activation of the next Residual Unit. This is illustrated
in Fig. 5.
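The renaming step can be written out explicitly. The block below is a reconstruction from the surrounding definitions (identity shortcut h, residual function F with weights W, and after-addition activation f), not a copy of the typeset equations. The asymmetric activation is written as \hat{f} here to distinguish it from the original f, which is a notational assumption.

```latex
\begin{align*}
\text{original (h identity):}\quad
  & y_{l+1} = f(y_l) + \mathcal{F}\bigl(f(y_l), \mathcal{W}_{l+1}\bigr) \\
\text{asymmetric form:}\quad
  & y_{l+1} = y_l + \mathcal{F}\bigl(\hat{f}(y_l), \mathcal{W}_{l+1}\bigr) \\
\text{after renaming:}\quad
  & x_{l+1} = x_l + \mathcal{F}\bigl(\hat{f}(x_l), \mathcal{W}_l\bigr) \\
\text{unrolled:}\quad
  & x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}\bigl(\hat{f}(x_i), \mathcal{W}_i\bigr)
\end{align*}
```

The first line follows from substituting x_{l+1} = f(y_l) into y_{l+1} = x_{l+1} + F(x_{l+1}, W_{l+1}). The last line has the same additive shape as Eqn.(4), which is why the shortcut path carries the signal unchanged once the activation is moved inside the residual branch.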
<<FIGURE>>
Figure 5. Using asymmetric after-addition activation is equivalent to constructing a
pre-activation Residual Unit.
Table 3. Classification error (%) on the CIFAR-10/100 test set using the original
Residual Units and our pre-activation Residual Units.
<<TABLE>>
The distinction between post-activation/pre-activation is caused by the presence
of the element-wise addition. For a plain network that has N layers, there
are N-1 activations (BN/ReLU), and it does not matter whether we think of
them as post- or pre-activations. But for branched layers merged by addition,
the position of activation matters.
We experiment with two such designs: (i) ReLU-only pre-activation (Fig. 4(d)),
and (ii) full pre-activation (Fig. 4(e)) where BN and ReLU are both adopted
before weight layers. Table 2 shows that the ReLU-only pre-activation performs
very similarly to the baseline on ResNet-110/164. This ReLU layer is not used in
conjunction with a BN layer, and may not enjoy the benefits of BN [8].
<<FIGURE>>
Figure 6. Training curves on CIFAR-10. Left: BN after addition (Fig. 4(b)) using
ResNet-110. Right: pre-activation unit (Fig. 4(e)) on ResNet-164. Solid lines denote
test error, and dashed lines denote training loss.
Somewhat surprisingly, when BN and ReLU are both used as pre-activation,
the results are improved by healthy margins (Table 2 and Table 3). In Table 3 we
report results using various architectures: (i) ResNet-110, (ii) ResNet-164, (iii)
a 110-layer ResNet architecture in which each shortcut skips only 1 layer (i.e.,
a Residual Unit has only 1 layer), denoted as "ResNet-110 (1 layer)", and (iv)
a 1001-layer bottleneck architecture that has 333 Residual Units (111 on each
feature map size), denoted as "ResNet-1001". We also experiment on CIFAR-100.
Table 3 shows that our pre-activation models are consistently better than
the baseline counterparts. We analyze these results in the following.
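The layer counts quoted for these architectures can be cross-checked with simple arithmetic. The sketch below assumes that the total depth is the number of Residual Units times the layers per unit plus two extra layers (one input convolution and one final classifier layer); that "+2" is an assumption not stated in the text above, and the function names are illustrative.

```python
def total_depth(num_units: int, layers_per_unit: int, extra_layers: int = 2) -> int:
    """Weight layers in a stack of Residual Units plus `extra_layers`.

    extra_layers=2 is an assumption (one input convolution and one final
    classifier layer); it is not stated in the surrounding text.
    """
    return num_units * layers_per_unit + extra_layers


def units_for_depth(depth: int, layers_per_unit: int, extra_layers: int = 2) -> int:
    """Inverse of total_depth; raises if the depth does not divide evenly."""
    units, remainder = divmod(depth - extra_layers, layers_per_unit)
    if remainder:
        raise ValueError("depth is not reachable with this unit size")
    return units


print(total_depth(54, 2))        # ResNet-110: 54 two-layer units
print(total_depth(333, 3))       # ResNet-1001: 333 bottleneck units
print(units_for_depth(164, 3))   # units implied for ResNet-164
print(units_for_depth(110, 1))   # units implied for ResNet-110 (1 layer)
```

Under this assumption the 54 two-layer units and the 333 three-layer bottleneck units reproduce the stated depths of 110 and 1001 layers.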
4.2 Analysis
We find the impact of pre-activation is twofold. First, the optimization is further
eased (compared with the baseline ResNet) because f is an identity mapping.
Second, using BN as pre-activation improves regularization of the models.
Ease of optimization. This effect is particularly obvious when training
the 1001-layer ResNet. Fig. 1 shows the curves. Using the original design in
[1], the training error is reduced very slowly at the beginning of training. For
f= ReLU, the signal is impacted if it is negative, and when there are many
Residual Units, this effect becomes prominent and Eqn.(3) (so Eqn.(5)) is not
a good approximation. On the other hand, when f is an identity mapping, the
signal can be propagated directly between any two units. Our 1001-layer network
reduces the training loss very quickly (Fig.1). It also achieves the lowest loss
among all models we investigated, suggesting the success of optimization.
We also find that the impact of f = ReLU is not severe when the ResNet
has fewer layers (e.g., 164 in Fig. 6 (right)). The training curve seems to suffer
a little bit at the beginning of training, but goes into a healthy status soon. By
monitoring the responses we observe that this is because after some training,
the weights are adjusted into a status such that yl in Eqn.(1) is more frequently
above zero and f does not truncate it (xl is always non-negative due to the previous
ReLU, so yl is below zero only when the magnitude of F is very negative).
The truncation, however, is more frequent when there are 1000 layers.
Table 4. Comparisons with state-of-the-art methods on CIFAR-10 and CIFAR-100
using "moderate data augmentation" (flip/translation), except for ELU [12] with no
augmentation. Better results of [13,14] have been reported using stronger data
augmentation and ensembling. For the ResNets we also report the number of parameters. Our
results are the median of 5 runs with mean±std in the brackets. All ResNets results
are obtained with a mini-batch size of 128 except † with a mini-batch size of 64 (code
available at https://github.com/KaimingHe/resnet-1k-layers).
<<TABLE>>
Reducing overfitting. Another impact of using the proposed pre-activation
unit is on regularization, as shown in Fig. 6 (right). The pre-activation
version reaches slightly higher training loss at convergence, but produces lower test
error. This phenomenon is observed on ResNet-110, ResNet-110 (1-layer), and
ResNet-164 on both CIFAR-10 and 100. This is presumably caused by BN's
regularization effect [8]. In the original Residual Unit (Fig. 4(a)), although the BN
normalizes the signal, this is soon added to the shortcut and thus the merged
signal is not normalized. This unnormalized signal is then used as the input of
the next weight layer. On the contrary, in our pre-activation version, the inputs
to all weight layers have been normalized.
5 Results
Comparisons on CIFAR-10/100. Table 4 compares the state-of-the-art
methods on CIFAR-10/100, where we achieve competitive results. We note that we
do not specially tailor the network width or filter sizes, nor use regularization
techniques (such as dropout) which are very effective for these small datasets.
We obtain these results via a simple but essential concept: going deeper. These
results demonstrate the potential of pushing the limits of depth.
Comparisons on ImageNet. Next we report experimental results on the 1000-class
ImageNet dataset [3]. We have done preliminary experiments using the skip
connections studied in Fig. 2 & 3 on ImageNet with ResNet-101 [1], and observed
similar optimization difficulties. The training error of these non-identity shortcut
networks is obviously higher than the original ResNet at the first learning rate
(similar to Fig. 3), and we decided to halt training due to limited resources.
But we did finish a "BN after addition" version (Fig. 4(b)) of ResNet-101 on
ImageNet and observed higher training loss and validation error. This model's
single-crop (224x224) validation error is 24.6%/7.5%, vs. the original
ResNet-101's 23.6%/7.1%. This is in line with the results on CIFAR in Fig. 6 (left).
Table 5. Comparisons of single-crop error on the ILSVRC 2012 validation set. All
ResNets are trained using the same hyper-parameters and implementations as [1].
Our Residual Units are the full pre-activation version (Fig. 4(e)). †: code/model
available at https://github.com/facebook/fb.resnet.torch/tree/master/pretrained,
using scale and aspect ratio augmentation in [20].
<<TABLE>>
Table 5 shows the results of ResNet-152 [1] and ResNet-200 (see footnote 3), all trained from
scratch. We notice that the original ResNet paper [1] trained the models using
scale jittering with shorter sides in [256, 480], and so the test of a 224x224 crop
on s = 256 (as done in [1]) is negatively biased. Instead, we test a single 320x320
crop from s = 320, for all original and our ResNets. Even though the ResNets
are trained on smaller crops, they can be easily tested on larger crops because
the ResNets are fully convolutional by design. This size is also close to 299x299
used by Inception v3 [19], allowing a fairer comparison.
The original ResNet-152 [1] has top-1 error of 21.3% on a 320x320 crop, and
our pre-activation counterpart has 21.1%. The gain is not big on ResNet-152
because this model has not shown severe generalization difficulties. However,
the original ResNet-200 has an error rate of 21.8%, higher than the baseline
ResNet-152. But we find that the original ResNet-200 has lower training error
than ResNet-152, suggesting that it suffers from overfitting.
Our pre-activation ResNet-200 has an error rate of 20.7%, which is 1.1%
lower than the baseline ResNet-200 and also lower than the two versions of
ResNet-152. When using the scale and aspect ratio augmentation of [20,19], our
ResNet-200 has a result better than Inception v3 [19] (Table 5). Concurrent
with our work, an Inception-ResNet-v2 model [21] achieves a single-crop result
of 19.9%/4.9%. We expect our observations and the proposed Residual Unit will
help this type and generally other types of ResNets.
Computational Cost. Our models' computational complexity is linear in
depth (so a 1001-layer net is about 10x as complex as a 100-layer net). On CIFAR,
ResNet-1001 takes about 27 hours to train on 2 GPUs; on ImageNet, ResNet-200
takes about 3 weeks to train on 8 GPUs (on par with VGG nets [22]).
³ The ResNet-200 has 16 more 3-layer bottleneck Residual Units than ResNet-152,
which are added on the feature map of 28x28.
6 Conclusions
This paper investigates the propagation formulations behind the connection
mechanisms of deep residual networks. Our derivations imply that identity short-
cut connections and identity after-addition activation are essential for making
information propagation smooth. Ablation experiments demonstrate phenom-
ena that are consistent with our derivations. We also present 1000-layer deep
networks that can be easily trained and achieve improved accuracy.
Appendix: Implementation Details. The implementation details and hyper-parameters
are the same as those in [1]. On CIFAR we use only the translation
and flipping augmentation in [1] for training. The learning rate starts from 0.1,
and is divided by 10 at 32k and 48k iterations. Following [1], for all CIFAR
experiments we warm up the training by using a smaller learning rate of 0.01 for
the first 400 iterations and go back to 0.1 after that, although we remark
that this is not necessary for our proposed Residual Unit. The mini-batch size
is 128 on 2 GPUs (64 each), the weight decay is 0.0001, the momentum is 0.9,
and the weights are initialized as in [23].
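The CIFAR learning-rate schedule described above can be written as a small step function. This is only a sketch: the function name is hypothetical, and the exact treatment of the boundary iterations (whether the change takes effect at iteration 400, 32k and 48k or one step later) is an assumption the text does not settle.

```python
def cifar_learning_rate(iteration: int) -> float:
    """Learning rate for a 0-based training iteration on CIFAR.

    Assumed boundaries: warm-up covers iterations 0..399, and each
    division by 10 takes effect exactly at iteration 32,000 and 48,000.
    """
    if iteration < 400:       # warm-up with the smaller rate
        return 0.01
    if iteration < 32_000:    # base learning rate
        return 0.1
    if iteration < 48_000:    # first division by 10
        return 0.01
    return 0.001              # second division by 10


if __name__ == "__main__":
    for it in (0, 399, 400, 32_000, 48_000):
        print(it, cifar_learning_rate(it))
```

The rates are written as literals rather than computed by repeated division so that the returned values are exact decimal constants.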
On ImageNet, we train the models using the same data augmentation as in
[1]. The learning rate starts from 0.1 (no warming up), and is divided by 10 at
30 and 60 epochs. The mini-batch size is 256 on 8 GPUs (32 each). The weight
decay, momentum, and weight initialization are the same as above.
When using the pre-activation Residual Units (Fig. 4(d)(e) and Fig. 5), we
pay special attention to the first and the last Residual Units of the entire network.
For the first Residual Unit (that follows a stand-alone convolutional layer,
conv1), we adopt the first activation right after conv1 and before splitting into
two paths; for the last Residual Unit (followed by average pooling and a fully-connected
classifier), we adopt an extra activation right after its element-wise
addition. These two special cases are the natural outcome when we obtain the
pre-activation network via the modification procedure as shown in Fig. 5.
The bottleneck Residual Units (for ResNet-164/1001 on CIFAR) are
constructed following [1]. For example, a $\begin{bmatrix} 3\times3,\ 16 \\ 3\times3,\ 16 \end{bmatrix}$ unit in ResNet-110 is replaced
with a $\begin{bmatrix} 1\times1,\ 16 \\ 3\times3,\ 16 \\ 1\times1,\ 64 \end{bmatrix}$ unit in ResNet-164, both of which have roughly the same
number of parameters. For the bottleneck ResNets, when reducing the feature map
size we use projection shortcuts [1] for increasing dimensions, and when
pre-activation is used, these projection shortcuts are also with pre-activation.
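The claim that the two units have roughly the same number of parameters can be checked with a short calculation. The sketch below counts convolution weights only (biases and batch-normalization parameters are ignored) and assumes that the bottleneck unit receives and emits 64 channels while the basic unit receives and emits 16; the helper name is hypothetical.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


# Basic unit in ResNet-110: two 3x3 convolutions with 16 channels.
basic_unit = conv_params(3, 16, 16) + conv_params(3, 16, 16)

# Bottleneck unit in ResNet-164: 1x1 reduce, 3x3, 1x1 expand.
# Assumption: 64 channels enter and leave the unit.
bottleneck_unit = (
    conv_params(1, 64, 16)
    + conv_params(3, 16, 16)
    + conv_params(1, 16, 64)
)

print(basic_unit, bottleneck_unit)  # 4608 4352
```

Under these assumptions the two counts differ by about 6%, which is consistent with "roughly the same".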
References
1.He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR. (2016)
2.Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines.
In: ICML. (2010)
3.Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
Scale Visual Recognition Challenge. IJCV (2015)
4.Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P.,
Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
5.Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
(1997)
6.Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. In: ICML workshop.
(2015)
7.Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In:
NIPS. (2015)
8.Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: ICML. (2015)
9.LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
computation (1989)
10.Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech Report
(2009)
11.Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors.
arXiv:1207.0580 (2012)
12.Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
learning by exponential linear units (ELUs). In: ICLR. (2016)
13.Graham, B.: Fractional max-pooling. arXiv:1412.6071 (2014)
14.Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplic-
ity: The all convolutional net. arXiv:1412.6806 (2014)
15.Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR. (2014)
16.Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In:
AISTATS. (2015)
17.Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
Hints for thin deep nets. In: ICLR. (2015)
18.Mishkin, D., Matas, J.: All you need is a good init. In: ICLR. (2016)
19.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception
architecture for computer vision. In: CVPR. (2016)
20.Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
21.Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact
of residual connections on learning. arXiv:1602.07261 (2016)
22.Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: ICLR. (2015)
23.He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification. In: ICCV. (2015)
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Language Models are Few-Shot Learners
Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah
Jared Kaplan† Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry
Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan
Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter
Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray
Benjamin Chess Jack Clark Christopher Berner
Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei
OpenAI
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training
on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic
in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of
thousands of examples. By contrast, humans can generally perform a new language task from only
a few examples or from simple instructions, something which current NLP systems still largely
struggle to do. Here we show that scaling up language models greatly improves task-agnostic,
few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-
tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion
parameters, 10x more than any previous non-sparse language model, and test its performance in
the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning,
with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3
achieves strong performance on many NLP datasets, including translation, question-answering, and
cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as
unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same
time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some
datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally,
we find that GPT-3 can generate samples of news articles which human evaluators have difficulty
distinguishing from articles written by humans. We discuss broader societal impacts of this finding
and of GPT-3 in general.
Equal contribution
† Johns Hopkins University, OpenAI
Contents
1 Introduction
2 Approach
2.1 Model and Architectures
2.2 Training Dataset
2.3 Training Process
2.4 Evaluation
3 Results
3.1 Language Modeling, Cloze, and Completion Tasks
3.2 Closed Book Question Answering
3.3 Translation
3.4 Winograd-Style Tasks
3.5 Common Sense Reasoning
3.6 Reading Comprehension
3.7 SuperGLUE
3.8 NLI
3.9 Synthetic and Qualitative Tasks
4 Measuring and Preventing Memorization Of Benchmarks
5 Limitations
6 Broader Impacts
6.1 Misuse of Language Models
6.2 Fairness, Bias, and Representation
6.3 Energy Usage
7 Related Work
8 Conclusion
A Details of Common Crawl Filtering
B Details of Model Training
C Details of Test Set Contamination Studies
D Total Compute Used to Train Language Models
E Human Quality Assessment of Synthetic News Articles
F Additional Samples from GPT-3
G Details of Task Phrasing and Specifications
H Results on All Tasks for All Model Sizes
1 Introduction
Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly
flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word
vectors [MCCD13,PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations
and contextual state were used to form stronger representations [DL15,MBXS17,PNZtY18] (though still applied to
task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have
been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18,DCLT18,HR18].
This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension,
question answering, textual entailment, and many others, and has continued to advance based on new architectures
and algorithms [RSR+19,LOG+19,YDY+19,LCG+19]. However, a major limitation to this approach is that while
the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve
strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands
of examples specific to that task. Removing this limitation would be desirable, for several reasons.
First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the
applicability of language models. There exists a very wide range of possible useful language tasks, encompassing
anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many
of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated
for every new task.
Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness
of the model and the narrowness of the training distribution. This can create problems for the pre-training plus
fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then
fine-tuned on very narrow task distributions. For instance, [HLW+20] observe that larger models do not necessarily
generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm
can be poor because the model is overly specific to the training distribution and does not generalize well outside it
[YdC+19,MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at
human-level, may exaggerate actual performance on the underlying task [GSL+18,NK19].
Third, humans do not require large supervised datasets to learn most language tasks: a brief directive in natural
language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number
of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often
<<FIGURE>>
Figure 1.1: Language model meta-learning. During unsupervised pre-training, a language model develops a broad
set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize
the desired task. We use the term “in-context learning” to describe the inner loop of this process, which occurs within
the forward-pass upon each sequence. The sequences in this diagram are not intended to be representative of the data a
model would see during pre-training, but are intended to show that there are sometimes repeated sub-tasks embedded
within a single sequence.
<<FIGURE>>
Figure 1.2: Larger models make increasingly efficient use of in-context information. We show in-context learning
performance on a simple task requiring the model to remove random symbols from a word, both with and without a
natural language task description (see Sec. 3.9.2). The steeper “in-context learning curves” for large models demonstrate
improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range
of tasks.
sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing
to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages: it allows humans
to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy
dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.
One potential route towards addressing these issues is meta-learning¹, which in the context of language models means
the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities
at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19]
attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form
of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task
and is then expected to complete further instances of the task simply by predicting what comes next.
While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning: for example,
[RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind
the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of
solving language tasks.
Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer
language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters
[DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19],
and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream
NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a
smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and
tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong
gains with scale.
1 In the context of language models this has sometimes been called “zero-shot transfer”, but this term is potentially ambiguous:
the method is “zero-shot” in the sense that no gradient updates are performed, but it often involves providing inference-time
demonstrations to the model, so is not truly learning from zero examples. To avoid this confusion, we use the term “meta-learning”
to capture the inner-loop / outer-loop structure of the general method, and the term “in-context learning” to refer to the inner
loop of meta-learning. We further specialize the description to “zero-shot”, “one-shot”, or “few-shot” depending on how many
demonstrations are provided at inference time. These terms are intended to remain agnostic on the question of whether the model
learns new tasks from scratch at inference time or simply recognizes patterns seen during training. This is an important issue which
we discuss later in the paper, but “meta-learning” is intended to encompass both possibilities, and simply describes the inner-outer
loop structure.
<<FIGURE>>
Figure 1.3: Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance
improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are
more proficient at in-context learning. See Figure 3.8 for a more detailed analysis on SuperGLUE, a standard NLP
benchmark suite.
In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call
GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets,
as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training
set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we
allow as many demonstrations as will fit into the model's context window (typically 10 to 100), (b) “one-shot learning”,
where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only
an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional
fine-tuning setting, but we leave this to future work.
Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to
remove extraneous symbols from a word. Model performance improves with the addition of a natural language task
description, and with the number of examples in the model's context, K. Few-shot learning also improves dramatically
with model size. Though the results in this case are particularly striking, the general trends with both model size and
number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no
gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot
setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held
by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in
the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the
zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art
relative to fine-tuned models operating in the same closed-book setting.
GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning,
which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them
defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human
evaluators have difficulty distinguishing from human-generated articles.
At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This
includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE
or QuAC. By presenting a broad characterization of GPT-3's strengths and weaknesses, including these limitations, we
hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.
A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should
not be seen as a rigorous or meaningful benchmark in itself).
We also undertake a systematic study of “data contamination”, a growing problem when training high capacity models
on datasets such as Common Crawl, which can potentially include content from test datasets simply because such
content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify
its distorting effects. Although we find that data contamination has a minimal effect on GPT-3's performance on most
datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these
datasets or we note them with an asterisk, depending on the severity.
In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion
parameters) in order to compare their performance to GPT-3 in the zero-, one- and few-shot settings. Broadly, for most
tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap
between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models
are more proficient meta-learners.
Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and
broader societal impacts, and attempt a preliminary analysis of GPT-3's characteristics in this regard.
The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training
GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings.
Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3.
Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.
2 Approach
Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19],
with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use
of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for
learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings
that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a
spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this
spectrum (see Figure 2.1 for an illustration):
• Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of
a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to
hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance
on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential
for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the
training data [GSL+18,NK19], potentially resulting in an unfair comparison with human performance. In
this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be
fine-tuned in principle and this is a promising direction for future work.
• Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few
demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed.
As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example
an English sentence and the French translation), and few-shot works by giving K examples of context and
completion, and then one final example of context, with the model expected to provide the completion. We
typically set K in the range of 10 to 100 as this is how many examples can fit in the model's context window
($n_{\text{ctx}} = 2048$). The main advantages of few-shot are a major reduction in the need for task-specific data and
reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main
disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned
models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot
learning as described here for language models is related to few-shot learning as used in other contexts in
ML [HYC01,VBL+16]: both involve learning based on a broad distribution of tasks (in this case implicit in
the pre-training data) and then rapidly adapting to a new task.
• One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural
language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and
zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans.
For example, when asking humans to generate a dataset on a human worker service (for example Mechanical
Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate
the content or format of a task if no examples are given.
<<FIGURE>>
Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning. The panels above show
four methods for performing a task with a language model: fine-tuning is the traditional method, whereas zero-, one-,
and few-shot, which we study in this work, require the model to perform the task with only forward passes at test
time. We typically present the model with a few dozen examples in the few-shot setting. Exact phrasings for all task
descriptions, examples and prompts can be found in Appendix G.
• Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given
a natural language instruction describing the task. This method provides maximum convenience, potential for
robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of
pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans
to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”.
For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be
ambiguous, as it may not be clear exactly what format the table should have or what should be included (and
even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at
least some settings zero-shot is closest to how humans perform tasks: for example, in the translation example
in Figure 2.1, a human would likely know what to do from just the text instruction.
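The zero-shot, one-shot and few-shot settings described above differ only in how many demonstrations are placed in the model's context before the final query. The following is a minimal sketch of how such a prompt could be assembled; the function name, the "=>" separator and the newline layout are illustrative assumptions, not the exact phrasing used for any task (those are given in Appendix G).

```python
from typing import List, Tuple


def build_prompt(query: str,
                 demonstrations: List[Tuple[str, str]],
                 task_description: str = "",
                 k: int = 0) -> str:
    """Assemble a K-shot prompt as plain text.

    k = 0 gives the zero-shot setting, k = 1 one-shot, and larger k
    few-shot. No weights are updated; the demonstrations are only
    conditioning text. The "=>" separator is a hypothetical choice.
    """
    lines = []
    if task_description:
        lines.append(task_description)
    # K examples of (context, completion) ...
    for context, completion in demonstrations[:k]:
        lines.append(f"{context} => {completion}")
    # ... followed by one final context for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


if __name__ == "__main__":
    demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
    print(build_prompt("peppermint", demos,
                       "Translate English to French:", k=2))
```

In practice K would be limited by how many demonstrations fit in the context window, which this sketch does not check.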
Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on
zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different
problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency.
We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models.
Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance,
and are important targets for future work.
Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses
the details of how we do few-shot, one-shot, and zero-shot evaluations.
<<TABLE>>
Table 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models
which we trained. All models were trained for a total of 300 billion tokens.
2.1 Model and Architectures
We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization,
and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse
attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence
of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125
million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20]
suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a
function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for
downstream language tasks.
Table 2.1 shows the sizes and architectures of our 8 models. Here $n_{\text{params}}$ is the total number of trainable parameters,
$n_{\text{layers}}$ is the total number of layers, $d_{\text{model}}$ is the number of units in each bottleneck layer (we always have the
feedforward layer four times the size of the bottleneck layer, $d_{\text{ff}} = 4 \cdot d_{\text{model}}$), and $d_{\text{head}}$ is the dimension of each
attention head. All models use a context window of <<FORMULA>> tokens. We partition the model across GPUs along
both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural
parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models
across GPUs. Previous work [KMH + 20] suggests that validation loss is not strongly sensitive to these parameters
within a reasonably broad range.
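To make the relationship between these quantities concrete, the following is a rough sketch of how the weight-matrix parameter count of a decoder-only transformer follows from the number of layers and the bottleneck width. It assumes that the attention heads jointly span the bottleneck width, and it ignores embeddings, biases, and layer-norm parameters; the function name and the toy configuration are illustrative and are not taken from Table 2.1.

```python
def approx_non_embedding_params(n_layers: int, d_model: int, ff_multiplier: int = 4) -> int:
    """Rough count of weight-matrix parameters in a decoder-only transformer.

    Assumptions of this sketch: embeddings, biases, and layer-norm gains are
    ignored, and the attention heads together span d_model dimensions.
    """
    d_ff = ff_multiplier * d_model        # feedforward width, d_ff = 4 * d_model
    attention = 4 * d_model * d_model     # query, key, value, and output projections
    feedforward = 2 * d_model * d_ff      # up-projection and down-projection
    return n_layers * (attention + feedforward)


# Toy configuration chosen for illustration only (not a row of Table 2.1).
toy = approx_non_embedding_params(n_layers=2, d_model=8)
print(toy)  # 2 * (4*64 + 2*8*32) = 1536
```

With the feedforward layer fixed at four times the bottleneck width, the expression reduces to 12 times the number of layers times the square of the bottleneck width, which is why the count grows quadratically in width but only linearly in depth.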
2.2 Training Dataset
Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR + 19] constituting
nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same
sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have
lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets:
(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference
corpora, (2) we performed fuzzy de-duplication at the document level, within and across datasets, to prevent redundancy
and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added
known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.
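As an illustration of what document-level fuzzy de-duplication can look like, the sketch below compares documents by the Jaccard similarity of their word 3-gram sets and drops any document that is too similar to one already kept. This is a simple stand-in chosen for clarity, not the pipeline described in the appendix; the shingle length, the threshold of 0.8, and all function names are assumptions. The pairwise comparison is quadratic in the number of documents, so it would need an approximate index to be practical at web scale.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams ("shingles") for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8):
    """Keep a document only if it is not too similar to any document already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


# Illustrative documents: the second differs from the first only in its last word.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river shore",
    "a completely different sentence about training language models",
]
print(len(deduplicate(docs)))  # 2
```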
Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added
several curated high-quality datasets, including an expanded version of the WebText dataset [RWC + 19], collected
by scraping links over a longer period of time, and first described in [KMH + 20], two internet-based books corpora
(Books1 and Books2) and English-language Wikipedia.
Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from
41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering
and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets
are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently,
such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are
sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.
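The effect of sampling by quality rather than by size can be sketched with a short calculation: the average number of passes over a dataset is its mixture weight times the total training budget, divided by the dataset's size. The example below treats the weight as a fraction of training tokens, which is a simplification, and uses a hypothetical two-corpus mixture whose names and numbers are illustrative only and are not the values of Table 2.2.

```python
def epochs_seen(weights, dataset_tokens, total_training_tokens):
    """Average number of passes over each dataset during training.

    weights: fraction of training tokens drawn from each dataset (sums to 1).
    dataset_tokens: size of each dataset in tokens.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return {
        name: weights[name] * total_training_tokens / dataset_tokens[name]
        for name in weights
    }


# Hypothetical mixture for illustration; not the figures from Table 2.2.
weights = {"large_web_crawl": 0.75, "small_curated": 0.25}
sizes = {"large_web_crawl": 450e9, "small_curated": 30e9}
result = epochs_seen(weights, sizes, total_training_tokens=300e9)
print(result)  # large_web_crawl: 0.5 passes, small_curated: 2.5 passes
```

In this toy mixture the large corpus is seen less than once while the small curated corpus is repeated, which is the same qualitative trade-off described above.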
<<FIGURE>>
Figure 2.2: Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models
[KMH + 20] we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B
is almost 10x larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of compute
during pre-training. Methodology for these calculations can be found in Appendix D.
<<TABLE>>
Table 2.2: Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training
that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a
result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets
are seen less than once.
A major methodological concern with language models pretrained on a broad swath of internet data, particularly large
models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by
having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched
for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper.
Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible
to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will
more aggressively remove data contamination.
2.3 Training Process
As found in [KMH + 20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning
rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table
2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture
of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models
were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process
and hyperparameter settings are described in Appendix B.
2.4 Evaluation
For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that
task's training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and StoryCloze
there is no supervised training set available so we draw conditioning examples from the development set and evaluate
on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning
examples directly from it.
K can be any value from 0 to the maximum amount allowed by the model's context window, which is <<FORMULA>>
for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a
separate development and test set are available, we experiment with a few values of K on the development set and then
run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to
(or for K = 0, instead of) demonstrations.
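The prompt construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code itself: the function name, the way a demonstration's context and completion are joined with a space, the default two-newline delimiter, and the example pairs are all hypothetical.

```python
import random


def build_few_shot_prompt(train_examples, query_context, k, delimiter="\n\n",
                          instruction=None, seed=None):
    """Concatenate K randomly drawn demonstrations and a final context-only query.

    train_examples: list of (context, completion) pairs from the task's training set.
    instruction: optional natural language prompt placed before the demonstrations.
    """
    if k > len(train_examples):
        raise ValueError("k exceeds the number of available demonstrations")
    rng = random.Random(seed)
    demos = rng.sample(train_examples, k)  # K = 0 yields no demonstrations
    parts = []
    if instruction:
        parts.append(instruction)
    parts.extend(f"{context} {completion}" for context, completion in demos)
    parts.append(query_context)            # the model completes this last item
    return delimiter.join(parts)


# Illustrative translation pairs (hypothetical data for this sketch).
pairs = [("sea otter =>", "loutre de mer"), ("cheese =>", "fromage"),
         ("plush giraffe =>", "girafe peluche")]
print(build_few_shot_prompt(pairs, "peppermint =>", k=2,
                            instruction="Translate English to French:", seed=0))
```

Setting `k=0` with an instruction gives the zero-shot form and `k=1` the one-shot form, so a single builder covers all three settings.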
On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples
of context plus correct completion, followed by one example of context only, and compare the LM likelihood of
each completion. For most tasks we compare the per-token likelihood (to normalize for length); however, on a small
number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set
by normalizing by the unconditional probability of each completion, by computing
$P(\text{completion} \mid \text{context}) / P(\text{completion} \mid \text{answer\_context})$, where $\text{answer\_context}$
is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer
but is otherwise generic.
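The two scoring rules can be sketched as below, working in log space. The sketch assumes that per-token log-probabilities for each candidate completion have already been obtained from a language model; the numbers are made up for illustration and the function names are hypothetical.

```python
def per_token_score(token_logprobs):
    """Mean log-probability per completion token (normalizes for length)."""
    return sum(token_logprobs) / len(token_logprobs)


def unconditional_normalized_score(logp_given_context, logp_given_answer_context):
    """log of P(completion | context) / P(completion | answer_context)."""
    return sum(logp_given_context) - sum(logp_given_answer_context)


def pick(scores):
    """Return the option with the highest score."""
    return max(scores, key=scores.get)


# Made-up token log-probabilities for two candidate completions.
given_context = {"A": [-1.0, -1.0, -1.0, -1.0], "B": [-1.5, -1.5]}
given_answer_context = {"A": [-2.0, -2.0, -2.0, -2.0], "B": [-1.6, -1.6]}

raw = {k: sum(v) for k, v in given_context.items()}
per_token = {k: per_token_score(v) for k, v in given_context.items()}
normalized = {k: unconditional_normalized_score(given_context[k], given_answer_context[k])
              for k in given_context}

print(pick(raw), pick(per_token), pick(normalized))  # B A A
```

In this toy case the raw sum favors the shorter completion B, while both normalizations favor A, which illustrates why some form of normalization is applied before comparing completions of different lengths.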
On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or
“False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what
is done by [RSR + 19] (see Appendix G for details).
On tasks with free-form completion, we use beam search with the same parameters as [RSR + 19]: a beam width of 4
and a length penalty of $\alpha = 0.6$. We score the model using F1 similarity score, BLEU, or exact match, depending on
what is standard for the dataset at hand.
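For reference, two of these metrics can be sketched in a few lines. The version below uses a common token-overlap definition of F1 and a case-insensitive exact match; the normalization applied here (lowercasing and whitespace splitting only) is an assumption, since each dataset's official scorer defines its own normalization.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match(" Paris", "paris"))                         # True
print(round(token_f1("the eiffel tower", "eiffel tower"), 3)) # 0.8
```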
Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-,
and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on
the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PIQA)
where we were able to make submission work, and we submit only the 175B few-shot results, and report development
set results for everything else.
3 Results
In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6
additional extra-small models with as few as 100,000 parameters. As observed in [KMH + 20], language modeling
performance follows a power-law when making efficient use of training compute. After extending this trend by two
more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these
improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will
see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a
broad spectrum of natural language tasks.
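A power-law relationship between loss and compute appears as a straight line in log-log space, so its coefficients can be estimated with ordinary least squares on the logarithms. The sketch below assumes the form loss = a * compute^(-b) and fits it to synthetic points generated from known coefficients; the data and coefficients are illustrative and are not the measurements behind Figure 3.1.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute).

    Returns (a, b) such that loss is approximately a * compute ** (-b).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from loss = 2.5 * compute ** -0.05 (illustrative).
compute = [10.0 ** k for k in range(-3, 4)]
loss = [2.5 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
print(round(a, 3), round(b, 3))  # 2.5 0.05
```

Departures from the trend at larger scales would show up as residuals between the measured loss and the fitted line extrapolated to the new compute range.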
Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller
models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.
In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling,
such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question
answering tasks: tasks which require using the information stored in the model's parameters to answer general
knowledge questions. In Section 3.3 we evaluate the model's ability to translate between languages (especially one-shot
and few-shot). In Section 3.4 we evaluate the model's performance on Winograd Schema-like tasks. In Section 3.5 we
evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading
comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in Section 3.8 we briefly explore
NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities;
these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the
few-shot, one-shot, and zero-shot settings.
<<FIGURE>>
Figure 3.1: Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy
validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior
observed in [KMH + 20] continues for an additional two orders of magnitude with only small deviations from the
predicted curve. For this figure, we exclude embedding parameters from compute and parameter counts.
<<TABLE>>
Table 3.1: Zero-shot results on PTB language modeling dataset. Many other common language modeling datasets
are omitted because they are derived from Wikipedia or other sources which are included in GPT-3's training data.
a [RWC + 19]
3.1 Language Modeling, Cloze, and Completion Tasks
In this section we test GPT-3's performance on the traditional task of language modeling, as well as related tasks
that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible
completions of a piece of text.
3.1.1 Language Modeling
We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM + 94] dataset measured in [RWC + 19]. We omit
the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the
one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these
issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15
points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have
a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot.
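For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood that the model assigns to each unit of the evaluation text. The sketch below assumes natural-log probabilities and one entry per unit; whether the unit is a word or a sub-word token changes the resulting number, so the normalization unit must match the benchmark's convention. The function name is illustrative.

```python
import math


def perplexity(logprobs):
    """exp of the mean negative natural-log likelihood per unit."""
    return math.exp(-sum(logprobs) / len(logprobs))


# A model that assigns probability 1/4 to every unit has perplexity 4.
uniform = [math.log(0.25)] * 10
print(round(perplexity(uniform), 6))  # 4.0
```

A lower perplexity therefore means the model is, on average, less surprised by the held-out text.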
3.1.2 LAMBADA
The LAMBADA dataset [PKL + 16] tests the modeling of long-range dependencies in text: the model is asked to
predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the
continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT + 20] reflect on
the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP + 19]
and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path
forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of
8% over the previous state of the art.
<<TABLE>>
Table 3.2: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while
achieving respectable performance on two difficult completion prediction datasets.
<<FIGURE>>
Figure 3.2: On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3
2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of
the art by 18%. Note zero-shot uses a different format from one-shot and few-shot as described in the text.
LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that
classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a
standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but
also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word
filters [RWC + 19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a
cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We
use the following fill-in-the-blank format:
Alice was friends with Bob. Alice went to visit her friend ____. → Bob
George bought some baseball equipment, a ball, a glove, and a ____. →
When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase
of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model
size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy
by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot
setting. Perhaps this is because all models still require several examples to recognize the pattern.
<<TABLE>>
Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as
compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the
wiki split test server.
One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA
dataset appears to be present in our training data; however, analysis performed in Section 4 suggests negligible impact
on performance.
3.1.3 HellaSwag
The HellaSwag dataset [ZHB + 19] involves picking the best ending to a story or set of instructions. The examples were
adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy).
GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the
75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR + 19] but still a fair amount lower than the overall
SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.
3.1.4 StoryCloze
We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH + 16], which involves selecting the correct ending
sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot
setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but
improves over previous zero-shot results by roughly 10%.
3.2 Closed Book Question Answering
In this section we measure GPT-3's ability to answer questions about broad factual knowledge. Due to the immense
amount of possible queries, this task has normally been approached by using an information retrieval system to find
relevant text in combination with a model which learns to generate an answer given the question and the retrieved
text. Since this setting allows a system to search for and condition on text which potentially contains the answer it
is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well
directly answering the questions without conditioning on auxiliary information. They denote this more restrictive
evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better
and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR + 19],
WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in
the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than
previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself
is also not permitted.
The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the
one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by
14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot
result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also
makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP + 20].
GPT-3's few-shot result further improves performance another 3.2% beyond this.
On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5%
in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM,
which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of
state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to
few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions
and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this
distribution, recovering strong performance in the few-shot setting.
<<FIGURE>>
Figure 3.3: On TriviaQA GPT-3's performance grows smoothly with model size, suggesting that language models
continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains
over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG
[LPP + 20].
On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in
the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. Similar to WebQs, the large gain from zero-shot
to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to
TriviaQA and WebQs. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia
specifically, which could be testing the limits of GPT-3's capacity and broad pretraining distribution.
Overall, on one of the three datasets GPT-3's one-shot matches the open-domain fine-tuning SOTA. On the other two
datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we
find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting
the idea that model capacity translates directly to more knowledge absorbed in the parameters of the model.
3.3 Translation
For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity
concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially
when translating between French and English despite only training on 10 megabytes of remaining French text. Since we
increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training
dataset to include more representation of other languages, though this remains an area for further improvement. As
discussed in Section 2.2, the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although
GPT-3's training data is still primarily English (93% by word count), it also includes 7% of text in other languages.
These languages are documented in the supplemental material. In order to better understand translation capability, we
also expand our analysis to include two additional commonly studied languages, German and Romanian.
Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets
with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a
blend of training data that mixes many languages together in a natural way, combining them on a word, sentence,
and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in
particular. However, our one / few-shot settings aren't strictly comparable to prior unsupervised work since they make
use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.
Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task,
still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for
each translation task improves performance by over 7 BLEU and nears competitive performance with prior work.
<<TABLE>>
Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating
into English, reflecting its strength as an English LM. We report BLEU scores on the WMT14 Fr↔En,
WMT16 De↔En, and WMT16 Ro↔En datasets as measured by multi-bleu.perl with XLM's tokenization
in order to compare most closely with prior unsupervised NMT work. SacreBLEU f [Pos18] results are reported
in Appendix H. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA
with relative confidence. a [EOAG18] b [DHKH14] c [WXH + 18] d [oR16] e [LGG + 20] f [SacreBLEU signature:
BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.2.20]
<<FIGURE>>
Figure 3.4: Few-shot translation performance on 6 language pairs as model capacity increases. There is a consistent
trend of improvement across all datasets as the model scales, as well as a tendency for translation into English to be
stronger than translation from English.
<<TABLE>>
Table 3.5: Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section
4 for details on potential contamination of the Winograd test set. a [SBBC19] b [LYN + 20]
<<FIGURE>>
Figure 3.5: Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales.
Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B
is competitive with a fine-tuned RoBERTa-large.
GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior
unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the
three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into
English but under-performs when translating in the other direction. Performance on En-Ro is a noticeable outlier at
over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE
tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En,
few-shot GPT-3 outperforms the best supervised result we could find, but due to our unfamiliarity with the literature and
the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art.
For Ro-En, few-shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of
unsupervised pretraining, supervised finetuning on 608K labeled examples, and backtranslation [LHCG19b].
Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of
improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three
settings is shown in Appendix H.
3.4 Winograd-Style Tasks
The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun
refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned
language models have achieved near-human performance on the original Winograd dataset, but more difficult versions
such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test
GPT-3's performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.
<<TABLE>>
Table 3.6: GPT-3 results on three commonsense reasoning tasks, PIQA, ARC, and OpenBookQA. GPT-3 Few-Shot
PIQA result is evaluated on the test server. See Section 4 for details on potential contamination issues on the PIQA test
set.
<<FIGURE>>
Figure 3.6: GPT-3 results on PIQA in the zero-shot, one-shot, and few-shot settings. The largest model achieves a
score on the development set in all three conditions that exceeds the best recorded score on the task.
On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method
described in [RWC + 19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which
is presented as binary classification and requires entity extraction to convert to the form described in this section. On
Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear
in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human
performance. We note that contamination analysis found some Winograd schemas in the training data but this appears
to have only a small effect on results (see Section 4).
On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the
zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned
RoBERTa model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and
human performance on the task as reported by [SBBC19] is 94.0%.
3.5 Common Sense Reasoning
Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence
completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB + 19],
asks common sense questions about how the physical world works and is intended as a probe of grounded understanding
of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot
(the last measured on PIQA's test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a
fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human
performance, but GPT-3's few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis
flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark
the result with an asterisk. See Section 4 for details.
<<TABLE>>
Table 3.7: Results on reading comprehension tasks. All scores are F1 except results for RACE which report accuracy.
a [JZC + 19] b [JN20] c [AI19] d [QIA20] e [SPP + 19]
ARC [CCE + 18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the
“Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval
methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot
setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline
(55.9%) from UnifiedQA [KKS + 20]. On the “Easy” version of the dataset (questions which either of the mentioned
baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned
RoBERTa baseline from [KKS + 20]. However, both of these results are still much worse than the overall SOTAs
achieved by UnifiedQA, which exceeds GPT-3's few-shot results by 27% on the challenge set and 22% on the easy
set.
On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points
short of the overall SOTA. GPT-3's few-shot performance is similar to a fine-tuned BERT Large baseline on the
leaderboard.
Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and
inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant
improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.
3.6 Reading Comprehension
Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive,
multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread
in GPT-3's performance across these datasets suggestive of varying capability with different answer formats. In general
we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each
respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset
and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI + 18] a dataset which requires modeling structured
dialog acts and answer span selections of teacher-student interactions. On DROP [DWD + 19], a dataset testing discrete
reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned
BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches
which augment neural networks with symbolic systems [RLL + 19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its
few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to
slightly outperform the best fine-tuned result in the original paper. On RACE [LXL + 17], a multiple choice dataset of
middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with
the earliest work utilizing contextual representations and is still 45% behind SOTA.
3.7 SuperGLUE
In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a
more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark
[WPN + 19]. GPT-3's test-set performance on the SuperGLUE dataset [WPN + 19] is shown in Table 3.8. In the few-shot
setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and
<<FIGURE>>
Figure 3.7:GPT-3 results on CoQA reading comprehension task. GPT-3 175B achieves 85 F1 in the few-shot setting,
only a few points behind measured human performance and state-of-the-art fine-tuned models. Zero-shot and one-shot
performance is a few points behind, with the gains to few-shot being largest for bigger models.
<<TABLE>>
Table 3.8: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported
on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient
updates.
<<FIGURE>>
Figure 3.8: Performance on SuperGLUE increases with model size and number of examples in context. A value
of K = 32 means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in
SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference
lines (our test set results are in Table 3.8). The BERT-Large reference model was fine-tuned on the SuperGLUE training
set (125K examples), whereas BERT++ was first fine-tuned on MultiNLI (392K examples) and SWAG (113K examples)
before further fine-tuning on the SuperGLUE training set (for a total of 630K fine-tuning examples). We find the
difference in performance between the BERT-Large and BERT++ to be roughly equivalent to the difference between
GPT-3 with one example per context versus eight examples per context.
MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used
the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.
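The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `build_contexts`, the `fixed` flag, the seed, and the placeholder data are all assumptions introduced here.

```python
import random

def build_contexts(train_set, problems, k=32, fixed=False, seed=0):
    """Return one list of k in-context examples per evaluation problem.

    fixed=False: draw a fresh random sample for every problem.
    fixed=True:  draw once and reuse it for all problems
                 (the scheme described for WSC and MultiRC).
    """
    rng = random.Random(seed)
    if fixed:
        shared = rng.sample(train_set, k)
        return [shared for _ in problems]
    return [rng.sample(train_set, k) for _ in problems]

# Placeholder data, purely for illustration.
train = [f"train-example-{i}" for i in range(1000)]
problems = ["p0", "p1", "p2"]

per_problem = build_contexts(train, problems, k=32)
shared = build_contexts(train, problems, k=32, fixed=True)
```

With `fixed=True` every problem sees the same 32 examples, which removes per-problem sampling variance at the cost of tying all results to a single draw.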
We observe a wide range in GPT-3's performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA
performance in the one-shot and few-shot settings, with COPA falling only a couple of points short and achieving
second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC,
performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the
original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable,
roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting.
WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different
phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two
sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer
in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot
setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same
way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another.
This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these
weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to
the state-of-the-art held by a fine-tuned 11 billion parameter model.
Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and number of
examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32
examples per task, after which point additional examples will not reliably fit into our context. When sweeping over
values of K, we find that GPT-3 requires fewer than eight total examples per task to outperform a fine-tuned BERT-Large
on overall SuperGLUE score.
3.8 NLI
Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences.
In practice, this task is usually structured as a two or three class classification problem where the model classifies
<<FIGURE>>
Figure 3.9: Performance of GPT-3 on ANLI Round 3. Results are on the dev-set, which has only 1500 examples
and therefore has high variance (we estimate a standard deviation of 1.2%). We find that smaller models hover around
random chance, while few-shot GPT-3 175B closes almost half the gap from random chance to SOTA. Results for
ANLI rounds 1 and 2 are shown in the appendix.
whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral).
SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest
version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting
GPT-3 performs similarly to a single-task fine-tuned BERT-Large. We also evaluate on the recently introduced
Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of
adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our
models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (33%),
whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results
for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult
task for language models and they are only just beginning to show signs of progress.
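The caption of Figure 3.9 estimates a standard deviation of 1.2% for accuracy on the 1500-example ANLI R3 dev set. One way to arrive at a figure of that size, assuming (this is our assumption, not a statement of how the estimate was made) that each example is an independent Bernoulli trial with success probability near three-way chance, is the binomial standard error:

```python
import math

def accuracy_std(p, n):
    """Standard deviation of an accuracy estimate from n independent trials,
    each correct with probability p (binomial standard error)."""
    return math.sqrt(p * (1.0 - p) / n)

# Three-way classification at chance (p = 1/3) on 1,500 dev examples.
std_at_chance = accuracy_std(1.0 / 3.0, 1500)
print(f"{100 * std_at_chance:.1f}%")  # prints 1.2%
```

The value changes only slightly for accuracies in the 33% to 40% range, so differences of one or two points between model sizes on this dev set should be read with caution.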
3.9 Synthetic and Qualitative Tasks
One way to probe GPT-3's range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which
require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have
occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we
test GPT-3's ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the
letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3's ability to
solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new
words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets
with the hope of stimulating further study of test-time behavior of language models.
3.9.1 Arithmetic
To test GPT-3's ability to perform simple arithmetic operations without task-specific training, we developed a small
battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:
• 2 digit addition (2D+): The model is asked to add two integers sampled uniformly from [0, 100), phrased in
the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”
• 2 digit subtraction (2D-): The model is asked to subtract two integers sampled uniformly from [0, 100); the
answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.
• 3 digit addition (3D+): Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).
<<FIGURE>>
Figure 3.10: Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a
significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175B), with the latter achieving
reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction
of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot
are shown in the appendix.
• 3 digit subtraction (3D-): Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).
• 4 digit addition (4D+): Same as 3 digit addition, except uniformly sampled from [0, 10000).
• 4 digit subtraction (4D-): Same as 3 digit subtraction, except uniformly sampled from [0, 10000).
• 5 digit addition (5D+): Same as 3 digit addition, except uniformly sampled from [0, 100000).
• 5 digit subtraction (5D-): Same as 3 digit subtraction, except uniformly sampled from [0, 100000).
• 2 digit multiplication (2Dx): The model is asked to multiply two integers sampled uniformly from [0, 100),
e.g. “Q: What is 24 times 42? A: 1008”.
• One-digit composite (1DC): The model is asked to perform a composite operation on three 1 digit numbers,
with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers
are selected uniformly on [0, 10) and the operations are selected uniformly from {+, -, *}.
In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random
instances of the task and evaluate all models on those instances.
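The task definitions above can be sketched as a small generator. This is an illustrative reconstruction under stated assumptions: the function names, the task-code parsing, and the whitespace handling in `exact_match` are ours, and the released datasets may differ in formatting details.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
WORDS = {"+": "plus", "-": "minus", "x": "times"}

def make_problem(task, rng):
    """Return (prompt, answer) for one instance of a task code such as
    '2D+', '3D-', '2Dx', or '1DC'."""
    if task == "1DC":
        a, b, c = (rng.randrange(10) for _ in range(3))
        op1, op2 = rng.choice("+-*"), rng.choice("+-*")
        # Parentheses go around the last two operands.
        answer = OPS[op1](a, OPS[op2](b, c))
        return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(answer)
    digits, op = int(task[0]), task[2]
    a = rng.randrange(10 ** digits)  # uniform on [0, 10**digits)
    b = rng.randrange(10 ** digits)
    answer = OPS["*" if op == "x" else op](a, b)
    return f"Q: What is {a} {WORDS[op]} {b}? A:", str(answer)

def exact_match(generated, answer):
    """The model must produce the answer exactly (whitespace is ignored here)."""
    return generated.strip() == answer

rng = random.Random(0)
dataset = [make_problem("2D-", rng) for _ in range(2000)]
```

Scoring by exact match means a single wrong digit counts as a failure, which is why accuracy falls off quickly as the number of digits grows.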
First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction,
GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition,
98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the
number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on
five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves
29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves
21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness
beyond just single operations.
As Figure 3.10 makes clear, small models do poorly on all of these tasks: even the 13 billion parameter model (the
second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all
other operations less than 10% of the time.
One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation
to the task (or at the very least recognition of the task) is important to performing these computations correctly.
Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly
<<TABLE>>
Table 3.9: Results on basic arithmetic tasks for GPT-3 175B. {2,3,4,5}D{+,-} is 2, 3, 4, and 5 digit addition or
subtraction, 2Dx is 2 digit multiplication. 1DC is 1 digit composite operations. Results become progressively stronger
moving from the zero-shot to one-shot to few-shot setting, but even the zero-shot shows significant arithmetic abilities.
<<TABLE>>
Table 3.10: GPT-3 175B performance on various word unscrambling and word manipulation tasks, in zero-, one-, and
few-shot settings. CL is “cycle letters in word”, A1 is anagrams of all but the first and last letters, A2 is anagrams of all but
the first and last two letters, RI is “Random insertion in word”, RW is “reversed words”.
outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and
model capacity scaling for all three settings is shown in Appendix H.
To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic
problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and
"<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000
subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers
could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes
such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than
memorizing a table.
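The spot-check described above amounts to a literal string search for two surface forms of each problem. A minimal sketch, assuming the training data can be treated as one searchable string (the real corpus would need an index), with hypothetical function and variable names:

```python
def count_memorized(problems, corpus, symbol, word):
    """Count problems whose symbolic form ('a + b =') or spelled-out form
    ('a plus b') occurs verbatim in the corpus text."""
    hits = 0
    for a, b in problems:
        forms = (f"{a} {symbol} {b} =", f"{a} {word} {b}")
        # Plain substring search: '23 + 456 =' would also match inside
        # '123 + 456 =', so this can slightly over-count.
        if any(form in corpus for form in forms):
            hits += 1
    return hits

# Toy stand-ins for the training text and the test problems.
corpus = "... we know that 123 + 456 = 579 and that 700 plus 200 is 900 ..."
problems = [(123, 456), (700, 200), (111, 222)]
matches = count_memorized(problems, corpus, "+", "plus")
```

In this toy example two of the three problems are found. A search of this kind only bounds verbatim memorization; it would miss problems that appear in other phrasings.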
Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even
zero-shot settings.
3.9.2 Word Scrambling and Manipulation Tasks
To test GPT-3's ability to learn novel symbolic manipulations from a few examples, we designed a small battery of
5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of
scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:
• Cycle letters in word (CL): The model is given a word with its letters cycled, then the “=” symbol, and
is expected to generate the original word. For example, it might be given “lyinevitab” and should output
“inevitably”.
• Anagrams of all but first and last characters (A1): The model is given a word where every letter except
the first and last has been scrambled randomly, and must output the original word. Example: criroptuon =
corruption.
• Anagrams of all but first and last 2 characters (A2): The model is given a word where every letter except
the first 2 and last 2 has been scrambled randomly, and must recover the original word. Example: opoepnnt
→ opponent.
• Random insertion in word (RI): A random punctuation or space character is inserted between each letter
of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
• Reversed words (RW): The model is given a word spelled backwards, and must output the original word.
Example: stcejbo → objects.
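The five distortions listed above can be sketched as follows. This is an illustration under assumptions: the separator set for random insertion is a guess based on the example (the text does not list it), and the function names are ours.

```python
import random

SEPARATORS = ".!/ "  # assumed punctuation/space set; not specified in the text

def cycle_letters(word, rng):
    """CL: rotate the word by a random non-zero offset."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_inner(word, keep, rng):
    """A1 (keep=1) / A2 (keep=2): shuffle all letters except the first
    and last `keep` characters."""
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]

def random_insertion(word, rng):
    """RI: insert one random separator between each pair of adjacent letters."""
    return "".join(ch + rng.choice(SEPARATORS) for ch in word[:-1]) + word[-1]

def reverse_word(word):
    """RW: spell the word backwards."""
    return word[::-1]

rng = random.Random(0)
word = "succession"
examples = {
    "CL": cycle_letters(word, rng),
    "A1": anagram_inner(word, 1, rng),
    "A2": anagram_inner(word, 2, rng),
    "RI": random_insertion(word, rng),
    "RW": reverse_word(word),
}
```

Note that a shuffle can occasionally return the letters in their original order, so a real dataset builder might want to resample in that case.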
For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by
[Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11.
Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing
<<FIGURE>>
Figure 3.11: Few-shot performance on the five word scrambling tasks for different sizes of model. There is generally
smooth improvement with model size although the random insertion task shows an upward slope of improvement with
the 175B model solving the task the majority of the time. Scaling of one-shot and zero-shot performance is shown in
the appendix. All tasks are done with K=100.
random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram
task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.
In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the
model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these
tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear
in the pre-training data (although we cannot confirm this with certainty).
We can further quantify performance by plotting “in-context learning curves”, which show task performance as a
function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task
in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information,
including both task examples and natural language task descriptions.
Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding
operates on significant fractions of a word (on average 0.7 words per token), so from the LM's perspective succeeding
at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also,
CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word),
requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require
non-trivial pattern-matching and computation.
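To make the non-bijectivity point concrete, the sketch below enumerates every vocabulary word consistent with a scrambled input. The tiny vocabulary and the function names are hypothetical; the point is only that some inputs admit more than one valid answer, so inverting the distortion requires a search against known words.

```python
def uncycle_candidates(scrambled, vocabulary):
    """CL: every rotation of the input that is a known word."""
    rotations = {scrambled[k:] + scrambled[:k] for k in range(len(scrambled))}
    return sorted(rotations & vocabulary)

def unanagram_candidates(scrambled, vocabulary, keep):
    """A1 (keep=1) / A2 (keep=2): known words with the same fixed ends
    and the same multiset of inner letters."""
    def key(w):
        return (w[:keep], "".join(sorted(w[keep:len(w) - keep])), w[len(w) - keep:])
    target = key(scrambled)
    return sorted(w for w in vocabulary
                  if len(w) == len(scrambled) and key(w) == target)

# Hypothetical toy vocabulary.
vocab = {"inevitably", "corruption", "opponent", "stale", "tales"}

unique = uncycle_candidates("lyinevitab", vocab)    # ['inevitably']
ambiguous = uncycle_candidates("alest", vocab)      # ['stale', 'tales']
anagram = unanagram_candidates("criroptuon", vocab, keep=1)  # ['corruption']
```

Here "alest" is a rotation of both "stale" and "tales", so the scrambled form alone does not determine the original word.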
3.9.3 SAT Analogies
To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of
374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of
the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to
hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to
temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original
word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the
few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among
college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with
scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.
<<FIGURE>>
Figure 3.12: Zero-, one-, and few-shot performance on SAT analogy tasks, for different sizes of model. The largest
model achieves 65% accuracy in the few-shot setting, and also demonstrates significant gains to in-context learning
which are not present in smaller models.
3.9.4 News Article Generation
Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by
conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news
story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles,
so trying to generate news articles via raw unconditional samples is less effective: for example, GPT-3 often interprets
the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To
solve this problem we employed GPT-3's few-shot learning abilities by providing three previous news articles in the
model's context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably
generate short articles in the “news” genre.
To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional
sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles
from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative
language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to
distinguish the two is a potentially important measure of quality.
In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles
from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles
from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each
model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed
by either the human written article or the article generated by the model (footnote 4). Participants were asked to select whether the
article was “very likely written by a human”, “more likely written by a human”, “I don't know”, “more likely written by
a machine”, or “very likely written by a machine”.
The articles we selected were not in the model's training data and the model outputs were formatted and selected
programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were
pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model.
However, we also ran an experiment to control for participant effort and attention that followed the same format but
involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a
160M parameter model with no context and increased output randomness.
3 This task is also relevant to the potential misuse of language models discussed in Section 6.1.
4 We wanted to identify how good an average person on the internet is at detecting language model outputs, so we focused on
participants drawn from the general US population. See Appendix E for details.
<<TABLE>>
Table 3.11: Human accuracy in identifying whether short (200 word) news articles are model generated. We
find that human accuracy (measured by the ratio of correct assignments to non-neutral assignments) ranges from 86%
on the control model to 52% on GPT-3 175B. This table compares mean accuracy between five different models, and
shows the results of a two-sample T-Test for the difference in mean accuracy between each model and the control model
(an unconditional GPT-3 Small model with increased output randomness).
Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that
the intentionally bad articles were model generated was 86%, where 50% is chance level performance. By contrast,
mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance
at 52% (see Table 3.11; footnote 5). Human abilities to detect model generated text appear to decrease as model size increases:
there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance (footnote 6).
This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
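The accuracy metric used here, the ratio of correct assignments to non-neutral assignments per participant, can be sketched as below. The answer labels are shortened versions of the five quiz options and the function name is ours; how a participant with only neutral answers was handled is not stated in the text, so returning `None` is an assumption.

```python
HUMAN = {"very likely human", "more likely human"}
MACHINE = {"very likely machine", "more likely machine"}

def participant_accuracy(responses):
    """responses: list of (answer, is_model_generated) pairs for one participant.

    Neutral answers ("I don't know") are excluded from the denominator.
    Returns None when the participant gave no non-neutral answer."""
    correct = 0
    non_neutral = 0
    for answer, is_model in responses:
        if answer in HUMAN or answer in MACHINE:
            non_neutral += 1
            if (answer in MACHINE) == is_model:
                correct += 1
    return correct / non_neutral if non_neutral else None

responses = [
    ("more likely machine", True),   # correct
    ("very likely human", True),     # wrong
    ("don't know", False),           # neutral, ignored
    ("more likely human", False),    # correct
]
accuracy = participant_accuracy(responses)  # 2 correct out of 3 non-neutral
```

Because neutral answers are dropped, a cautious participant who answers "I don't know" often is scored on fewer items rather than penalized.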
Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15 (footnote 7). Much of the text is, as indicated by the
evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator
that an article is model generated since, unlike human authors, the models have no access to the specific facts that the
article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual
phrasings, though these are often subtle enough that they are not noticed.
Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like
GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human
evaluators. Automatic detection of these models may be a promising area of future research.
Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe
more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated
by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated
completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial
experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to
compare human abilities to detect the articles generated by GPT-3 and a control model.
We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was
88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely
above chance at 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3
continues to produce articles that humans find difficult to distinguish from human written news articles.
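Tables 3.11 and 3.12 report a two-sample Student's T-Test comparing per-participant accuracies for each model against the control model. A minimal sketch of the pooled-variance t-statistic is given below; the sample values are made up for illustration, and the p-value (which needs the t-distribution CDF) is left out.

```python
import math
import statistics

def student_t(sample_a, sample_b):
    """Two-sample Student's t-statistic with pooled variance
    (assumes both groups share a common variance)."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a)  # unbiased sample variance
    vb = statistics.variance(sample_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    standard_error = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error

# Made-up per-participant accuracies, for illustration only.
control_acc = [0.90, 0.80, 0.85, 0.95]
model_acc = [0.50, 0.55, 0.45, 0.60]
t_stat = student_t(control_acc, model_acc)
```

In practice a library routine such as SciPy's `scipy.stats.ttest_ind`, which by default also assumes equal variances, returns both the statistic and the p-value.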
3.9.5 Learning and Using Novel Words
A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a
word in a sentence after seeing it defined only once, or conversely inferring a word's meaning from only one usage. Here
we qualitatively test GPT-3's ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word,
such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate)
5 We use a two-sample Student's T-Test to test for significant difference between the means of the participant accuracies of each
model and the control model and report the normalized difference in the means (as the t-statistic) and the p-value.
6 If a model consistently produces texts that are more impressive than human articles, it is possible that human performance on
this task would drop below 50%. Indeed, many individual participants scored below 50% on this task.
7 Additional non-news samples can be found in Appendix F.
<<FIGURE>>
Figure 3.13: People's ability to identify whether news articles are model-generated (measured by the ratio of correct
assignments to non-neutral assignments) decreases as model size increases. Accuracy on the outputs of the deliberately
bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated with the dashed
line at the top, and the random chance (50%) is indicated with the dashed line at the bottom. Line of best fit is a power
law with 95% confidence intervals.
<<TABLE>>
Table 3.12: People's ability to identify whether 500 word articles are model generated (as measured by the ratio of
correct assignments to non-neutral assignments) was 88% on the control model and 52% on GPT-3 175B. This table
shows the results of a two-sample T-Test for the difference in mean accuracy between GPT-3 175B and the control
model (an unconditional GPT-3 Small model with increased output randomness).
<<FIGURE>>
Figure 3.14: The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human
written article (accuracy: 12%).
<<FIGURE>>
Figure 3.15: The GPT-3 generated news article that humans found the easiest to distinguish from a human written
article (accuracy: 61%).
<<FIGURE>>
Figure 3.16: Representative GPT-3 completions for the few-shot task of using a new word in a sentence. Boldface is
GPT-3's completions, plain text is human prompts. In the first example both the prompt and the completion are provided
by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional
prompts and provides the completions. Nothing task-specific is provided to GPT-3 other than the conditioning shown
here.
nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the
broad task and one-shot in terms of the specific word. Figure 3.16 shows the 6 examples we generated; all definitions
were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were
generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try
any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final
sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of
the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy
sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.
3.9.6 Correcting English Grammar
Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot
setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output:
<sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any
omissions or repeats). Results are shown in Figure 3.17.
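The prompt format above can be assembled as in the following sketch. The example sentences, the blank line between examples, and the function name are assumptions made for illustration; only the "Poor English Input" / "Good English Output" framing comes from the text.

```python
def build_prompt(solved_examples, query):
    """Assemble a few-shot grammar-correction prompt.

    solved_examples: list of (poor_sentence, good_sentence) pairs used as conditioning.
    query: the sentence the model should correct; its output slot is left open.
    """
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in solved_examples
    ]
    blocks.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n\n".join(blocks)

# Hypothetical sentences, not taken from Figure 3.17.
prompt = build_prompt(
    [("I eated the apple.", "I ate the apple.")],
    "She no went to the market.",
)
print(prompt)
```

Each model completion can then be appended to `solved_examples` before the next query, mirroring the way successive prompts build on earlier completions in the figure.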
4 Measuring and Preventing Memorization Of Benchmarks
Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our
benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research
without established best practices. While it is common practice to train large models without investigating contamination,
given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.
This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18]
detected and removed a training document which overlapped with one of their evaluation datasets. Other work such
as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that
<<FIGURE>>
Figure 3.17: Representative GPT-3 completions for the few-shot task of correcting English grammar. Boldface
is GPT-3's completions, plain text is human prompts. In the first few examples both the prompt and the
completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives
successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 aside from
the first few examples as conditioning and the “Poor English input/Good English output” framing. We note that the
distinction between “poor” and “good” English (and the terms themselves) is complex, contextual, and contested. As
the example mentioning the rental of a house shows, assumptions that the model makes about what “good” is can even
lead it to make errors (here, the model not only adjusts grammar, but also removes the word “cheap” in a way that alters
meaning).
<<FIGURE>>
Figure 4.1: GPT-3 Training Curves. We measure model performance during training on a deduplicated validation
split of our training distribution. Though there is some gap between training and validation performance, the gap grows
only minimally with model size and training time, suggesting that most of the gap comes from a difference in difficulty
rather than overfitting.
although models did perform moderately better on data that overlapped between training and testing, this did not
significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of
magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential
for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B
does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was
deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as
large as feared.
We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap
between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a
bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn't
feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts
results.
For each benchmark, we produce a clean version which removes all potentially leaked examples, defined roughly as
examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when
it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination,
so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in
Appendix C.
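The flagging rule described above can be sketched as follows. This is a rough illustration, not the procedure from Appendix C: lowercasing, whitespace tokenization, and the function names are assumptions, and a real pipeline would index the training n-grams once rather than rescanning every document.

```python
def ngram_set(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_flagged(example, training_docs, n=13):
    """Flag an example if any of its n-grams occurs in a training document.

    Examples shorter than n tokens are flagged only when the whole
    example occurs contiguously in a training document."""
    tokens = example.lower().split()
    size = min(n, len(tokens))
    if size == 0:
        return False
    grams = ngram_set(tokens, size)
    for doc in training_docs:
        if grams & ngram_set(doc.lower().split(), size):
            return True
    return False

# Toy training document of 20 distinct tokens: w0 w1 ... w19.
train_docs = [" ".join(f"w{i}" for i in range(20))]
shares_13 = "x " + " ".join(f"w{i}" for i in range(3, 16)) + " y"
shares_12 = "x " + " ".join(f"w{i}" for i in range(3, 15)) + " y"

flags = [is_flagged(e, train_docs) for e in (shares_13, shares_12, "w5 w6 w7", "w5 w7")]
```

The first and third examples are flagged (a shared 13-gram, and a short example contained whole), while the other two are not. Because any single shared 13-gram triggers a flag, the rule errs on the side of marking too much as contaminated.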
We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean
subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a
significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be
inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a
quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence
that contamination level and performance difference are correlated. We conclude that either our conservative method
substantially overestimated contamination or that contamination has little effect on performance.
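The comparison between the full benchmark and its clean subset can be sketched as follows. This is an illustration under stated assumptions: the metric is taken to be plain accuracy, and the function name `contamination_report` and its output keys are hypothetical.

```python
from typing import Dict, List, Optional


def contamination_report(correct: List[bool], flagged: List[bool]) -> Dict[str, Optional[float]]:
    """correct[i]: the model answered example i correctly; flagged[i]: example i was flagged."""
    if len(correct) != len(flagged) or not correct:
        raise ValueError("need two non-empty lists of equal length")
    clean = [c for c, f in zip(correct, flagged) if not f]
    score_all = sum(correct) / len(correct)
    # With every example flagged there is no clean subset to score.
    score_clean = sum(clean) / len(clean) if clean else None
    return {
        "fraction_flagged": sum(flagged) / len(flagged),
        "score_all": score_all,
        "score_clean": score_clean,
        "difference": None if score_clean is None else score_clean - score_all,
    }
```

A clearly negative `difference` is the case in which contamination may be inflating the reported score; a value near zero corresponds to the negligible changes seen on most benchmarks.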
Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on
the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference
difficult.
<<FIGURE>>
Figure 4.2: Benchmark contamination analysis. We constructed cleaned versions of each of our benchmarks to
check for potential contamination in our training set. The x-axis is a conservative lower bound for how much of the
dataset is known with high confidence to be clean, and the y-axis shows the difference in performance when evaluating
only on the verified clean subset. Performance on most benchmarks changed negligibly, but some were flagged for
further review. On inspection we find some evidence for contamination of the PIQA and Winograd results, and we mark
the corresponding results in Section 3 with an asterisk. We find no evidence that other benchmarks are affected.
Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension
(QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English
translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false
positives. We summarize the results for each group of tasks below:
• Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and
DROP as potentially contaminated, a fraction so large that even measuring the differential on a clean subset was difficult.
Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source
text was present in our training data but the question/answer pairs were not, meaning the model gains only
background information and cannot memorize the answer to a specific question.
• German translation: We found 25% of the examples in the WMT16 German-English test set were marked
as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the
flagged examples contain paired sentences resembling NMT training data, and collisions were monolingual
matches, mostly of snippets of events discussed in the news.
• Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the
short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged
overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set,
but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small,
but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the
symbol insertion task shows high overlap but no effect on performance; this is because that task involves
removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to
many spurious matches.
• PIQA: The overlap analysis flagged 29% of examples as contaminated, and we observed a 3 percentage point
absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was
released after our training set was created and its labels are hidden, some of the web pages used by the
crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller
model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias
rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot
rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential
contamination.
• Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the
clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in
fact present in our training set, though presented in a different format than we present the task to the model.
Although the decrease in performance is small, we mark our Winograd results in the main paper with an
asterisk.
• Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the
Children's Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably
extract a clean subset here, we do not report results on these datasets, even though we intended to when starting
this work. We note that Penn Tree Bank, due to its age, was unaffected and therefore became our chief language
modeling benchmark.
We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply
to verify how much actual contamination existed. These appeared to often contain false positives. They had either
no actual contamination, or had contamination that did not give away the answer to the task. One notable exception
was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very
small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format
precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this
paper, the potential contamination is noted in the results section.
An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the
same distribution as the original dataset. It remains possible that memorization inflates results but at the same time
is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number
of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small
models, which are unlikely to be memorizing.
Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright
remove problematic results, depending on the severity. Much work remains to be done to address this important and
subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed
explanation of our analysis, we refer the reader to Appendix C.
5 Limitations
GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for
future work.
First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct
predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although
the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to
lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences
or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of
GPT-3's limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed
informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some
datasets (such as PIQA [BZB+19]) that test this domain. Specifically, GPT-3 has difficulty with questions of the type
“If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3's in-context learning performance has some notable
gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when
evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same
way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading
comprehension tasks. This is especially striking given GPT-3's strong few-shot performance on many other tasks.
GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused
on exploring in-context learning behavior in autoregressive language models because it is straightforward to both
sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional
architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent
literature, which has documented improved fine-tuning performance when using these approaches over standard
language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks
which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back
and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then
generating a very short answer. This could be a possible explanation for GPT-3's lagging few-shot performance on a
few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves
comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and
RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning
than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with
few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.
A more fundamental limitation of the general approach described in this paper (scaling up any LM-like model, whether
autoregressive or bidirectional) is that it may eventually run into (or could already be running into) the limits of the
pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to
predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also,
with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas
ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed
actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains
of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world
[BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a
different approach is likely to be necessary. Promising future directions in this vein might include learning the objective
function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as
images to provide grounding and a better model of the world [CLY+19].
Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3
takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more
text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is
an important direction for future work, and might come from grounding in the physical world to provide additional
information, or from algorithmic improvements.
A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot
learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it
has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that
are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format,
to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on
this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words
seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although
possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what
humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training
and identifying them at test time would be an advance for language models, but nevertheless understanding precisely
how few-shot learning works is an important unexplored direction for future research.
A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are
both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of
models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large
models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills,
most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible.
Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters;
new challenges and opportunities may be associated with applying it to models of this size.
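As a rough illustration of the distillation idea cited above, the soft-target objective can be sketched in a few lines. This is a minimal sketch, not a recipe for a model of this scale: the function names, the temperature of 2.0 and the toy logits are assumptions, and practical setups usually combine this term with a loss on the true labels.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    # Temperature-scaled softmax; subtracting the max keeps exp() numerically stable.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    # Cross-entropy of the student against the teacher's softened distribution.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is smallest when the student reproduces the teacher's softened distribution, so minimizing it pushes the smaller model toward the larger model's behavior on the chosen task.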
Finally, GPT-3 shares some limitations common to most deep learning systems: its decisions are not easily interpretable,
it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in
performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This
last issue (biases in the data that may lead the model to generate stereotyped or prejudiced content) is of special
concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts
(Section 6).
6 Broader Impacts
Language models have a wide range of beneficial applications for society, including code and writing auto-completion,
grammar assistance, game narrative generation, improving search engine responses, and answering questions. But
they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over
smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the
potential to advance both the beneficial and harmful applications of language models.
Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily
greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this
are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in
Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly
discuss issues of energy efficiency (Section 6.3).
6.1 Misuse of Language Models
Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing
language models in a very different environment or for a different purpose than researchers intended. To help with this,
we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying
threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact
[Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.
6.1.1 Potential Misuse Applications
Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples
include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing
and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high
quality text. Language models that produce high quality text generation could lower existing barriers to carrying out
these activities and increase their efficacy.
The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to
generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in
Section 3.9.4 represents a concerning milestone in this regard.
6.1.2 Threat Actor Analysis
Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors
who may be able to build a malicious product to advanced persistent threats (APTs): highly skilled and well-resourced
(e.g. state-sponsored) groups with long-term agendas [SBC+19].
To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat
groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did
find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances
of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated
with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is
not immediate, but significant improvements in reliability could change this.
Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about
possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible
difference in operations that may see potential gains by using language models. The assessment was that language
models may not be worth investing significant resources in because there has been no convincing demonstration that
current language models are significantly better than current methods for generating text, and because methods for
“targeting” or “controlling” the content of language models are still at a very early stage.
6.1.3 External Incentive Structures
Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their
agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular
among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login
credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.
Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs.
The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k
truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot
produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the
amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts
how scalable the operation can be.
Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will
eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to
malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on
this through a combination of mitigation research, prototyping, and coordinating with other technical developers.
6.2 Fairness, Bias, and Representation
Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning,
since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and
producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in
the model in order to better understand GPT-3's limitations when it comes to fairness, bias, and representation. 8
Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and
behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely
present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model's
biases even within the studied categories.
Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes
present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race,
and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how
they are different in this dimension.
6.2.1 Gender
In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found
that occupations in general have a higher probability of being followed by a male gender identifier than a female one
(in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant).
83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured
this by feeding the model a context such as "The detective was a" and then looking at the probability of the
model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.).
In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus
were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and
sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist,
housekeeper etc.
We also tested how these probabilities changed when we shifted the context to be "The competent {occupation}
was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a"
(Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent
{occupation} was a," the majority of occupations had an even higher probability of being followed by a
male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was
a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male
with a probability similar to that for our original neutral prompt. The average occupation bias, measured as
<<FORMULA>>, was <<FORMULA>> for the Neutral Variant, <<FORMULA>> for the Competent Variant and <<FORMULA>>
for the Incompetent Variant.
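The kind of aggregate described above can be sketched as follows. The exact formula is elided in this text, so the sketch assumes one natural choice, the mean over occupations of the log-ratio of the female to the male identifier probability; the function names and all probabilities below are hypothetical and are not results from the model.

```python
import math
from typing import Dict, Tuple

# occupation -> (P(female identifier | context), P(male identifier | context))
# Made-up values for illustration only.
TOY_PROBS: Dict[str, Tuple[float, float]] = {
    "detective": (0.10, 0.40),
    "nurse": (0.40, 0.20),
    "banker": (0.05, 0.40),
}


def average_occupation_bias(probs: Dict[str, Tuple[float, float]]) -> float:
    # Assumed measure: mean log(P(female) / P(male)); negative values lean male.
    ratios = [math.log(p_f / p_m) for p_f, p_m in probs.values()]
    return sum(ratios) / len(ratios)


def fraction_male_leaning(probs: Dict[str, Tuple[float, float]]) -> float:
    # Share of occupations more likely to be followed by a male identifier.
    return sum(1 for p_f, p_m in probs.values() if p_m > p_f) / len(probs)
```

Under this assumed measure, a more negative average for the Competent Variant prompts than for the Neutral Variant prompts would correspond to the stronger male lean reported in the text.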
We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further
corroborated the model's tendency to associate most occupations with males. One method measured the model's
ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model
a context such as "The advisor met with the advisee because she wanted to get advice about job
applications. 'She' refers to the" and found the option with the lowest probability between the two possible
options (Choices between Occupation Option: advisor; Participant Option: advisee).
Occupation and participant words often have societal biases associated with them such as the assumption that most
occupants are by default male. We found that the language models learnt some of these biases such as a tendency to
associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of
all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences
where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All
other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns
with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers
some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger
models are more robust than smaller models.
We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other
pre-selected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature
of 1 and top-p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She
was very", "He would be described as", "She would be described as" 9. We looked at the adjectives and
adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more
often described using appearance oriented words such as “beautiful” and “gorgeous” as compared to men who were
more often described using adjectives that span a greater spectrum.
8 Evaluating fairness, bias, and representation in language models is a rapidly-developing area with a large body of prior work.
See, for example, [HZJ+19, NBR20, SCNP19].
Table 6.1: Most Biased Descriptive Words in 175B Model
<<TABLE>>
Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each
word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a
category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective,
we have also included the average for the number of co-occurrences across all qualifying words for each gender.
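The notion of "most favored" words can be made concrete with a small sketch. It assumes whitespace tokenization, add-one smoothing and a ratio of relative frequencies as the skew measure, and it omits the POS-tagging step; none of these choices, nor the function names, are taken from the paper.

```python
from collections import Counter
from typing import List, Tuple


def word_counts(completions: List[str]) -> Counter:
    counts: Counter = Counter()
    for text in completions:
        # Strip surrounding punctuation and lowercase each token.
        counts.update(tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    counts.pop("", None)  # drop tokens that were pure punctuation
    return counts


def most_favored(target: List[str], other: List[str],
                 top_n: int = 10, smoothing: float = 1.0) -> List[Tuple[str, int]]:
    """Words most skewed toward `target`, with their raw counts in `target`."""
    t, o = word_counts(target), word_counts(other)
    vocab_size = len(set(t) | set(o))
    t_total = sum(t.values()) + smoothing * vocab_size
    o_total = sum(o.values()) + smoothing * vocab_size

    def skew(word: str) -> float:
        # Ratio of smoothed relative frequencies between the two categories.
        return ((t[word] + smoothing) / t_total) / ((o[word] + smoothing) / o_total)

    ranked = sorted(t, key=lambda w: (-skew(w), w))
    return [(w, t[w]) for w in ranked[:top_n]]
```

Here `target` and `other` would be the completions generated from prompts for the two categories; returning the raw count alongside each word mirrors the layout described for Table 6.1.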
6.2.2 Race
To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very",
"The {race} woman was very" and "People would describe the {race} person as" and generated 800
samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White
or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that
language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored
how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred
disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive
words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid:
-87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).
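A minimal sketch of aggregating word-level sentiment is given below. It assumes that the per-category score is a plain average over the co-occurring words found in the lexicon, which the text does not specify; the toy lexicon holds only the example scores quoted above and is not the SentiWordNet interface.

```python
from typing import Dict, Iterable, Optional

# Toy lexicon built from the example scores quoted in the text (range -100 to 100).
TOY_LEXICON: Dict[str, float] = {
    "wonderfulness": 100.0,
    "amicable": 87.5,
    "wretched": -87.5,
    "horrid": -87.5,
    "sloping": 0.0,
    "chalet": 0.0,
}


def mean_sentiment(words: Iterable[str],
                   lexicon: Dict[str, float] = TOY_LEXICON) -> Optional[float]:
    # Average the scores of the words the lexicon covers; None if it covers none.
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None
```

Because the score depends only on which words co-occur, a category can receive a negative value simply from the topics that tend to surround it.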
It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that
focused on racial features; these results are not from the models talking about race in the wild but talking about race in
an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply
looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to
a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated
with a negative sentiment under this testing methodology.
Across the models we analyzed, Asian had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the
other hand, Black had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences
narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and
highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.
9 We only used male and female pronouns. This simplifying assumption makes it easier to study co-occurrence since it does not
require the isolation of instances in which "they" refers to a singular noun from those where it didn't, but other forms of gender bias
are likely present and could be studied using different approaches.
<<FIGURE>>
Figure 6.1: Racial Sentiment Across Models
<<TABLE>>
Table 6.2: Shows the ten most favored words about each religion in the GPT-3 175B model.
6.2.3 Religion
We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam,
and Judaism, by generating 800 model outputs of length 50 with a temperature of 1 and a top-p of 0.9 for every
prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each
of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a
corpus of such completions for studying co-occurrence of words.
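The sampling settings quoted above (temperature 1, top-p 0.9) refer to nucleus sampling, which can be sketched for a single decoding step as follows. This is a generic illustration operating on a list of logits, not the sampling code used for the experiments; the function names and the toy inputs are assumptions.

```python
import math
import random
from typing import Dict, List


def top_p_filter(probs: List[float], top_p: float = 0.9) -> Dict[int, float]:
    """Smallest set of most likely tokens whose total mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_token(logits: List[float], temperature: float = 1.0,
                 top_p: float = 0.9, rng=random) -> int:
    # Temperature-scaled softmax, then sample from the truncated distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    nucleus = top_p_filter([e / total for e in exps], top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]
```

Repeating `sample_token` until the desired length is reached would yield one completion; truncating the low-probability tail limits incoherent continuations while keeping the outputs stochastic.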
The following is an example output from the model:
"Buddhists are divided into two main branches - Theravada and Mahayana. Theravada
is the more conservative branch, centering on monastic life and the earliest sutras
and refusing to recognize the later Mahayana sutras as authentic."
Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect
how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such
as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such
as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in
the top 40 most favored words for Islam in GPT-3.
6.2.4 Future Bias and Fairness Challenges
We have presented this preliminary analysis to share some of the biases we found in order to motivate further research,
and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an
area of continuous research for us and are excited to discuss different methodological approaches with the community.
We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but
we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model
attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18].
Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this
is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large
language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for
building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for
these models. There is room for more research that engages with the literature outside NLP, better articulates normative
statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20].
Thus, mitigation work should not be approached purely with a metric driven objective to remove bias, as this has been
shown to have blind spots [GG19, NvNvdG19], but in a holistic manner.
6.3 Energy Usage
Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training GPT-3
175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days
for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such
models, as advocated by [SDSE19].
The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we
should consider not only the resources that go into training them, but how these resources are amortized over the
lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though
models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even
with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or
only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down
the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient
versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency
of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].
7 Related Work
Several lines of work have focused on increasing parameter count and/or computation in language models as a
means to improve generative or task performance. An early work scaled LSTM based language models to over a
billion parameters [JVS + 16]. One line of work straightforwardly increases the size of transformer models, scaling
up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size:
213 million parameters [VSP + 17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters
[RWC + 19], 8 billion parameters [SPP + 19], 11 billion parameters [RSR + 19], and most recently 17 billion parameters
[Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of
increasing models' capacity to store information without increased computational cost. These approaches rely on the
conditional computation framework [BLC13]; specifically, the mixture-of-experts method [SMM + 17] has been
used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19],
though only a small fraction of the parameters are actually used on each forward pass. A third approach increases
computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and
the universal transformer [DGV + 18]. Our work focuses on the first approach (scaling compute and parameters together,
by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ
this strategy.
Several efforts have also systematically studied the effect of scale on language model performance. [KMH + 20,
RRBS19, LWS + 20, HNA + 17] find a smooth power-law trend in loss as autoregressive language models are scaled up.
This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the
curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all)
downstream tasks across 3 orders of magnitude of scaling.
Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language
models that are as small as possible. This approach includes ALBERT [LCG + 19] as well as general [HVD15] and
task-specific [SDCW19,JYS + 19,KR16] approaches to distillation of language models. These architectures and
techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint
of giant models.
As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable
effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR + 19,
IBGC + 14,CCE + 18,MCKS18], reading comprehension [CHI + 18,RCM19], and adversarially constructed datasets
designed to be difficult for existing language models [SBBC19,NWD + 19]. In this work we test our models on many
of these datasets.
Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the
tasks we tested on. Recent efforts include [RSR + 19,RRS20], which fine-tuned an 11 billion parameter language model,
and [GLT + 20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on
in-context learning but could be combined in the future with those of [GLT + 20,LPP + 20].
Metalearning in language models has been utilized in [RWC + 19], though with much more limited results and no
systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it
structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including
matching networks [VBL + 16], RL2 [DSC + 16], learning to optimize [RL16,ADG + 16,LM17] and MAML [FAL17].
Our approach of stuffing the model's context with previous examples is most structurally similar to RL2 and also
resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model's activations
across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training)
updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time.
Few-shot auto-regressive density estimation was explored in [RCP + 17] and [GWC + 18] studied low-resource NMT as
a few-shot learning problem.
While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained
language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with
similar goals is semi-supervised learning where approaches such as UDA [XDH + 19] also explore methods of fine-tuning
when very little labeled data is available.
Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18]
and utilized for some tasks (such as summarizing) in a language model with [RWC + 19]. The notion of presenting
tasks in natural language was also explored in the text-to-text transformer [RSR + 19], although there it was applied for
multi-task fine-tuning rather than for in-context learning without weight updates.
Another approach to increasing generality and transfer-learning capability in language models is multi-task learning
[Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for
each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the
weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating
the weights for a new task. Multi-task learning has shown some promising initial results [LGH + 15,LSP + 18] and
multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed
the boundaries on certain tasks [KKS + 20], but is still limited by the need to manually curate collections of datasets and
set up training curricula. By contrast, pre-training at large enough scale appears to offer a “natural” broad distribution of
tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate
a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR + 17], human
interaction [ZSW + 19b], or active learning [Mac92].
Algorithmic innovation in language models over the last two years has been enormous, including denoising-based
bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG + 19,RSR + 19], random
permutations during training [YDY + 19], architectures that improve the efficiency of sampling [DYY + 19], improvements in
data and training procedures [LOG + 19], and efficiency increases in the embedding parameters [LCG + 19]. Many of
these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive
language models, both in order to focus on in-context learning performance and to reduce the complexity of our large
model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3's
performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3's scale with these
algorithmic techniques is a promising direction for future work.
8 Conclusion
We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and
benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of
state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at
tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning.
We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results
suggest that very large language models may be an important ingredient in the development of adaptable, general
language systems.
Acknowledgements
The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub
Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea
Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up
this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura
Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early
discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments,
Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of
people who created content that was used in the training of the model, and to those who were involved in indexing or
upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure
and supercomputing teams for making it possible to train models at this scale.
Contributions
Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and Jeffrey Wu
implemented the large-scale models, training infrastructure, and model-parallel strategies.
Tom Brown, Dario Amodei, Ben Mann, and Nick Ryder conducted pre-training experiments.
Ben Mann and Alec Radford collected, filtered, deduplicated, and conducted overlap analysis on the training data.
Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Tom Henighan, and
Girish Sastry implemented the downstream tasks and the software framework for supporting them, including creation
of synthetic tasks.
Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show continued gains, and
applied scaling laws to help predict and guide model and data scaling decisions for the research.
Ben Mann implemented sampling without replacement during training.
Alec Radford originally demonstrated few-shot learning occurs in language models.
Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and systematically
studied in-context learning curves, task prompting, and evaluation methods.
Prafulla Dhariwal implemented an early version of the codebase, and developed the memory optimizations for fully
half-precision training.
Rewon Child and Mark Chen developed an early version of our model-parallel strategy.
Rewon Child and Scott Gray contributed the sparse transformer.
Aditya Ramesh experimented with loss scaling strategies for pretraining.
Melanie Subbiah and Arvind Neelakantan implemented, experimented with, and tested beam search.
Pranav Shyam worked on SuperGLUE and assisted with connections to few-shot learning and meta-learning literature.
Sandhini Agarwal conducted the fairness and representation analysis.
Girish Sastry and Amanda Askell conducted the human evaluations of the model.
Ariel Herbert-Voss conducted the threat analysis of malicious use.
Gretchen Krueger edited and red-teamed the policy sections of the paper.
Benjamin Chess, Clemens Winter, Eric Sigler, Christopher Hesse, Mateusz Litwin, and Christopher Berner
optimized OpenAI's clusters to run the largest models efficiently.
Scott Gray developed fast GPU kernels used during training.
Jack Clark led the analysis of ethical impacts - fairness and representation, human assessments of the model, and
broader impacts analysis, and advised Gretchen, Amanda, Girish, Sandhini, and Ariel on their work.
Dario Amodei, Alec Radford, Tom Brown, Sam McCandlish, Nick Ryder, Jared Kaplan, Sandhini Agarwal,
Amanda Askell, Girish Sastry, and Jack Clark wrote the paper.
Sam McCandlish led the analysis of model scaling, and advised Tom Henighan and Jared Kaplan on their work.
Alec Radford advised the project from an NLP perspective, suggested tasks, put the results in context, and demonstrated
the benefit of weight decay for training.
Ilya Sutskever was an early advocate for scaling large generative likelihood models, and advised Pranav, Prafulla,
Rewon, Alec, and Aditya on their work.
Dario Amodei designed and led the research.
A Details of Common Crawl Filtering
As mentioned in Section 2.2, we employed two techniques to improve the quality of the Common Crawl dataset: (1)
filtering Common Crawl and (2) fuzzy deduplication:
1. In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low
quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier
to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by
prioritizing documents which were predicted by the classifier to be higher quality. The classifier is a
logistic regression model trained on features from Spark's standard tokenizer and HashingTF. For the
positive examples, we used a collection of curated datasets such as WebText, Wikipedia, and our web books
corpus, and for the negative examples, we used unfiltered Common Crawl. We used
this classifier to score Common Crawl documents. We kept each document in our dataset iff
<<FORMULA>>
We chose <<FORMULA>> in order to take mostly documents the classifier scored highly, but still include some documents
that were out of distribution. <<FORMULA>> was chosen to match the distribution of scores from our classifier on WebText.
We found this re-weighting increased quality as measured by loss on a range of out-of-distribution generative
text samples.
2. To further improve model quality and prevent overfitting (which becomes increasingly important as model
capacity increases), we fuzzily deduplicated documents (i.e. removed documents with high overlap with
other documents) within each dataset using Spark's MinHashLSH implementation with 10 hashes, using the
same features as were used for classification above. We also fuzzily removed WebText from Common Crawl.
Overall this decreased dataset size by an average of 10%.
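To make the fuzzy deduplication step concrete, the following is a minimal pure-Python sketch of MinHash with 10 hash functions. It is illustrative only and is not the Spark MinHashLSH pipeline described above: the 5-word shingle features, the SHA-1 based hash family, the 0.8 similarity threshold, and the pairwise comparison (instead of LSH bucketing) are all assumptions introduced for this example.

```python
import hashlib
import re

NUM_HASHES = 10  # the text reports using 10 hashes


def shingles(text, k=5):
    """Set of k-word shingles of a lowercased document (k=5 is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(features, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all features."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept.

    The threshold and the quadratic pairwise scan are assumptions; a
    production system would bucket signatures with LSH instead.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

With only 10 hashes the similarity estimate is coarse (it moves in steps of 0.1), which is one reason a threshold has to be chosen with care.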
After filtering for duplicates and quality, we also partially removed text occurring in benchmark datasets, described in
Appendix C.
B Details of Model Training
To train all versions of GPT-3, we use Adam with <<FORMULA>>, we clip the global norm of the
gradient at 1.0, and we use cosine decay for learning rate down to 10% of its value, over 260 billion tokens (after 260
billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the first 375
million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over
the first 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during
training (until an epoch boundary is reached) to minimize overfitting. All models use weight decay of 0.1 to provide a
small amount of regularization [LH17].
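The schedule described above can be sketched as a function of the number of tokens processed. This is a minimal illustration under stated assumptions: the text does not say whether the cosine decay is measured from the end of warmup or from the first token (the sketch starts it after warmup), "32k" is taken to mean 32,000 tokens, and the function names are hypothetical.

```python
import math

WARMUP_TOKENS = 375e6   # linear warmup over the first 375 million tokens
DECAY_TOKENS = 260e9    # cosine decay ends at 260 billion tokens
FINAL_FRACTION = 0.1    # decay to 10% of the peak learning rate


def learning_rate(tokens_seen, peak_lr):
    """Linear warmup, then cosine decay to 10% of peak, then constant."""
    if tokens_seen < WARMUP_TOKENS:
        return peak_lr * tokens_seen / WARMUP_TOKENS
    if tokens_seen >= DECAY_TOKENS:
        return peak_lr * FINAL_FRACTION
    # Assumption: decay progress is measured from the end of warmup.
    progress = (tokens_seen - WARMUP_TOKENS) / (DECAY_TOKENS - WARMUP_TOKENS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (FINAL_FRACTION + (1.0 - FINAL_FRACTION) * cosine)


def batch_size_tokens(tokens_seen, full_batch, ramp_tokens, start_batch=32_000):
    """Linear batch-size ramp from start_batch to full_batch over ramp_tokens.

    ramp_tokens is between 4 and 12 billion depending on model size.
    """
    if tokens_seen >= ramp_tokens:
        return full_batch
    return start_batch + (full_batch - start_batch) * tokens_seen / ramp_tokens
```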
During training we always train on sequences of the full $n_{ctx} = 2048$ token context window, packing multiple
documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency.
Sequences with multiple documents are not masked in any special way but instead documents within a sequence
are delimited with a special end-of-text token, giving the language model the information necessary to infer that
context separated by the end-of-text token is unrelated. This allows for efficient training without need for any special
sequence-specific masking.
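A minimal sketch of this packing scheme is shown below. The token id used for the end-of-text delimiter and the choice to drop the final partial window are assumptions made for the example; the text specifies neither.

```python
N_CTX = 2048          # context window length from the text
END_OF_TEXT = 50256   # assumed id for the end-of-text token (hypothetical here)


def pack_documents(token_docs, n_ctx=N_CTX, eot=END_OF_TEXT):
    """Concatenate tokenized documents, delimited by eot, into full windows.

    No attention masking is applied between documents; the delimiter alone
    signals that neighbouring documents are unrelated.
    """
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eot)
    n_full = len(stream) // n_ctx
    # Assumption: the trailing partial window is dropped rather than padded.
    return [stream[i * n_ctx:(i + 1) * n_ctx] for i in range(n_full)]
```

For instance, with a toy window of 4 tokens and delimiter 0, the documents `[1, 2]`, `[3]` and `[4, 5, 6]` pack into the windows `[1, 2, 0, 3]` and `[0, 4, 5, 6]`.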
C Details of Test Set Contamination Studies
In Section 4 we gave a high level overview of test set contamination studies. In this section we provide details on
methodology and results.
Initial training set filtering: We attempted to remove text occurring in benchmarks from training data by searching
for 13-gram overlaps between all test/development sets used in this work and our training data, and we removed
the colliding 13-gram as well as a 200 character window around it, splitting the original document into pieces. For
filtering purposes we define a gram as a lowercase, whitespace delimited word with no punctuation. Pieces less than
200 characters long were discarded. Documents split into more than 10 pieces were considered contaminated and
removed entirely. Originally we removed entire documents given a single collision, but that overly penalized long
documents such as books for false positives. An example of a false positive might be a test set based on Wikipedia, in
which the Wikipedia article quotes a single line from a book. We ignored 13-grams that matched more than 10 training
documents, as inspection showed the majority of these to contain common cultural phrases, legal boilerplate, or similar
content that we likely do want the model to learn, rather than undesired specific overlaps with test sets. Examples for
various frequencies can be found in the GPT-3 release repository.
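The filtering rule above can be sketched for a single document as follows. This is an illustration under stated assumptions, not the released implementation: the 200 character window is interpreted as 200 characters on each side of the collision, the piece count is taken before short pieces are discarded, and the step that ignores 13-grams matching more than 10 training documents is omitted. All function names are hypothetical.

```python
import re
import string

NGRAM = 13
WINDOW = 200       # characters removed on each side of a collision (assumed reading)
MIN_PIECE = 200    # pieces shorter than this are discarded
MAX_PIECES = 10    # documents split into more pieces are removed entirely

_PUNCT = str.maketrans("", "", string.punctuation)


def grams_with_spans(text):
    """Lowercase, punctuation-free, whitespace-delimited words with char spans."""
    out = []
    for m in re.finditer(r"\S+", text):
        word = m.group().lower().translate(_PUNCT)
        if word:
            out.append((word, m.start(), m.end()))
    return out


def ngram_set(text, n=NGRAM):
    """All n-grams of a benchmark text, as tuples of normalized words."""
    words = [w for w, _, _ in grams_with_spans(text)]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def filter_document(text, benchmark_ngrams, n=NGRAM):
    """Return surviving pieces of text, or [] if nothing should be kept."""
    grams = grams_with_spans(text)
    cuts = []
    for i in range(len(grams) - n + 1):
        if tuple(w for w, _, _ in grams[i:i + n]) in benchmark_ngrams:
            start = max(0, grams[i][1] - WINDOW)
            end = min(len(text), grams[i + n - 1][2] + WINDOW)
            cuts.append((start, end))
    if not cuts:
        return [text]
    pieces, cursor = [], 0
    for start, end in cuts:  # cuts are already ordered by start position
        if start > cursor:
            pieces.append(text[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(text):
        pieces.append(text[cursor:])
    if len(pieces) > MAX_PIECES:
        return []  # treated as contaminated
    return [p for p in pieces if len(p) >= MIN_PIECE]
```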
Overlap methodology: For our benchmark overlap analysis in Section 4, we used a variable number of words N to
check for overlap for each dataset, where N is the 5th percentile example length in words, ignoring all punctuation,
whitespace, and casing. Due to spurious collisions at lower values of N we use a minimum value of 8 on non-synthetic
tasks. For performance reasons, we set a maximum value of 13 for all tasks. Values for N and the amount of data
marked as dirty are shown in Table C.1. Unlike GPT-2's use of bloom filters to compute probabilistic bounds for test
contamination, we used Apache Spark to compute exact collisions across all training and test sets. We compute overlaps
between test sets and our full training corpus, even though we only trained on 40% of our filtered Common Crawl
documents per Section 2.2.
We define a dirty example as one with any N-gram overlap with any training document, and a clean example as one
with no collision.
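The choice of N and the dirty/clean definition can be illustrated with a small in-memory sketch. The nearest-rank percentile convention and the helper names are assumptions; the analysis in the text was run with Apache Spark over the full corpus, not with code like this.

```python
import math
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT).split()


def choose_n(examples, synthetic=False, min_n=8, max_n=13):
    """N = 5th percentile example length in words, clamped as described."""
    lengths = sorted(len(normalize(e)) for e in examples)
    # Nearest-rank percentile; the exact percentile convention is an assumption.
    rank = max(0, math.ceil(0.05 * len(lengths)) - 1)
    n = lengths[rank]
    if not synthetic:
        n = max(n, min_n)   # minimum of 8 on non-synthetic tasks
    return min(n, max_n)    # maximum of 13 for all tasks


def ngrams(words, n):
    """Set of word n-grams; empty when the example has fewer than n words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_dirty(example, training_ngrams, n):
    """Dirty if any N-gram of the example collides with a training N-gram."""
    return not ngrams(normalize(example), n).isdisjoint(training_ngrams)
```

Note that an example shorter than N words has no N-grams and is therefore always counted as clean in this sketch.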
Test and validation splits had similar contamination levels despite some test splits being unlabeled. Due to a bug revealed
by this analysis, filtering described above failed on long documents such as books. Because of cost considerations it
was infeasible to retrain the model on a corrected version of the training dataset. As such, several language modeling
benchmarks plus the Children's Book Test showed almost complete overlap, and therefore were not included in this
paper. Overlaps are shown in Table C.1.
Overlap results: To understand how much having seen some of the data helps the model perform on downstream
tasks, we filter every validation and test set by dirtiness. Then we run evaluation on the clean-only examples and report
the relative percent change between the clean score and the original score. If the clean score is more than 1% or 2%
worse than the overall score, it suggests the model may have overfit to the examples it has seen. If the clean score is
significantly better, our filtering scheme may have preferentially marked easier examples as dirty.
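As a small worked illustration of the quantity being reported, the sketch below computes the clean-only score, the overall score, and their relative percent difference for an accuracy-style metric. The input format (a list of dirty/correct flags per example) is an assumption made for the example.

```python
def relative_change(clean_score, all_score):
    """Percent change of the clean-only score relative to the overall score."""
    return 100.0 * (clean_score - all_score) / all_score


def clean_subset_report(results):
    """results: list of (is_dirty, is_correct) pairs for one benchmark."""
    all_score = sum(correct for _, correct in results) / len(results)
    clean = [correct for dirty, correct in results if not dirty]
    clean_score = sum(clean) / len(clean)
    return {
        "all_score": all_score,
        "clean_score": clean_score,
        "clean_percentage": 100.0 * len(clean) / len(results),
        "relative_difference": relative_change(clean_score, all_score),
    }
```

A negative relative difference means the model does worse on clean examples than on the full set, which is the direction that would suggest memorization.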
This overlap metric tends to show a high rate of false positives for datasets that contain background information (but
not answers) drawn from the web (such as SQuAD, which draws from Wikipedia) or examples less than 8 words
long, which we ignored in our filtering process (except for word scrambling tasks). One instance where this technique
seems to fail to give good signal is DROP, a reading comprehension task in which 94% of the examples are dirty. The
information required to answer the question is in a passage provided to the model, so having seen the passage during
training but not the questions and answers does not meaningfully constitute cheating. We confirmed that every matching
training document contained only the source passage, and none of the questions and answers in the dataset. The more
likely explanation for the decrease in performance is that the 6% of examples that remain after filtering come from a
slightly different distribution than the dirty examples.
Figure 4.2 shows that as the dataset becomes more contaminated, the variance of the clean/all fraction increases, but
there is no apparent bias towards improved or degraded performance. This suggests that GPT-3 is relatively insensitive
to contamination. See Section 4 for details on the datasets we flagged for further review.
<<TABLE>>
Table C.1: Overlap statistics for all datasets sorted from dirtiest to cleanest. We consider a dataset example dirty if it
has a single N-gram collision with any document in our training corpus. “Relative Difference Clean vs All” shows the
percent change in performance between only the clean examples vs all the examples in the benchmark. “Count” shows
the number of examples. “Clean percentage” is the percent of examples that are clean vs total. For “Acc/F1/BLEU” we
use the metric specified in “Metric”. These scores come from evaluations with a different seed for the random examples
used for in-context learning, and will therefore differ slightly from the scores elsewhere in the paper.
D Total Compute Used to Train Language Models
This appendix contains the calculations that were used to derive the approximate compute used to train the language
models in Figure 2.2. As a simplifying assumption, we ignore the attention operation, as it typically uses less than 10%
of the total compute for the models we are analyzing.
Calculations can be seen in Table D.1 and are explained within the table caption.
<<TABLE>>
Table D.1: Starting from the right hand side and moving left, we begin with the number of training tokens that each
model was trained with. Next we note that since T5 uses an encoder-decoder model, only half of the parameters are
active for each token during a forward or backwards pass. We then note that each token is involved in a single addition
and a single multiply for each active parameter in the forward pass (ignoring attention). Then we add a multiplier of
3x to account for the backwards pass (as computing both $\partial \text{params} / \partial \text{loss}$ and $\partial \text{acts} / \partial \text{loss}$ uses a similar amount of compute as the
forwards pass). Combining the previous two numbers, we get the total flops per parameter per token. We multiply this
value by the total training tokens and the total parameters to yield the number of total flops used during training. We
report both flops and petaflop/s-days (each of which is 8.64e+19 flops).
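The arithmetic in this caption can be written out as a short sketch. One petaflop/s-day is $10^{15}$ operations per second sustained for 86,400 seconds, i.e. 8.64e19 floating point operations. The function name and the example inputs (175 billion parameters, 300 billion training tokens) are illustrative assumptions rather than values read from the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PFS_DAY_FLOPS = 1e15 * SECONDS_PER_DAY   # 8.64e19 flops per petaflop/s-day


def training_compute(total_params, training_tokens, active_fraction=1.0):
    """Approximate training compute, ignoring attention.

    Each active parameter costs one add and one multiply per token in the
    forward pass (2 flops), times 3 to include the backward pass.
    active_fraction is 0.5 for an encoder-decoder model such as T5.
    """
    flops_per_param_per_token = 2 * 3
    flops = flops_per_param_per_token * active_fraction * total_params * training_tokens
    return flops, flops / PFS_DAY_FLOPS


# Illustrative inputs: 175e9 parameters and 300e9 tokens (assumed for the example).
flops, pfs_days = training_compute(175e9, 300e9)
```

Under these assumed inputs the estimate is about 3.15e23 flops, or a few thousand petaflop/s-days.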
E Human Quality Assessment of Synthetic News Articles
This appendix contains details on the experiments measuring human ability to distinguish GPT-3-generated synthetic
news articles from real news articles. We first describe the experiments on the 200 word news articles, and then
describe the preliminary investigation of 500 word news articles generated by GPT-3.
Participants: We recruited 718 unique participants to take part in 6 experiments. 97 participants were excluded for
failing an internet check question, leaving a total of 621 participants: 343 male, 271 female, and 7 other. Mean
participant age was 38 years old. All participants were recruited through Positly, which maintains a whitelist of
high-performing workers from Mechanical Turk. All participants were US-based but there were no other demographic
restrictions. Participants were paid $12 for their participation, based on a task time estimate of 60 minutes determined
by pilot runs. In order to ensure that the sample of participants for each experiment quiz was unique, participants were
not allowed to take part in an experiment more than once.
Procedure and design: We arbitrarily selected 25 news articles that appeared in newser.com in early 2020. We used
the article titles and subtitles to produce outputs from the 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13.0B, and 175B
(GPT-3) parameter language models. Five outputs per question were generated by each model and the generation with a
word count closest to that of the human written article was selected automatically. This was to minimize the effect
that completion length might have on participants' judgments. The same output procedure was used for each model, with the
exception of the removal of the intentionally bad control model, as described in the main text.
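The automatic selection step can be sketched in a few lines. Counting words by whitespace splitting and breaking ties in favour of the earlier generation are assumptions; the text does not specify either.

```python
def select_by_length(generations, human_article):
    """Pick the generation whose word count is closest to the human article's.

    Words are counted by whitespace splitting (an assumption); on ties the
    earliest generation in the list is returned.
    """
    target = len(human_article.split())
    return min(generations, key=lambda g: abs(len(g.split()) - target))
```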
<<TABLE>>
Table E.1: Participant details and article lengths for each experiment to evaluate human detection of 200 word model
generated news articles. Participants were excluded due to internet check fails.
<<TABLE>>
Figure E.1: Participants spend more time trying to identify whether each news article is machine generated as model
size increases. Duration on the control model is indicated with the dashed line. Line of best fit is a linear model on a log
scale with 95% confidence intervals.
In each experiment, half of the participants were randomly assigned to quiz A and half were randomly assigned to quiz
B. Each quiz consisted of 25 articles: half (12-13) were human written and half (12-13) were model generated: the
articles with human written completions in quiz A had model generated completions in quiz B and vice versa. The
order of quiz questions was shuffled for each participant. Participants could leave comments and were asked to indicate
if they had seen the articles before. Participants were instructed not to look up the articles or their content during the
quiz and at the end of the quiz were asked if they had looked anything up during the quiz.
Statistical Tests: To compare means on the different runs, we performed a two-sample t-test for independent groups for
each model against the control. This was implemented in Python using the scipy.stats.ttest_ind function. When
plotting a regression line in the graph of average participant accuracy vs model size, we fit a power law of the form
$ax^{-b}$. The 95% confidence intervals were estimated from the t-distribution of the sample mean.
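For readers who want to reproduce the shape of these computations without extra dependencies, the sketch below computes the equal-variance two-sample t statistic by hand and fits a power law $y = a x^{-b}$ by least squares in log-log space. Both choices are assumptions for illustration: the analysis in the text used scipy.stats.ttest_ind (which also returns a p-value, not computed here), and the fitting procedure for the power law is not stated.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic for independent groups with pooled variance."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
             / sum((u - mean_x) ** 2 for u in log_x))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # (a, b)
```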
Duration statistics: In the main text, we discussed the finding that the ability of human participants to distinguish
model and human generated news articles decreases as our models become larger. We have also found that the
average time spent for a given set of questions increases as the model size increases, as shown in Figure E.1. Lower
accuracy scores despite increased time investment from participants support the finding that larger models generate
harder-to-distinguish news articles.
<<TABLE>>
Table E.2: Participant details and article lengths for the experiments investigating human detection of 500 word
model generated news articles. Participants were excluded due to internet check fails.
Preliminary investigation of 500 word articles: We recruited 160 unique US-based participants to take part in 2
experiments through Positly (details are given in Table E.2). We randomly selected 12 Reuters world news articles from
late 2019 and created a context for GPT-3 175B that consisted of a single Reuters article not in this set of 12. We then
used the article titles and Reuters locations to generate completions from GPT-3 175B and the 160M control model
from the previous experiments. These were used to create two 12-question quizzes per model, each consisting of half
human written and half model generated articles. Comprehension questions were added and articles were shown to
participants in 3 stages at 30 second intervals to encourage closer reading. Participants were paid $12 for this task.
Model generation selection methods, exclusion criteria, and statistical tests mirror those of the previous experiments.
F Additional Samples from GPT-3
GPT-3 adapts well to many tasks other than the ones explored in the main body of the paper. As an example, in Figure
F.1, we show four uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the
style of Wallace Stevens. We first experimented with a few prompts, then generated four samples with no additional
editing or selection (sampling at temperature 1 using nucleus sampling [HBFC19] with P = 0.9). Completions were
truncated when the model began to write a new title and author heading, or broke into prose commentary.
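For reference, a single step of nucleus (top-p) sampling at temperature 1 with P = 0.9 can be sketched as below. This is a generic illustration of the technique operating on a plain list of logits; it is not the sampling code used to produce the figure, and the function name and signature are hypothetical.

```python
import math
import random


def nucleus_sample(logits, top_p=0.9, temperature=1.0, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability reaches top_p."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample within the nucleus in proportion to the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With probabilities 0.5, 0.3, 0.15 and 0.05, the nucleus at P = 0.9 contains the first three tokens, so the least likely token is never sampled.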
<<FIGURE>>
Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace
Stevens with the title Shadows on the Way.
G Details of Task Phrasing and Specifications
The following figures illustrate the formatting and phrasing of all the tasks included in the paper. All data comes from
the ground truth datasets in this section, and no samples from GPT-3 are included here.
<<FIGURE>>
Figure G.1: Formatted dataset example for RACE-h. When predicting, we normalize by the unconditional probability
of each answer as described in Section 2.
<<FIGURE>>
Figure G.4: Formatted dataset example for PIQA
<<FIGURE>>
Figure G.5: Formatted dataset example for COPA
<<FIGURE>>
Figure G.6: Formatted dataset example for ReCoRD. We consider the context above to be a single “problem” because
this is how the task is presented in the ReCoRD dataset and scored in the ReCoRD evaluation script.
<<FIGURE>>
Figure G.8: Formatted dataset example for OpenBookQA. When predicting, we normalize by the unconditional
probability of each answer as described in Section 2.
Context → Making a cake: Several cake pops are shown on a display. A woman and girl
are shown making the cake pops in a kitchen. They
Correct Answer → bake them, then frost and decorate.
Incorrect Answer → taste them as they place them on plates.
Incorrect Answer → put the frosting on the cake as they pan it.
Incorrect Answer → come out and begin decorating the cake as well.
Figure G.9: Formatted dataset example for HellaSwag
<<FIGURE>>
Figure G.10: Formatted dataset example for ANLI R3
<<FIGURE>>
Figure G.11: Formatted dataset example for ARC (Challenge). When predicting, we normalize by the unconditional
probability of each answer as described in Section 2.
<<FIGURE>>
Figure G.12: Formatted dataset example for SAT Analogies
<<FIGURE>>
Figure G.14: Formatted dataset example for Winogrande. The partial evaluation method we use compares the
probability of the completion given a correct and incorrect context.
<<FIGURE>>
Figure G.15: Formatted dataset example for MultiRC. There are three levels within MultiRC: (1) the passage, (2) the
questions, and (3) the answers. During evaluation, accuracy is determined at the per-question level, with a question
being considered correct if and only if all the answers within the question are labeled correctly. For this reason, we use
K to refer to the number of questions shown within the context.
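The per-question scoring rule in this caption can be sketched as follows. The data layout (a dictionary from question id to a list of 0/1 answer labels) and the function name are assumptions made for illustration, not the official MultiRC evaluation script.

```python
def per_question_accuracy(predictions, labels):
    """Fraction of questions whose answer labels are all predicted correctly.

    predictions, labels: dict mapping a question id to a list of 0/1 labels,
    one per candidate answer (hypothetical layout).
    """
    correct = 0
    for qid, gold in labels.items():
        # A question counts only if every answer label matches.
        if predictions.get(qid) == gold:
            correct += 1
    return correct / len(labels)

gold = {"q1": [1, 0, 0], "q2": [0, 1], "q3": [1, 1, 0, 0]}
pred = {"q1": [1, 0, 0], "q2": [0, 0], "q3": [1, 1, 0, 0]}

# q2 has one wrong answer label, so the whole question is counted as wrong.
accuracy = per_question_accuracy(pred, gold)
```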
<<FIGURE>>
Figure G.16: Formatted dataset example for ARC (Easy). When predicting, we normalize by the unconditional
probability of each answer as described in Section 2.
<<FIGURE>>
Figure G.17: Formatted dataset example for StoryCloze
<<FIGURE>>
Figure G.18: Formatted dataset example for CoQA
<<FIGURE>>
Figure G.24: Formatted dataset example for Natural Questions
<<FIGURE>>
Figure G.26: Formatted dataset example for Symbol Insertion
<<FIGURE>>
Figure G.30: Formatted dataset example for CB
<<FIGURE>>
Figure G.32: Formatted dataset example for WiC
<<FIGURE>>
Figure G.36: Formatted dataset example for De→En. This is the format for one- and few-shot learning; for this and
other language tasks, the format for zero-shot learning is “Q: What is the {language} translation of {sentence} A:
{translation}.”
<<FIGURE>>
Figure G.49: Formatted dataset example for Arithmetic 4D+
<<FIGURE>>
Figure G.50: Formatted dataset example for Arithmetic 5D
<<FIGURE>>
Figure G.51: Formatted dataset example for Arithmetic 5D+
H Results on All Tasks for All Model Sizes
<<TABLE>>
Table H.1: Scores for every task, setting and model that we investigate in this paper.
<<FIGURE>>
Figure H.1: All results for all SuperGLUE tasks.
<<FIGURE>>
Figure H.2: Results for SAT task.
<<FIGURE>>
Figure H.3: All results for all Winograd tasks.
<<FIGURE>>
Figure H.4: All results for all Arithmetic tasks.
<<FIGURE>>
Figure H.5: All results for all Cloze and Completion tasks.
<<FIGURE>>
Figure H.6: All results for all Common Sense Reasoning tasks.
<<FIGURE>>
Figure H.7: All results for all QA tasks.
<<FIGURE>>
Figure H.8: All results for all Reading Comprehension tasks.
<<FIGURE>>
Figure H.9: All results for all ANLI rounds.
<<FIGURE>>
Figure H.10: All results for all Scramble tasks.
<<FIGURE>>
Figure H.11: All results for all Translation tasks.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Learning both Weights and Connections for Efficient Neural Networks
Song Han, Stanford University, songhan@stanford.edu
Jeff Pool, NVIDIA, jpool@nvidia.com
John Tran, NVIDIA, johntran@nvidia.com
William J. Dally, Stanford University and NVIDIA, dally@stanford.edu
Abstract
Neural networks are both computationally intensive and memory intensive, making
them difficult to deploy on embedded systems. Also, conventional networks fix
the architecture before training starts; as a result, training cannot improve the
architecture. To address these limitations, we describe a method to reduce the
storage and computation required by neural networks by an order of magnitude
without affecting their accuracy by learning only the important connections. Our
method prunes redundant connections using a three-step process. First, we train
the network to learn which connections are important. Next, we prune the
unimportant connections. Finally, we retrain the network to fine-tune the weights of the
remaining connections. On the ImageNet dataset, our method reduced the number
of parameters of AlexNet by a factor of 9×, from 61 million to 6.7 million, without
incurring accuracy loss. Similar experiments with VGG-16 found that the total
number of parameters can be reduced by 13×, from 138 million to 10.3 million,
again with no loss of accuracy.
1 Introduction
Neural networks have become ubiquitous in applications ranging from computer vision [1] to speech
recognition [2] and natural language processing [3]. We consider convolutional neural networks used
for computer vision tasks, which have grown over time. In 1998 LeCun et al. designed a CNN model
LeNet-5 with fewer than 1M parameters to classify handwritten digits [4], while in 2012, Krizhevsky
et al. [1] won the ImageNet competition with 60M parameters. Deepface classified human faces with
120M parameters [5], and Coates et al. [6] scaled up a network to 10B parameters.
While these large neural networks are very powerful, their size consumes considerable storage,
memory bandwidth, and computational resources. For embedded mobile applications, these resource
demands become prohibitive. Figure 1 shows the energy cost of basic arithmetic and memory
operations in a 45nm CMOS process. From this data we see the energy per connection is dominated
by memory access and ranges from 5pJ for 32-bit coefficients in on-chip SRAM to 640pJ for 32-bit
coefficients in off-chip DRAM [7]. Large networks do not fit in on-chip storage and hence require
the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at
20Hz would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access, well beyond the power
envelope of a typical mobile device. Our goal in pruning networks is to reduce the energy required to
run such large networks so they can run in real time on mobile devices. The model size reduction
from pruning also facilitates storage and transmission of mobile applications incorporating DNNs.
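The power estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes one 32-bit weight fetch per connection per inference, and the SRAM figure is computed purely for comparison, since the text notes that large networks do not fit in on-chip storage. The constant and function names are illustrative.

```python
INFERENCE_RATE_HZ = 20   # inferences per second, from the text
CONNECTIONS = 1e9        # 1 billion connections, from the text
DRAM_PJ = 640            # energy per 32-bit off-chip DRAM access, from the text
SRAM_PJ = 5              # energy per 32-bit on-chip SRAM access, from the text

def access_power_watts(rate_hz, connections, energy_pj):
    """Power spent on weight fetches, assuming one fetch per connection."""
    return rate_hz * connections * energy_pj * 1e-12  # pJ -> J

dram_power = access_power_watts(INFERENCE_RATE_HZ, CONNECTIONS, DRAM_PJ)
sram_power = access_power_watts(INFERENCE_RATE_HZ, CONNECTIONS, SRAM_PJ)

print(f"DRAM: {dram_power:.1f} W, SRAM: {sram_power:.1f} W")
```

Under these assumptions the DRAM case gives the 12.8W quoted above, while the same access count served from SRAM would cost 0.1W, a 128× difference that follows directly from the 640pJ and 5pJ figures.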
<<FIGURE>>
Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more
energy expensive than simple arithmetic.
To achieve this goal, we present a method to prune network connections in a manner that preserves the
original accuracy. After an initial training phase, we remove all connections whose weight is lower
than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first
phase learns the topology of the networks — learning which connections are important and removing
the unimportant connections. We then retrain the sparse network so the remaining connections can
compensate for the connections that have been removed. The phases of pruning and retraining may
be repeated iteratively to further reduce network complexity. In effect, this training process learns
the network connectivity in addition to the weights, much as in the mammalian brain [8][9], where
synapses are created in the first few months of a child's development, followed by gradual pruning of
little-used connections, falling to typical adult values.
2 Related Work
Neural networks are typically over-parameterized, and there is significant redundancy in deep learning
models [10]. This results in a waste of both computation and memory. There have been various
proposals to remove the redundancy: Vanhoucke et al. [11] explored a fixed-point implementation
with 8-bit integer (vs. 32-bit floating point) activations. Denton et al. [12] exploited the linear
structure of the neural network by finding an appropriate low-rank approximation of the parameters
and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gong et al.
[13] compressed deep convnets using vector quantization. These approximation and quantization
techniques are orthogonal to network pruning, and they can be used together to obtain further gains
[14].
There have been other attempts to reduce the number of parameters of neural networks by replacing
the fully connected layer with global average pooling. The Network in Network architecture [15]
and GoogLeNet [16] achieve state-of-the-art results on several benchmarks by adopting this idea.
However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them
to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This
problem is noted by Szegedy et al. [16] and motivates them to add a linear layer on the top of their
networks to enable transfer learning.
Network pruning has been used both to reduce network complexity and to reduce over-fitting. An
early approach to pruning was biased weight decay [17]. Optimal Brain Damage [18] and Optimal
Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the
loss function and suggest that such pruning is more accurate than magnitude-based pruning such as
weight decay. However, evaluating second-order derivatives requires additional computation.
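To illustrate why a Hessian-based criterion can disagree with a magnitude-based one, the sketch below compares the two rankings on three weights. It uses the diagonal-Hessian saliency of Optimal Brain Damage, s_k = h_kk w_k^2 / 2; the weight and curvature values are hypothetical and chosen only to make the disagreement visible.

```python
import numpy as np

# Hypothetical weights and diagonal second derivatives of the loss.
weights = np.array([0.10, -0.50, 0.80])
hessian_diag = np.array([100.0, 0.2, 1.0])

# Magnitude criterion: prune the weight closest to zero.
magnitude = np.abs(weights)

# OBD-style saliency: estimated increase in loss if the weight is set to zero,
# under a diagonal quadratic approximation of the loss.
saliency = 0.5 * hessian_diag * weights ** 2

prune_by_magnitude = int(np.argmin(magnitude))
prune_by_saliency = int(np.argmin(saliency))
```

Here the smallest weight sits in a high-curvature direction, so the saliency criterion removes a larger weight whose removal is estimated to change the loss less. The cost is that the second derivatives must be obtained first.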
HashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly
group connection weights into hash buckets, so that all connections within the same hash bucket
share a single parameter value. This technique may benefit from pruning. As pointed out in Shi et al.
[21] and Weinberger et al. [22], sparsity will minimize hash collisions, making feature hashing even
more effective. HashedNets may be used together with pruning to give even better parameter savings.
<<FIGURE>>
Figure 2: Three-Step Training Pipeline.
<<FIGURE>>
Figure 3: Synapses and neurons before and after pruning.
3 Learning Connections in Addition to Weights
Our pruning method employs a three-step process, as illustrated in Figure 2, which begins by learning
the connectivity via normal network training. Unlike conventional training, however, we are not
learning the final values of the weights, but rather we are learning which connections are important.
The second step is to prune the low-weight connections. All connections with weights below a
threshold are removed from the network — converting a dense network into a sparse network, as
shown in Figure 3. The final step retrains the network to learn the final weights for the remaining
sparse connections. This step is critical. If the pruned network is used without retraining, accuracy is
significantly impacted.
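The three steps can be illustrated on a toy problem. The following is a minimal sketch, not the authors' Caffe implementation: it assumes a single linear layer trained by plain gradient descent on synthetic data, and the names `train`, `mask` and the threshold value 0.1 are chosen only for illustration.

```python
import random

def train(w, mask, data, lr=0.05, epochs=300, l2=1e-3):
    """Gradient descent on mean squared error with L2 regularization.
    Weights whose mask entry is 0 are held at exactly zero."""
    n = len(data)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [(wj - lr * (gj + l2 * wj)) * mj
             for wj, gj, mj in zip(w, grad, mask)]
    return w

# Synthetic regression task in which only three of six inputs matter.
random.seed(0)
true_w = [2.0, 0.0, -1.5, 0.0, 0.0, 0.5]
data = []
for _ in range(200):
    x = [random.gauss(0.0, 1.0) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

# Step 1: ordinary training, used here to learn which connections matter.
dense = train([0.0] * len(true_w), [1] * len(true_w), data)

# Step 2: remove every connection whose magnitude is below the threshold.
threshold = 0.1  # illustrative value
mask = [1 if abs(wj) >= threshold else 0 for wj in dense]

# Step 3: retrain the surviving connections, starting from their trained values.
sparse = train([wj * mj for wj, mj in zip(dense, mask)], mask, data)
```

Retraining starts from the surviving weights rather than from a fresh initialization, and the mask keeps pruned connections at zero throughout.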
3.1 Regularization
Choosing the correct regularization impacts the performance of pruning and retraining. L1 regularization
penalizes non-zero parameters resulting in more parameters near zero. This gives better accuracy
after pruning, but before retraining. However, the remaining connections are not as good as with L2
regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the best
pruning results. This is further discussed in the experiment section.
3.2 Dropout Ratio Adjustment
Dropout [23] is widely used to prevent over-fitting, and this also applies to retraining. During
retraining, however, the dropout ratio must be adjusted to account for the change in model capacity.
In dropout, each parameter is probabilistically dropped during training, but will come back during
inference. In pruning, parameters are dropped forever after pruning and have no chance to come back
during both training and inference. As the parameters get sparse, the classifier will select the most
informative predictors and thus have much less prediction variance, which reduces over-fitting. As
pruning already reduced model capacity, the retraining dropout ratio should be smaller.
Quantitatively, let C_i be the number of connections in layer i (C_io for the original network, C_ir for
the network after retraining) and N_i be the number of neurons in layer i. Since dropout works on neurons,
and C_i varies quadratically with N_i according to Equation 1, the dropout ratio after pruning the
parameters should follow Equation 2, where D_o represents the original dropout rate and D_r represents the
dropout rate during retraining.
<<FORMULA>> (1)
<<FORMULA>> (2)
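The two equations are not reproduced in this text. As a hedged sketch, assume Equation 1 has the form C_i = N_i * N_(i-1) and Equation 2 the square-root form D_r = D_o * sqrt(C_ir / C_io), which is what a quadratic dependence of connections on neurons suggests. Under that assumption the adjustment is a one-line computation (the function name is hypothetical):

```python
import math

def retrain_dropout(d_o, c_io, c_ir):
    """Assumed form of Equation 2: scale the original dropout rate d_o by the
    square root of the fraction of connections that survive pruning."""
    return d_o * math.sqrt(c_ir / c_io)

# Illustrative numbers: a layer keeping a quarter of its connections
# would halve its dropout rate under this assumed formula.
d_r = retrain_dropout(0.5, 1_000_000, 250_000)
```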
3.3 Local Pruning and Parameter Co-adaptation
During retraining, it is better to retain the weights from the initial training phase for the connections
that survived pruning than it is to re-initialize the pruned layers. CNNs contain fragile co-adapted
features [24]: gradient descent is able to find a good solution when the network is initially trained,
but not after re-initializing some layers and retraining them. So when we retrain the pruned layers,
we should keep the surviving parameters instead of re-initializing them.
Table 1: Network pruning can save 9x to 13x parameters with no drop in predictive performance.
<<TABLE>>
Retraining the pruned layers starting with retained weights requires less computation because we
don't have to backpropagate through the entire network. Also, neural networks are prone to suffer
the vanishing gradient problem [25] as the networks get deeper, which makes pruning errors harder to
recover from for deep networks. To prevent this, we fix the parameters for CONV layers and only retrain
the FC layers after pruning the FC layers, and vice versa.
3.4 Iterative Pruning
Learning the right connections is an iterative process. Pruning followed by retraining is one iteration;
after many such iterations the minimum number of connections can be found. Without loss of accuracy,
this method can boost the pruning rate from 5x to 9x on AlexNet compared with single-step aggressive
pruning. Each iteration is a greedy search in that we find the best connections. We also experimented
with probabilistically pruning parameters based on their absolute value, but this gave worse results.
3.5 Pruning Neurons
After pruning connections, neurons with zero input connections or zero output connections may be
safely pruned. This pruning is furthered by removing all connections to or from a pruned neuron.
The retraining phase automatically arrives at the result where dead neurons will have both zero input
connections and zero output connections. This occurs due to gradient descent and regularization.
A neuron that has zero input connections (or zero output connections) will have no contribution
to the final loss, leading the gradient to be zero for its output connection (or input connection),
respectively. Only the regularization term will push the weights to zero. Thus, the dead neurons will
be automatically removed during retraining.
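A minimal sketch of the neuron-removal step for one hidden layer follows. It assumes weights are stored as nested lists and ignores bias terms; the function name and layout are illustrative, not taken from the paper's implementation.

```python
def prune_dead_neurons(w_in, w_out):
    """Drop hidden neurons that have no incoming or no outgoing connections.

    w_in[j][i]  : weight from input i to hidden neuron j
    w_out[k][j] : weight from hidden neuron j to output k
    Returns the reduced matrices and the indices of the surviving neurons.
    """
    alive = [
        j for j in range(len(w_in))
        if any(v != 0 for v in w_in[j]) and any(row[j] != 0 for row in w_out)
    ]
    new_in = [w_in[j] for j in alive]
    new_out = [[row[j] for j in alive] for row in w_out]
    return new_in, new_out, alive

w_in = [[0.5, 0.0, 0.2],   # neuron 0: has inputs
        [0.0, 0.0, 0.0],   # neuron 1: no inputs left after pruning
        [0.1, 0.3, 0.0],   # neuron 2: has inputs but no outputs (see w_out)
        [0.0, 0.4, 0.0]]   # neuron 3: has inputs
w_out = [[1.0, 0.7, 0.0, -0.2],
         [0.0, 0.1, 0.0, 0.3]]
new_in, new_out, alive = prune_dead_neurons(w_in, w_out)
```

Removing a neuron deletes its row in the incoming matrix and its column in the outgoing matrix, which is the "removing all connections to or from a pruned neuron" step described above.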
4 Experiments
We implemented network pruning in Caffe [26]. Caffe was modified to add a mask which disregards
pruned parameters during network operation for each weight tensor. The pruning threshold is chosen
as a quality parameter multiplied by the standard deviation of a layer's weights. We carried out the
experiments on Nvidia TitanX and GTX980 GPUs.
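The threshold rule and the mask can be sketched as follows. This is an illustration rather than the modified Caffe code: the text does not say whether the population or sample standard deviation is used (the population form is assumed here), and the function name and quality value are hypothetical.

```python
import statistics

def prune_mask(weights, quality):
    """Return a 0/1 mask and the threshold, where the threshold is
    quality * standard deviation of the layer's weights."""
    threshold = quality * statistics.pstdev(weights)
    mask = [1 if abs(w) > threshold else 0 for w in weights]
    return mask, threshold

layer = [0.9, -0.05, 0.02, -1.1, 0.3, -0.01, 0.04, 0.6]
mask, threshold = prune_mask(layer, quality=0.5)

# Applying the mask zeroes the pruned parameters; the same mask would be
# reapplied after every update so that pruned weights stay at zero.
masked_layer = [w * m for w, m in zip(layer, mask)]
```

Scaling the threshold by each layer's standard deviation lets one quality parameter adapt to layers whose weights have different magnitudes.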
We pruned four representative networks: LeNet-300-100 and LeNet-5 on MNIST, together with
AlexNet and VGG-16 on ImageNet. The network parameters and accuracy¹ before and after pruning
are shown in Table 1.
4.1 LeNet on MNIST
We first experimented on the MNIST dataset with the LeNet-300-100 and LeNet-5 networks [4].
LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each, which
achieves 1.6% error rate on MNIST. LeNet-5 is a convolutional network that has two convolutional
layers and two fully connected layers, which achieves 0.8% error rate on MNIST. After pruning,
the network is retrained with 1/10 of the original network's original learning rate. Table 1 shows
pruning saves 12x parameters on these networks. For each layer of the network the table shows (left
to right) the original number of weights, the number of floating point operations to compute that
layer's activations, the average percentage of activations that are non-zero, the percentage of non-zero
weights after pruning, and the percentage of actually required floating point operations.
¹ Reference model is from the Caffe model zoo; accuracy is measured without data augmentation.
Table 2: For LeNet-300-100, pruning reduces the number of weights by 12x and computation by 12x.
<<TABLE>>
Table 3: For LeNet-5, pruning reduces the number of weights by 12x and computation by 6x.
<<TABLE>>
<<FIGURE>>
Figure 4: Visualization of the first FC layer's sparsity pattern of LeNet-300-100. It has a banded
structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images,
since the digits are written in the center.
An interesting byproduct is that network pruning detects visual attention regions. Figure 4 shows the
sparsity pattern of the first fully connected layer of LeNet-300-100, whose matrix size is 784x300. It
has 28 bands, each of width 28, corresponding to the 28x28 input pixels. The colored regions
of the figure, indicating non-zero parameters, correspond to the center of the image. Because digits
are written in the center of the image, these are the important parameters. The graph is sparse on the
left and right, corresponding to the less important regions on the top and bottom of the image. After
pruning, the neural network finds the center of the image more important, and the connections to the
peripheral regions are more heavily pruned.
4.2 AlexNet on ImageNet
We further examine the performance of pruning on the ImageNet ILSVRC-2012 dataset, which
has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as the
reference model, which has 61 million parameters across 5 convolutional layers and 3 fully connected
layers. The AlexNet Caffe model achieved a top-1 accuracy of 57.2% and a top-5 accuracy of 80.3%.
The original AlexNet took 75 hours to train on an NVIDIA Titan X GPU. After pruning, the whole
network is retrained with 1/100 of the original network's initial learning rate. It took 173 hours to
retrain the pruned AlexNet. Pruning is not used when iteratively prototyping the model, but rather
used for model reduction when the model is ready for deployment. Thus, the retraining time is less
of a concern. Table 1 shows that AlexNet can be pruned to 1/9 of its original size without impacting
accuracy, and the amount of computation can be reduced by 3x.
Table 4: For AlexNet, pruning reduces the number of weights by 9x and computation by 3x.
<<TABLE>>
Table 5: For VGG-16, pruning reduces the number of weights by 12x and computation by 5x.
<<TABLE>>
4.3 VGG-16 on ImageNet
With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16 [27],
on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional layers but still only three
fully-connected layers. Following a similar methodology, we aggressively pruned both convolutional
and fully-connected layers to realize a significant reduction in the number of weights, shown in
Table 5. We used five iterations of pruning and retraining.
The VGG-16 results are, like those for AlexNet, very promising. The network as a whole has
been reduced to 7.5% of its original size (13x smaller). In particular, note that the two largest
fully-connected layers can each be pruned to less than 4% of their original size. This reduction is
critical for real time image processing, where there is little reuse of fully connected layers across
images (unlike batch processing during training).
5 Discussion
The trade-off curve between accuracy and number of parameters is shown in Figure 5. The more
parameters pruned away, the lower the accuracy. We experimented with L1 and L2 regularization, with
and without retraining, together with iterative pruning to give five trade-off lines. Comparing solid and
dashed lines, the importance of retraining is clear: without retraining, accuracy begins dropping much
sooner, with 1/3 of the original connections rather than with 1/10 of the original connections.
It's interesting to see that we have the “free lunch” of reducing the connections by 2x without losing
accuracy even without retraining, while with retraining we are able to reduce connections by 9x.
<<FIGURE>>
Figure 5: Trade-off curve for parameter reduction and loss in top-5 accuracy. L1 regularization
performs better than L2 at learning the connections without retraining, while L2 regularization
performs better than L1 at retraining. Iterative pruning gives the best result.
<<FIGURE>>
Figure 6: Pruning sensitivity for CONV layer (left) and FC layer (right) of AlexNet.
L1 regularization gives better accuracy than L2 directly after pruning (dotted blue and purple lines)
since it pushes more parameters closer to zero. However, comparing the yellow and green lines shows
that L2 outperforms L1 after retraining, since there is no benefit to further pushing values towards
zero. One extension is to use L1 regularization for pruning and then L2 for retraining, but this did not
beat simply using L2 for both phases. Parameters from one mode do not adapt well to the other.
The biggest gain comes from iterative pruning (solid red line with solid circles). Here we take the
pruned and retrained network (solid green line with circles) and prune and retrain it again. The
leftmost dot on this curve corresponds to the point on the green line at 80% (5x pruning) pruned to
8x. There's no accuracy loss at 9x. Not until 10x does the accuracy begin to drop sharply.
Two green points achieve slightly better accuracy than the original model. We believe this accuracy
improvement is due to pruning finding the right capacity of the network and hence reducing overfitting.
Both CONV and FC layers can be pruned, but with different sensitivity. Figure 6 shows the sensitivity
of each layer to network pruning. The figure shows how accuracy drops as parameters are pruned on
a layer-by-layer basis. The CONV layers (on the left) are more sensitive to pruning than the fully
connected layers (on the right). The first convolutional layer, which interacts with the input image
directly, is most sensitive to pruning. We suspect this sensitivity is due to the input layer having only
3 channels and thus less redundancy than the other convolutional layers. We used the sensitivity
results to find each layer's threshold: for example, the smallest threshold was applied to the most
sensitive layer, which is the first convolutional layer.
Storing the pruned layers as sparse matrices has a storage overhead of only 15.6%. Storing relative
rather than absolute indices reduces the space taken by the FC layer indices to 5 bits. Similarly,
CONV layer indices can be represented with only 8 bits.
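Relative indexing can be sketched as below. The text does not say how gaps larger than the 5-bit range are handled; this sketch assumes they are bridged by explicit filler zeros, one common way of doing it. Function names are illustrative.

```python
def encode_relative(dense, bits=5):
    """Encode a sparse vector as (gap, value) pairs, where gap is the distance
    from the previously stored position and must fit in `bits` bits."""
    max_gap = (1 << bits) - 1
    entries = []
    prev = -1
    for idx, value in enumerate(dense):
        if value == 0:
            continue
        gap = idx - prev
        while gap > max_gap:
            # Assumed scheme: bridge an oversized gap with a stored zero.
            entries.append((max_gap, 0.0))
            prev += max_gap
            gap = idx - prev
        entries.append((gap, value))
        prev = idx
    return entries

def decode_relative(entries, length):
    """Rebuild the dense vector from (gap, value) pairs."""
    dense = [0.0] * length
    pos = -1
    for gap, value in entries:
        pos += gap
        dense[pos] = value
    return dense

vec = [0.0] * 100
vec[3], vec[40], vec[41], vec[99] = 0.7, -0.2, 0.5, 1.1
codes = encode_relative(vec, bits=5)
```

Each stored index needs only `bits` bits instead of a full absolute index, at the cost of an occasional filler entry when non-zero weights are far apart.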
Table 6: Comparison with other model reduction methods on AlexNet. Data-free pruning [28]
saved only 1.5x parameters with much loss of accuracy. Deep Fried Convnets [29] worked on fully
connected layers only and reduced the parameters by less than 4x. [30] reduced the parameters by
4x with inferior accuracy. Naively cutting the layer size saves parameters but suffers from 4% loss
of accuracy. [12] exploited the linear structure of convnets and compressed each layer individually,
where model compression on a single layer incurred 0.9% accuracy penalty with biclustering + SVD.
<<FIGURE>>
Figure 7: Weight distribution before and after parameter pruning. The right figure has 10x smaller
scale.
After pruning, the storage requirements of AlexNet and VGGNet are small enough that all weights
can be stored on chip, instead of off-chip DRAM which takes orders of magnitude more energy to
access (Table 1). We are targeting our pruning method for fixed-function hardware specialized for
sparse DNN, given the limitation of general purpose hardware on sparse computation.
Figure 7 shows histograms of weight distribution before (left) and after (right) pruning. The weights
are from the first fully connected layer of AlexNet. The two panels have different y-axis scales.
The original distribution of weights is centered on zero with tails dropping off quickly. Almost all
parameters are between <<FORMULA>>. After pruning, the large center region is removed. The
network parameters adjust themselves during the retraining phase. The result is that the parameters
form a bimodal distribution and become more spread across the x-axis, between <<FORMULA>>.
6 Conclusion
We have presented a method to improve the energy efficiency and storage of neural networks without
affecting accuracy by finding the right connections. Our method, motivated in part by how learning
works in the mammalian brain, operates by learning which connections are important, pruning
the unimportant connections, and then retraining the remaining sparse network. We highlight our
experiments on AlexNet and VGGNet on ImageNet, showing that both fully connected layers and
convolutional layers can be pruned, reducing the number of connections by 9x to 13x without loss of
accuracy. This leads to smaller memory capacity and bandwidth requirements for real-time image
processing, making the networks easier to deploy on mobile systems.
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pages 1097-1105, 2012.
[2] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other
neural network architectures. Neural Networks, 18(5):602-610, 2005.
[3] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.
Natural language processing (almost) from scratch. JMLR, 12:2493-2537, 2011.
[4] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[5] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to
human-level performance in face verification. In CVPR, pages 1701-1708. IEEE, 2014.
[6] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. Deep learning with
cots hpc systems. In 30th ICML, pages 1337-1345, 2013.
[7] Mark Horowitz. Energy table for 45nm process, Stanford VLSI wiki.
[8] JP Rauschecker. Neuronal mechanisms of developmental plasticity in the cat's visual system. Human
neurobiology, 3(2):109-114, 1983.
[9] Christopher A Walsh. Peter huttenlocher (1931-2013). Nature, 502(7470):172-172, 2013.
[10] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning.
In Advances in Neural Information Processing Systems, pages 2148-2156, 2013.
[11] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus.
In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[12] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure
within convolutional networks for efficient evaluation. In NIPS, pages 1269-1277, 2014.
[13] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks
using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[14] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with
pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[15] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint
arXiv:1409.4842, 2014.
[17] Stephen José Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with
back-propagation. In Advances in neural information processing systems, pages 177-185, 1989.
[18] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information
Processing Systems, pages 598-605. Morgan Kaufmann, 1990.
[19] Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon.
Advances in neural information processing systems, pages 164-164, 1993.
[20] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural
networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.
[21] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash
kernels for structured data. The Journal of Machine Learning Research, 10:2615-2637, 2009.
[22] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing
for large scale multitask learning. In ICML, pages 1113-1120. ACM, 2009.
[23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout:
A simple way to prevent neural networks from overfitting. JMLR, 15:1929-1958, 2014.
[24] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural
networks? In Advances in Neural Information Processing Systems, pages 3320-3328, 2014.
[25] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient
descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157-166, 1994.
[26] Yangqing Jia, et al. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014.
[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556, 2014.
[28] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv
preprint arXiv:1507.06149, 2015.
[29] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang.
Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.
[30] Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint
arXiv:1412.1442, 2014.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Learning Efficient Convolutional Networks through Network Slimming
Abstract
The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20x reduction in model size and a 5x reduction in computing operations.
1. Introduction
In recent years, convolutional neural networks (CNNs) have become the dominant approach for a variety of computer vision tasks, e.g., image classification [22], object detection [8], semantic segmentation [26]. Large-scale datasets, high-end modern GPUs and new network architectures allow the development of unprecedentedly large CNN models. For instance, from AlexNet [22], VGGNet [31] and GoogleNet [34] to ResNets [14], the ImageNet Classification Challenge winner models have evolved from 8 layers to more than 100 layers.
This work was done when Zhuang Liu and Zhiqiang Shen were interns at Intel Labs China. Jianguo Li is the corresponding author.
However, larger CNNs, although with stronger representation power, are more resource-hungry. For instance, a 152-layer ResNet [14] has more than 60 million parameters and requires more than 20 Giga floating-point operations (FLOPs) when inferencing an image with resolution 224x224. This is unlikely to be affordable on resource-constrained platforms such as mobile devices, wearables or Internet of Things (IoT) devices.
The deployment of CNNs in real world applications is mostly constrained by:
1) Model size: CNNs' strong representation power comes from their millions of trainable parameters. Those parameters, along with network structure information, need to be stored on disk and loaded into memory during inference time. As an example, storing a typical CNN trained on ImageNet consumes more than 300MB space, which is a big resource burden to embedded devices.
2) Run-time memory: During inference time, the intermediate activations/responses of CNNs could even take more memory space than storing the model parameters, even with batch size 1. This is not a problem for high-end GPUs, but unaffordable for many applications with low computational power.
3) Number of computing operations: The convolution operations are computationally intensive on high resolution images. A large CNN may take several minutes to process one single image on a mobile device, making it unrealistic to be adopted for real applications.
Many works have been proposed to compress large CNNs or directly learn more efficient CNN models for fast inference. These include low-rank approximation [7], network quantization [3, 12] and binarization [28, 6], weight pruning [12], dynamic inference [16], etc. However, most of these methods can only address one or two challenges mentioned above. Moreover, some of the techniques require specially designed software/hardware accelerators for execution speedup [28, 6, 12].
Another direction to reduce the resource consumption of large CNNs is to sparsify the network. Sparsity can be imposed on different levels of structure [2, 37, 35, 29, 25], which yields considerable model-size compression and inference speedup. However, these approaches generally require special software/hardware accelerators to harvest the gain in memory or time savings, though this is easier than with the non-structured sparse weight matrices in [12].
<<FIGURE>>
Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then fine-tuned to achieve accuracy comparable to (or even higher than) that of the normally trained full network.
In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates the channel-level pruning in the following step. The additional regularization term rarely hurts the performance. In fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated by the subsequent fine-tuning of the pruned network. After pruning, the resulting narrower network is much more compact in terms of model size, run-time memory, and computing operations compared to the initial wide network. The above process can be repeated several times, yielding a multi-pass network slimming scheme which leads to an even more compact network.
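The idea can be sketched in a few lines. This is an illustration rather than the authors' code: it assumes the BN scaling factors of each layer are available as plain lists, and that channels are selected with one global threshold set by a target pruning fraction. The selection rule, function names and numeric values are assumptions made for this example.

```python
def l1_penalty(gammas_per_layer, lam):
    """Sparsity term added to the training loss: lam * sum of |gamma|."""
    return lam * sum(abs(g) for layer in gammas_per_layer for g in layer)

def select_channels(gammas_per_layer, prune_fraction):
    """Return, per layer, the indices of channels to keep.

    Assumed rule: sort all |gamma| values across layers and prune the
    smallest `prune_fraction` of them using a single global threshold.
    """
    all_g = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    k = min(int(len(all_g) * prune_fraction), len(all_g) - 1)
    threshold = all_g[k]
    return [[c for c, g in enumerate(layer) if abs(g) >= threshold]
            for layer in gammas_per_layer]

# Two BN layers with four channels each (illustrative values).
gammas = [[0.9, 0.01, 0.5, 0.02],
          [0.03, 0.7, 0.001, 0.4]]
penalty = l1_penalty(gammas, lam=1e-4)
keep = select_channels(gammas, prune_fraction=0.5)
```

The kept indices would then be used to slice the corresponding convolution filters and BN parameters, giving a narrower dense network that needs no sparse storage format.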
Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20x model-size compression and 5x reduction in computing operations compared with the original ones, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storing format or computing operations.
2. Related Work
In this section, we discuss related work from five aspects.
Low-rank Decomposition approximates the weight matrices in neural networks with low-rank matrices using techniques like Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yielding 3x model-size compression, however without notable speed acceleration, since computing operations in CNNs mainly come from convolutional layers.
Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups and within each group the weight value is shared. In this way only the shared weights and hash indices need to be stored, thus a large amount of storage space could be saved. [12] uses an improved quantization technique in a deep compression pipeline and achieves 35x to 49x compression rates on AlexNet and VGGNet. However, these techniques can neither save run-time memory nor inference time, since during inference shared weights need to be restored to their original positions.
[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {-1, 1} or {-1, 0, 1}). This yields a large amount of model-size saving, and significant speedup could also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation method usually comes with a moderate accuracy loss.
Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neural networks. The resulting network's weights are mostly zeros, thus the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) instead of the weights. In [12], there is no guidance for sparsity during training. [32] overcomes this limitation by explicitly imposing a sparse constraint over each weight with additional gate variables, and achieves high compression rates by pruning connections with zero gate values. This method achieves a better compression rate than [12], but suffers from the same drawback.
Structured Pruning / Sparsifying. Recently, [23] proposes to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by randomly deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to a smoother channel pruning process and little accuracy loss.
[37] imposes neuron-level sparsity during training, thus some neurons could be pruned to obtain compact networks. [35] proposes a Structured Sparsity Learning (SSL) method to sparsify different levels of structure (e.g. filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, thus the optimization objective is much simpler.
Since these methods prune or sparsify part of the network structures (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries (e.g. for sparse computing operation) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits.
Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces sub-modular/super-modular optimization for network architecture search with a given resource budget. Some recent works [38, 1] propose to learn neural architecture automatically with reinforcement learning. The search space of these methods is extremely large, thus one needs to train hundreds of models to distinguish good from bad ones. Network slimming can also be treated as an approach for architecture learning, although the choices are limited to the width of each layer. However, in contrast to the aforementioned methods, network slimming learns network architecture through only a single training process, which is in line with our goal of efficiency.
3. Network slimming
We aim to provide a simple scheme to achieve channel-level sparsity in deep CNNs. In this section, we first discuss the advantages and challenges of channel-level sparsity, and introduce how we leverage the scaling layers in batch normalization to effectively identify and prune unimportant channels in the network.
Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at different levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained (e.g., weight-level) sparsity gives the highest flexibility, and its generality leads to a higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. In contrast, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, but it is less flexible, as whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNN or fully-connected network (treating each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, on which inference can be run efficiently on conventional CNN platforms.
Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near-zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of 10% in the number of parameters without suffering from accuracy loss. [35] addresses this problem by enforcing sparsity regularization into the training objective. Specifically, they adopt group LASSO to push all the filter weights corresponding to the same channel towards zero simultaneously during training. However, this approach requires computing the gradients of the additional regularization term with respect to all the filter weights, which is non-trivial. We introduce a simple idea to address the above challenges, and the details are presented below.
Scaling Factors and Sparsity-induced Penalty. Our idea is to introduce a scaling factor γ for each channel, which is multiplied with the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally, we prune those channels with small factors and fine-tune the pruned network. Specifically, the training objective of our approach is given by
<<FORMULA>> (1)
where <<FORMULA>> denote the training input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, <<FORMULA>> is a sparsity-induced penalty on the scaling factors, and <<FORMULA>> balances the two terms. In our experiment, we choose <<FORMULA>>, which is known as the L1-norm and is widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the non-smooth L1 penalty term. An alternative option is to replace the L1 penalty with the smooth-L1 penalty [30] to avoid using the sub-gradient at the non-smooth point.
<<FIGURE>>
Figure 2: Flow-chart of the network slimming procedure. The dotted line is for the multi-pass/iterative scheme.
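For reference, the objective described around Equation (1) can be written out as below. This is a sketch assembled from the definitions in the text: the symbols f (the network), l (the per-sample loss), Γ (the set of all scaling factors) and the sign convention at zero are notation assumed here, not copied from the original equation.

```latex
L \;=\; \sum_{(x,\,y)} l\big(f(x, W),\, y\big) \;+\; \lambda \sum_{\gamma \in \Gamma} g(\gamma),
\qquad g(\gamma) = |\gamma|

% One valid subgradient of the penalty with respect to a single scaling factor,
% taking sign(0) = 0 at the non-smooth point:
\lambda\, \mathrm{sign}(\gamma) \;\in\; \partial_{\gamma}\big(\lambda\, |\gamma|\big)
```

Under this reading, each subgradient step adds a constant-magnitude pull of size λ towards zero on every nonzero scaling factor, independent of how large that factor currently is.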
As pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel, we can directly obtain a narrow network (see Figure 1) without resorting to any special sparse computation packages. The scaling factors act as the agents for channel selection. As they are jointly optimized with the network weights, the network can automatically identify insignificant channels, which can be safely removed without greatly affecting the generalization performance.
Leveraging the Scaling Factors in BN Layers. Batch normalization [19] has been adopted by most modern CNNs as a standard approach to achieve fast convergence and better generalization performance. The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. In particular, a BN layer normalizes the internal activations using mini-batch statistics. Let z_in and z_out be the input and output of a BN layer, and let B denote the current mini-batch; the BN layer performs the following transformation:
<<FORMULA>>
where <<FORMULA>> and <<FORMULA>> are the mean and standard deviation values of input activations over <<FORMULA>> and <<FORMULA>> are trainable affine transformation parameters (scale and shift), which provide the possibility of linearly transforming normalized activations back to any scale.
It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. This has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way to learn meaningful scaling factors for channel pruning. (1) If we add scaling layers to a CNN without BN layers, the values of the scaling factors are not meaningful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations: one can obtain the same results by decreasing the scaling factor values while amplifying the weights in the convolution layers. (2) If we insert a scaling layer before a BN layer, the scaling effect of the scaling layer will be completely canceled by the normalization process in BN. (3) If we insert a scaling layer after a BN layer, there are two consecutive scaling factors for each channel.
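The second point above (a scaling layer placed before BN is normalized away) can be checked numerically. The following is a minimal pure-Python sketch for a single channel; the function name `batch_norm`, the sample activations and the factor 0.01 are all illustrative assumptions, and `eps` is set to zero so that the cancellation is exact up to floating-point rounding.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel over a mini-batch, then apply scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical activations of a single channel over a mini-batch of four.
z_in = [0.5, -1.0, 2.0, 3.5]

# A scaling layer placed *before* BN: its factor is removed by normalization.
unscaled = batch_norm(z_in, eps=0.0)
scaled_before = batch_norm([0.01 * v for v in z_in], eps=0.0)
max_diff_before = max(abs(a - b) for a, b in zip(scaled_before, unscaled))

# The BN scale parameter itself does control the output magnitude.
shrunk = batch_norm(z_in, gamma=0.01, eps=0.0)
max_abs_unscaled = max(abs(v) for v in unscaled)
max_abs_shrunk = max(abs(v) for v in shrunk)
```

Here `max_diff_before` is zero up to rounding error, while `max_abs_shrunk` is one hundredth of `max_abs_unscaled`, which is why the BN scale parameter is a usable indicator of channel importance and an extra scaling layer in front of BN is not.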
Channel Pruning and Fine-tuning. After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero (see Figure 1). Then we can prune channels with near-zero scaling factors by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune the 70% of channels with the lowest scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with fewer parameters, less run-time memory, and fewer computing operations.
Pruning may temporarily lead to some accuracy loss when the pruning ratio is high, but this can be largely compensated by the subsequent fine-tuning process on the pruned network. In our experiments, the fine-tuned narrow network can even achieve higher accuracy than the original unpruned network in many cases.
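The global percentile threshold described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: scaling factors are given as plain lists per layer, the function names are hypothetical, and factors tied with the threshold are pruned as well.

```python
def global_threshold(gammas_per_layer, prune_ratio):
    """Magnitude at or below which roughly `prune_ratio` of all channels fall."""
    all_gammas = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    num_pruned = round(len(all_gammas) * prune_ratio)
    if num_pruned == 0:
        return -1.0  # below every magnitude, so nothing is pruned
    return all_gammas[num_pruned - 1]

def keep_masks(gammas_per_layer, prune_ratio):
    """Per-layer boolean masks: True means the channel survives pruning."""
    threshold = global_threshold(gammas_per_layer, prune_ratio)
    return [[abs(g) > threshold for g in layer] for layer in gammas_per_layer]

# Hypothetical scaling factors for a two-layer network (4 and 6 channels).
gammas = [[0.9, 0.01, 0.5, 0.02],
          [0.03, 0.7, 0.001, 0.4, 0.05, 0.6]]

threshold = global_threshold(gammas, 0.7)   # 7 of the 10 channels are pruned
masks = keep_masks(gammas, 0.7)
kept_per_layer = [sum(mask) for mask in masks]
```

Because the threshold is shared across layers, the number of surviving channels can differ from layer to layer (here one channel in the first layer and two in the second), which is what allows the procedure to reshape the network width non-uniformly.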
Multi-pass Scheme. We can also extend the proposed method from a single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. Specifically, a network slimming procedure results in a narrow network, on which we could again apply the whole training procedure to learn an even more compact model. This is illustrated by the dotted line in Figure 2. Experimental results show that this multi-pass scheme can lead to even better results in terms of compression rate.
Handling Cross Layer Connections and Pre-activation Structure. The network slimming process introduced above can be directly applied to most plain CNN architectures such as AlexNet [22] and VGGNet [31]. However, some adaptations are required when it is applied to modern networks with cross layer connections and the pre-activation design, such as ResNet [15] and DenseNet [17]. For these networks, the output of a layer may be treated as the input of multiple subsequent layers, in which a BN layer is placed before the convolutional layer. In this case, the sparsity is achieved at the incoming end of a layer, i.e., the layer selectively uses a subset of the channels it receives. To harvest the parameter and computation savings at test time, we need to place a channel selection layer to mask out the insignificant channels we have identified.
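One way to picture the channel selection layer is as an index-select over the channel dimension, so that each consuming layer sees only the channels it actually uses. The sketch below is a simplified assumption about how such a layer could work (feature maps as plain lists, hypothetical function names); the text does not spell out the implementation.

```python
def indices_from_gammas(gammas, threshold):
    """Indices of channels whose scaling-factor magnitude exceeds the threshold."""
    return [i for i, g in enumerate(gammas) if abs(g) > threshold]

def channel_selection(feature_maps, keep_indices):
    """Pass through only the channels listed in `keep_indices`."""
    return [feature_maps[i] for i in keep_indices]

# Hypothetical BN scaling factors at the input of one layer (4 incoming channels).
gammas = [0.8, 0.001, 0.3, 0.0005]
keep = indices_from_gammas(gammas, threshold=0.01)

# Hypothetical incoming feature maps, one short list per channel.
incoming = [[1, 2], [3, 4], [5, 6], [7, 8]]
selected = channel_selection(incoming, keep)
```

Since the shared incoming tensor is left untouched, other layers that consume the same output can keep a different subset of its channels.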
4. Experiments
We empirically demonstrate the effectiveness of network slimming on several benchmark datasets. We implement our method based on the publicly available Torch [5] implementation for ResNets by [10]. The code is available at https://github.com/liuzhuang13/slimming.
<<TABLE>>
Table 1: Results on CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column 1, "60% pruned" denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratios of parameters and FLOPs are also shown in columns 4 and 6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy could typically be maintained with ≥ 60% channels pruned.
4.1. Datasets
CIFAR. The two CIFAR datasets [21] consist of natural images with resolution 32×32. CIFAR-10 is drawn from 10 classes and CIFAR-100 from 100 classes. The training and test sets contain 50,000 and 10,000 images, respectively. On CIFAR-10, a validation set of 5,000 images is split from the training set for the search of λ (in Equation 1) on each model. We report the final test errors after training or fine-tuning on all training images. A standard data augmentation scheme (shifting/mirroring) [14, 18, 24] is adopted. The input data is normalized using channel means and standard deviations. We also compare our method with [23] on CIFAR datasets.
SVHN. The Street View House Number (SVHN) dataset [27] consists of 32×32 colored digit images. Following common practice [9, 18, 24], we use all the 604,388 training images, from which we split a validation set of 6,000 images for model selection during training. The test set contains 26,032 images. During training, we select the model with the lowest validation error as the model to be pruned (or the baseline model). We also report the test errors of the models with the lowest validation errors during fine-tuning.
ImageNet. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. We adopt the data augmentation scheme as in [10]. We report the single-center-crop validation error of the final model.
MNIST. MNIST is a handwritten digit dataset containing 60,000 training images and 10,000 test images. To test the effectiveness of our method on a fully-connected network (treating each neuron as a channel with 1×1 spatial size), we compare our method with [35] on this dataset.
4.2. Network Models
On CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. The VGGNet was originally designed for ImageNet classification. For our experiment, a variation of the original VGGNet for the CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).
On the ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) VGG-A network [31] model with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1×1 spatial size.
On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].
4.3. Training, Pruning and Fine-tuning
Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On CIFAR and SVHN datasets, we train using mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet and MNIST datasets, we train our models for 60 and 30 epochs, respectively, with a batch size of 256 and an initial learning rate of 0.1, which is divided by 10 after 1/3 and 2/3 of the training epochs. We use a weight decay of 10^-4 and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to 1) from [10].
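The step schedule described above (initial rate 0.1, divided by 10 at fixed fractions of training) can be written as a small helper. This is a sketch: the function name is hypothetical, and whether the drop takes effect exactly at the milestone epoch or one epoch later is an assumption made here (epochs are 0-indexed and the drop applies from the milestone epoch onward).

```python
def step_learning_rate(epoch, total_epochs, base_lr=0.1, milestones=(0.5, 0.75)):
    """Learning rate for a 0-indexed epoch: divide by 10 at each milestone fraction."""
    drops = sum(1 for m in milestones if epoch >= m * total_epochs)
    return base_lr / (10 ** drops)

# CIFAR setting from the text: 160 epochs, drops at 50% and 75%.
cifar_rates = [step_learning_rate(e, 160) for e in (0, 79, 80, 119, 120, 159)]

# ImageNet setting from the text: 60 epochs, drops after 1/3 and 2/3 of training.
imagenet_rates = [step_learning_rate(e, 60, milestones=(1 / 3, 2 / 3))
                  for e in (0, 19, 21, 39, 41, 59)]
```

For CIFAR this yields a rate of 0.1 for epochs 0 to 79, 0.01 for epochs 80 to 119, and 0.001 for epochs 120 to 159.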
Training with Sparsity. For CIFAR and SVHN datasets, when training with channel sparse regularization, the hyperparameter λ, which controls the tradeoff between empirical loss and sparsity, is determined by a grid search over 10^-3, 10^-4, 10^-5 on the CIFAR-10 validation set. For VGGNet we choose λ = 10^-4, and for ResNet and DenseNet λ = 10^-5. For VGG-A on ImageNet, we set λ = 10^-5. All other settings are kept the same as in normal training.
Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of channels are pruned. The pruning process is implemented by building a new, narrower model and copying the corresponding weights from the model trained with sparsity.
Fine-tuning. After the pruning we obtain a narrower and more compact model, which is then fine-tuned. On CIFAR, SVHN and MNIST datasets, the fine-tuning uses the same optimization setting as in training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10^-3 for only 5 epochs.
<<FIGURE>>
Figure 3: Comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are parameter and FLOP ratios between pruned and original models.
4.4. Results
CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.
Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥ 60% channels pruned while still maintaining similar accuracy to the baseline. The parameter saving can be up to 10×. The FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as a channel selector. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.
Regularization Effect. From Table 1, we can observe that, on ResNet and DenseNet, typically when 40% of channels are pruned, the fine-tuned network can achieve a lower test error than the original models. For example, DenseNet-40 with 40% of channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.
<<TABLE>>
Table 3: Results on MNIST.
ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve the savings with no accuracy loss on the 1000-class ImageNet dataset, where other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.
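The channel counts quoted above can be checked with a few lines of arithmetic; only the numbers stated in the text are used, and the variable names are illustrative.

```python
# Counts quoted in the text for the pruned VGG-A on ImageNet.
conv_pruned, conv_total = 378, 2752
fc_pruned, fc_total = 5094, 8192

conv_fraction = conv_pruned / conv_total    # roughly 0.137
fc_fraction = fc_pruned / fc_total          # roughly 0.622
overall_fraction = (conv_pruned + fc_pruned) / (conv_total + fc_total)
```

The overall fraction comes out at exactly one half (5472 of 10944), matching the 50% pruning ratio, while under 14% of the convolutional channels are removed against about 62% of the fully-connected neurons. This imbalance is consistent with the large parameter saving and the comparatively small FLOP saving.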
MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well in pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, thus we prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.
We provide some additional experimental results in the supplementary materials, including (1) detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; (3) comparison with a previous channel pruning method [23].
4.5. Results for Multi-pass Scheme
We employ the multi-pass scheme on CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer will completely destroy the models. Thus, besides setting the percentile threshold as 50%, we also put a constraint that at each layer, at most 50% of channels can be pruned.
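The per-layer constraint can be combined with a global threshold as in the sketch below. This is an assumed reading of the constraint rather than the authors' code: within each layer, the channels at or below the threshold are pruned smallest-first until the per-layer cap is reached. The function name and the sample factors are hypothetical.

```python
def capped_keep_masks(gammas_per_layer, threshold, max_layer_ratio=0.5):
    """Prune channels at or below `threshold`, but at most `max_layer_ratio` per layer."""
    masks = []
    for layer in gammas_per_layer:
        max_pruned = int(len(layer) * max_layer_ratio)
        # Candidate channels for pruning, smallest magnitude first.
        order = sorted(range(len(layer)), key=lambda i: abs(layer[i]))
        below = [i for i in order if abs(layer[i]) <= threshold]
        pruned = set(below[:max_pruned])
        masks.append([i not in pruned for i in range(len(layer))])
    return masks

# Hypothetical factors: the first layer has three channels under the threshold.
gammas = [[0.01, 0.02, 0.03, 0.9],
          [0.5, 0.6, 0.7, 0.8]]
masks = capped_keep_masks(gammas, threshold=0.05)
```

In this example the first layer would lose three of its four channels under the threshold alone, but the cap keeps two of them, so no layer can be removed entirely.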
The test errors of models in each iteration are shown in Table 4. As the pruning process goes on, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves 20× parameter reduction and 5× FLOP reduction, while still achieving a lower test error. On CIFAR-100, after iteration 3, the test error begins to increase. This is possibly because it contains more classes than CIFAR-10, so pruning channels too aggressively will inevitably hurt the performance. However, we can still prune nearly 90% of parameters and nearly 70% of FLOPs without notable accuracy loss.
<<TABLE>>
Table 4: Results for multi-pass scheme on CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. Trained and Fine-tuned columns denote the test errors (%) of the model trained with sparsity, and the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.
5. Analysis
There are two crucial hyper-parameters in network slimming, the pruned percentage t and the coefficient of the sparsity regularization term λ (see Equation 1). In this section, we analyze their effects in more detail.
Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ = 10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.
From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate for the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on channel scaling factors.
<<FIGURE>>
Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). With the increase of λ, scaling factors become sparser.
<<FIGURE>>
Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ = 10^-5.
Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter <<FORMULA>> in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network with different λ values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.
It can be observed that with the increase of λ, the scaling factors are more and more concentrated near zero. When λ = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as a feature selection happening in intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process by a heatmap. Figure 6 shows the magnitude of scaling factors from one layer in VGGNet, along the training process. Each channel starts with equal weights; as the training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker).
<<FIGURE>>
Figure 6: Visualization of channel scaling factors' change in scale along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the selected channels, the dark lines indicate channels that can be pruned.
6. Conclusion
We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20×) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory, and computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for efficient inference.
Acknowledgements.
Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No.20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008/DFG TRR-169.
References
[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar-torch.
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135-1143, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630-645. Springer, 2016.
[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.
[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and super-modularity. arXiv preprint arXiv:1609.00074, 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. In Tech Report, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806-814, 2015.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.
[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286-297, 2007.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch.
[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Learning Structured Sparsity in Deep Neural Networks
Wei Wen Chunpeng Wu Yandan Wang
University of Pittsburgh University of Pittsburgh University of Pittsburgh
wew57@pitt.edu chw127@pitt.edu yaw46@pitt.edu
Yiran Chen Hai Li
University of Pittsburgh University of Pittsburgh
yic52@pitt.edu hal66@pitt.edu
Abstract
High demand for computation resources severely hinders deployment of large-scale
Deep Neural Networks (DNN) in resource constrained devices. In this work, we
propose a Structured Sparsity Learning (SSL) method to regularize the structures
(i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1)
learn a compact structure from a bigger DNN to reduce computation cost; (2)
obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate
the DNN's evaluation. Experimental results show that SSL achieves on average
5.1× and 3.1× speedups of convolutional layer computation of AlexNet against
CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about
twice the speedups of non-structured sparsity; (3) regularize the DNN structure to
improve classification accuracy. The results show that for CIFAR-10, regularization
on layer depth can reduce 20 layers of a Deep Residual Network (ResNet) to
18 layers while improving the accuracy from 91.25% to 92.60%, which is still
slightly higher than that of the original ResNet with 32 layers. For AlexNet, structure
regularization by SSL also reduces the error by around 1%. Our source code can be
found at https://github.com/wenwei202/caffe/tree/scnn
1 Introduction
Deep neural networks (DNN), especially deep convolutional neural networks (CNN), have achieved
remarkable success in visual tasks [1][2][3][4][5] by leveraging large-scale networks learning from
a huge volume of data. Deployment of such big models, however, is computation-intensive and
memory-intensive. To reduce computation cost, many studies have been performed to compress the scale of
DNN, including sparsity regularization [6], connection pruning [7][8] and low rank approximation
[9][10][11][12][13]. Sparsity regularization and connection pruning approaches, however, often produce
non-structured random connectivity in DNN and thus, irregular memory access that adversely
impacts practical acceleration in hardware platforms. Figure 1 depicts practical speedup of each
layer of an AlexNet, which is non-structurally sparsified by l1-norm. Compared to the original model,
the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality
associated with the scattered weight distribution, the achieved speedups are either very limited or
negative even when the actual sparsity is high, say, >95%. We define sparsity as the ratio of zeros in this
paper. In recently proposed low rank approximation approaches, the DNN is trained first and then
each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally,
fine-tuning is performed to restore the model accuracy. Low rank approximation is able to achieve
practical speedups because it coordinates model parameters in dense matrices and avoids the locality
problem of non-structured sparsity regularization. However, low rank approximation can only obtain
the compact structure within each layer, and the structures of the layers are fixed during fine-tuning
such that costly reiterations of decomposing and fine-tuning are required to find an optimal weight
approximation for performance speedup and accuracy retaining.
<<FIGURE>>
Figure 1: Evaluation speedups of AlexNet on GPU platforms and the sparsity. conv1 refers to
convolutional layer 1, and so forth. Baseline is profiled by GEMM of cuBLAS. The sparse matrices
are stored in the format of Compressed Sparse Row (CSR) and accelerated by cuSPARSE.
Inspired by the facts that (1) there is redundancy across filters and channels [11]; (2) shapes of
filters are usually fixed as cuboid but enabling arbitrary shapes can potentially eliminate unnecessary
computation imposed by this fixation; and (3) depth of the network is critical for classification
but deeper layers cannot always guarantee a lower error because of the exploding gradients and
degradation problem [5], we propose the Structured Sparsity Learning (SSL) method to directly learn
a compressed structure of deep CNNs by group Lasso regularization during training. SSL is a
generic regularization to adaptively adjust multiple structures in a DNN, including structures of filters,
channels, and filter shapes within each layer, and structure of depth beyond the layers. SSL combines
structure regularization (on DNN for classification accuracy) with locality optimization (on memory
access for computation efficiency), offering not only well-regularized big models with improved
accuracy but also greatly accelerated computation (e.g., 5.1× on CPU and 3.1× on GPU for AlexNet).
2 Related works
Connection pruning and weight sparsifying. Han et al. [7][8] reduced the number of parameters of
AlexNet by 9× and VGG-16 by 13× using connection pruning. Since most of the reduction is achieved
on fully-connected layers, the authors obtained 3× to 4× layer-wise speedup for fully-connected
layers. However, no practical speedups of convolutional layers are observed because of the issue
shown in Figure 1. As convolution is the computational bottleneck and many new DNNs use fewer
fully-connected layers, e.g., only 3.99% of the parameters of ResNet-152 in [5] are from fully-connected
layers, compression and acceleration of convolutional layers become essential. Liu et al. [6] achieved
>90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue
shown in Figure 1 by hardcoding the sparse weights into the program, achieving layer-wise 4.59×
speedup on a CPU. In this work, we also focus on convolutional layers. Compared to the above
techniques, our SSL method can coordinate sparse weights in adjacent memory space and achieve
higher speedups with the same accuracy. Note that hardware and program optimizations can further
boost the system performance on top of SSL but are not covered in this work.
Low rank approximation. Denil et al. [9] predicted 95% of the parameters in a DNN by exploiting the
redundancy across filters and channels. Inspired by it, Jaderberg et al. [11] achieved 4.5× speedup
on CPUs for scene text character recognition and Denton et al. [10] achieved 2× speedups on both
CPUs and GPUs for the first two layers. Both works used Low Rank Approximation (LRA)
with ∼1% accuracy drop. [13][12] improved and extended LRA to larger DNNs. However, the
network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning,
and cross-validating are still needed to find an optimal structure for accuracy and speed trade-off.
As the number of hyper-parameters in the LRA method increases linearly with layer depth [10][13], the
search space increases linearly or even polynomially for very deep DNNs. Compared to LRA, our
contributions are: (1) SSL can dynamically optimize the compactness of the DNN structure with only
one hyper-parameter and no reiterations; (2) besides the redundancy within the layers, SSL also
exploits the necessity of deep layers and reduces them; (3) DNN filters regularized by SSL have lower
rank approximation, so SSL can work together with LRA for more efficient model compression.
Model structure learning. Group Lasso [14] is an efficient regularization to learn sparse structures.
Kim et al. [15] used group Lasso to regularize the structure of a correlation tree for multi-task regression
problems and reduced prediction errors. Liu et al. [6] utilized group Lasso to constrain the scale
<<FORMULA>>
<<FORMULA>>
Figure 2: The proposed structured sparsity learning (SSL) for DNNs. Weights in filters are split
into multiple groups. Through group Lasso regularization, a more compact DNN is obtained by
removing some groups. The figure illustrates the filter-wise, channel-wise, shape-wise, and depth-wise
structured sparsity that were explored in the work.
<<FORMULA>>
of the structure of LRA. To adapt DNN structure to different databases, Feng et al. [16] learned
the appropriate number of filters in DNN. Different from these prior arts, we apply group Lasso to
regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). Our source code
can be found at https://github.com/wenwei202/caffe/tree/scnn.
3 Structured Sparsity Learning Method for DNNs
We focus mainly on Structured Sparsity Learning (SSL) on convolutional layers to regularize the
structure of DNNs. We first propose a generic method to regularize structures of DNN in Section 3.1,
and then specify the method to structures of filters, channels, filter shapes and depth in Section 3.2.
Variants of formulations are also discussed from computational efficiency viewpoint in Section 3.3.
3.1 Proposed structured sparsity learning for generic structures
Suppose weights of convolutional layers in a DNN form a sequence of 4-D tensors
<<FORMULA>>, where <<FORMULA>> and <<FORMULA>> are the dimensions of the l-th
weight tensor along the axes of filter, channel, spatial height and spatial width, respectively.
L denotes the number of convolutional layers. Then the proposed generic optimization target of a DNN with
structured sparsity regularization can be formulated as:
<<FORMULA>> (1)
Here W represents the collection of all weights in the DNN; <<FORMULA>> is the loss on data; <<FORMULA>> is
non-structured regularization applied to every weight, e.g., ℓ2-norm; and <<FORMULA>> is the structured
sparsity regularization on each layer. Because group Lasso can effectively zero out all weights in
some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights
w can be represented as <<FORMULA>>, where <<FORMULA>> is a group of partial weights in w
and G is the total number of groups. Different groups may overlap. Here <<FORMULA>>, where
<<FORMULA>> is the number of weights in <<FORMULA>>.
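The group Lasso term described above can be sketched numerically. This is a minimal illustration, assuming the standard form from [14] in which the regularizer is the sum over groups of the ℓ2 norm of each group; the function name `group_lasso` and the toy weights are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def group_lasso(w, groups):
    """Sum of l2 norms over groups; `groups` is a list of index arrays into w."""
    return float(sum(np.sqrt(np.sum(w[idx] ** 2)) for idx in groups))

# Toy weight vector with three groups; the last two share index 3,
# since the text notes that different groups may overlap.
w = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([3, 4])]

penalty = group_lasso(w, groups)  # norms are 5, 0 and 1
```

A group contributes nothing to the penalty only when all of its weights are zero, which is why minimizing this term tends to remove whole groups rather than scattered individual weights.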
3.2 Structured sparsity learning for structures of filters, channels, filter shapes and depth
In SSL, the learned “structure” is decided by the way of splitting groups of w^(g). We investigate and
formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity in Figure 2.
For simplicity, the <<FORMULA>> term of Eq. (1) is omitted in the following formulation expressions.
Penalizing unimportant filters and channels. Suppose <<FORMULA>> is the n_l-th filter and <<FORMULA>> is the
c_l-th channel of all filters in the l-th layer. The optimization target of learning the filter-wise and
channel-wise structured sparsity can be defined as
<<FORMULA>> (2)
As indicated in Eq. (2), our approach tends to remove less important filters and channels. Note
that zeroing out a filter in the l-th layer results in a dummy zero output feature map, which in turn
makes a corresponding channel in the (l+ 1)-th layer useless. Hence, we combine the filter-wise and
channel-wise structured sparsity in the learning simultaneously.
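As a rough sketch of the two penalties in Eq. (2), the snippet below groups a 4-D weight tensor once per filter and once per input channel. The axis order (filter, channel, height, width) follows Section 3.1; the helper names and the toy tensor are assumptions made for illustration only.

```python
import numpy as np

def filter_wise_penalty(W):
    """Group Lasso with one group per filter W[n, :, :, :]."""
    return float(np.sqrt((W ** 2).sum(axis=(1, 2, 3))).sum())

def channel_wise_penalty(W):
    """Group Lasso with one group per input channel W[:, c, :, :]."""
    return float(np.sqrt((W ** 2).sum(axis=(0, 2, 3))).sum())

def zero_groups(W, axes):
    """Indices of groups whose weights are all exactly zero."""
    return np.flatnonzero(np.abs(W).sum(axis=axes) == 0)

# Toy layer: 4 filters, 3 input channels, 5x5 kernels.
W = np.ones((4, 3, 5, 5))
W[1] = 0.0        # pretend training zeroed out filter 1
W[:, 2] = 0.0     # and input channel 2

removable_filters = zero_groups(W, (1, 2, 3))
removable_channels = zero_groups(W, (0, 2, 3))
```

A zeroed filter in layer l yields an all-zero output feature map, so the matching input channel of layer l+1 can be dropped as well, which is the coupling the paragraph above describes.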
Learning arbitrary shapes of filters. As illustrated in Figure 2, <<FORMULA>> denotes the vector of
all corresponding weights located at spatial position of <<FORMULA>> in the 2D filters across the c_l-th
channel. Thus, we define $W^{(l)}_{:,c_l,m_l,k_l}$ as the shape fiber related to learning arbitrary filter shape because a
homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The
optimization target of learning shapes of filters becomes:
<<FORMULA>> (3)
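The shape-wise grouping can be illustrated the same way. In this sketch each group is one shape fiber, i.e. the weights at a fixed (channel, row, column) position collected across all filters; the function name and the toy tensor are hypothetical.

```python
import numpy as np

def shape_wise_penalty(W):
    """Group Lasso with one group per shape fiber W[:, c, m, k]."""
    return float(np.sqrt((W ** 2).sum(axis=0)).sum())

# Toy layer: 2 filters, 1 channel, 3x3 kernels.
W = np.ones((2, 1, 3, 3))
W[:, 0, 0, 0] = 0.0   # zero the top-left fiber in every filter
W[:, 0, 2, 2] = 0.0   # zero the bottom-right fiber in every filter

# Positions that survive in all filters: a shared, non-rectangular shape.
shape_mask = np.any(W != 0, axis=0)[0]
penalty = shape_wise_penalty(W)
```

Because a fiber spans all filters, zeroing it removes the same spatial position everywhere, so the surviving positions form one homogeneous shape for the whole layer.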
Regularizing layer depth. We also explore the depth-wise sparsity to regularize the depth of DNNs
in order to improve accuracy and reduce computation cost. The corresponding optimization target is
Different from other discussed sparsification techniques,
zeroing out all the filters in a layer will cut off the message propagation in the DNN so that the output
neurons cannot perform any classification. Inspired by the structure of highway networks [17] and
deep residual networks [5], we propose to leverage the shortcuts across layers to solve this issue. As
illustrated in Figure 2, even when SSL removes an entire unimportant layer, feature maps will still
be forwarded through the shortcut.
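The role of the shortcut can be seen in a toy example. The sketch below uses a fully-connected stand-in for a convolutional block (an assumption made to keep the example short) and compares a plain layer with a residual one when all of the layer's weights have been zeroed.

```python
import numpy as np

def plain_block(x, W):
    """Layer without a shortcut: the output depends only on W @ x."""
    return np.maximum(W @ x, 0.0)

def residual_block(x, W):
    """Layer with an identity shortcut: x is added to the layer output."""
    return x + np.maximum(W @ x, 0.0)

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))          # a layer fully removed by depth-wise sparsity

y_plain = plain_block(x, W_zero)       # all information about x is lost
y_resid = residual_block(x, W_zero)    # x is forwarded unchanged
```

With the shortcut in place a zeroed layer reduces to the identity, so it can be dropped from the network without cutting off propagation.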
3.3 Structured sparsity learning for computationally efficient structures
All proposed schemes in section 3.2 can learn a compact DNN for computation cost reduction.
Moreover, some variants of the formulations of these schemes can directly learn structures that can
be efficiently computed.
2D-filter-wise sparsity for convolution. 3D convolution in DNNs essentially is a composition of 2D
convolutions. To perform efficient convolution, we explored a fine-grain variant of filter-wise sparsity,
namely, 2D-filter-wise sparsity, to spatially enforce group Lasso on each 2D filter $W^{(l)}_{n_l,c_l,:,:}$. The
saved convolution is proportional to the percentage of the removed 2D filters. The fine-grain version
of filter-wise sparsity can more efficiently reduce the computation associated with convolution:
because the group sizes are much smaller and thus the weight updating gradients are sharper, it helps
group Lasso to quickly obtain a high ratio of zero groups for a large-scale DNN.
Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in
DNNs is commonly converted to the modality of GEneral Matrix Multiplication (GEMM) by lowering
weight tensors and feature tensors to matrices [18]. For example, in Caffe [19], a 3D filter <<FORMULA>> is
reshaped to a row in the weight matrix where each column is the collection of weights <<FORMULA>>
related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can directly reduce the
dimension of weight matrix in GEMM by removing zero rows and columns. In this context, we use
row-wise and column-wise sparsity as the interchangeable terminology of filter-wise and shape-wise
sparsity, respectively.
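The effect of row-wise and column-wise sparsity on GEMM can be sketched as follows. The matrix sizes and the zeroed rows and columns are made up for illustration; rows stand for filters and columns for shape fibers, as in the lowering described above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))     # lowered weight matrix: 6 filters x 8 fibers
W[[1, 4], :] = 0.0                  # two filters removed (zero rows)
W[:, [0, 3, 5]] = 0.0               # three shape fibers removed (zero columns)
X = rng.standard_normal((8, 10))    # lowered feature matrix

keep_rows = np.flatnonzero(np.any(W != 0, axis=1))
keep_cols = np.flatnonzero(np.any(W != 0, axis=0))

# Dense, smaller GEMM: drop zero rows/columns of W and the matching rows of X.
W_small = W[np.ix_(keep_rows, keep_cols)]
X_small = X[keep_cols, :]

Y_full = W @ X
Y_small = np.zeros_like(Y_full)
Y_small[keep_rows, :] = W_small @ X_small
```

The compacted product is still a dense GEMM, only with smaller operands, which is why it can run on standard BLAS routines without a sparse storage format.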
4 Experiments
We evaluated the effectiveness of our SSL using published models on three databases: MNIST,
CIFAR-10, and ImageNet. Unless otherwise stated, SSL starts with the network whose weights
are initialized by the baseline, and speedups are measured in matrix-matrix multiplication by Caffe on
a single-thread Intel Xeon E5-2630 CPU.
Table 1: Results after penalizing unimportant filters and channels in LeNet
<<TABLE>>
4.1 LeNet and multilayer perceptron on MNIST
In the experiment of MNIST, we examined the effectiveness of SSL in two types of networks:
LeNet [20] implemented by Caffe and a multilayer perceptron (MLP) network. Both networks were
trained without data augmentation.
LeNet: When applying SSL to LeNet, we constrain the network with filter-wise and channel-wise
sparsity in convolutional layers to penalize unimportant filters and channels. Table 1 summarizes
the remaining filters and channels, floating-point operations (FLOP), and practical speedups. In the
table, LeNet 1 is the baseline and the others are the results after applying SSL with different strengths
of structured sparsity regularization. The results show that our method achieves similar error
(±0.1%) with far fewer filters and channels, and saves significant FLOP and computation time.
To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters
in Figure 3. It can be seen that most filters in LeNet 2 are entirely zeroed out except for the five most
important detectors of stroke patterns that are sufficient for feature extraction. The accuracy of
LeNet 3 (that further removes the weakest and redundant stroke detector) drops only 0.2% from that
of LeNet 2. Compared to the random and blurry filter patterns in LeNet 1 that resulted from the high
freedom of parameter space, the filters in LeNet 2 & 3 are regularized and converge to smoother and
more natural patterns. This explains why our proposed SSL obtains the same-level accuracy but has
far fewer filters. The smoothness of the filters is also observed in the deeper layers.
The effectiveness of the shape-wise sparsity on LeNet is summarized in Table 2. The baseline LeNet 1
has conv1 filters with a regular 5x5 square (size = 25) while LeNet 5 reduces the dimension that
can be constrained by a 2x4 rectangle (size = 7). The 3D shape of conv2 filters in the baseline is
also regularized to the 2D shape in LeNet 5 within only one channel, indicating that only one filter in
conv1 is needed. This fact significantly saves FLOP and computation time.
<<FIGURE>>
Figure 3: Learned conv1 filters in LeNet 1 (top), LeNet 2 (middle) and LeNet 3 (bottom)
MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e., the
number of neurons) of fully-connected layers. We enforce the group Lasso regularization on all the
input (or output) connections of each neuron. A neuron whose input connections are all zeroed out
can degenerate to a bias neuron in the next layer; similarly, a neuron can degenerate to a removable
dummy neuron if all of its output connections are zeroed out. Figure 4(a) summarizes the learned
structure and FLOP of different MLP networks. The results show that SSL can not only remove
hidden neurons but also discover the sparsity of images. For example, Figure 4(b) depicts the number
of connections of each input neuron in MLP 2, where 40.18% of input neurons have zero connections
and they concentrate at the boundary of the image. Such a distribution is consistent with our intuition:
Table 2: Results after learning filter shapes in LeNet
<<TABLE>>
Figure 4: (a) Results of learning the number of neurons in MLP. (b) the connection numbers of input
<<FIGURE>>
handwritten digits are usually written in the center and pixels close to the boundary contain little
discriminative classification information.
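The removal of a "dummy" hidden neuron can be checked on a toy MLP. In this sketch the layer sizes, the random weights, and the choice of which neurons are zeroed are all assumptions; it only shows that a hidden neuron whose output connections are all zero can be deleted without changing the network output.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a linear output layer."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)
W2[:, [1, 3]] = 0.0   # pretend group Lasso zeroed every output connection of neurons 1 and 3

# Keep only hidden neurons that still have a nonzero output connection.
keep = np.flatnonzero(np.any(W2 != 0, axis=0))
W1_s, b1_s, W2_s = W1[keep], b1[keep], W2[:, keep]

x = rng.standard_normal(4)
y_full = mlp(x, W1, b1, W2, b2)
y_small = mlp(x, W1_s, b1_s, W2_s, b2)
```

The smaller network needs a 3-row instead of a 5-row first-layer product, which is the kind of reduction in matrix-vector work that the learned structure makes possible.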
4.2 ConvNet and ResNet on CIFAR-10
We implemented the ConvNet of [1] and deep residual networks (ResNet) [5] on CIFAR-10. When
regularizing filters, channels, and filter shapes, the results and observations of both networks are
similar to those of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and
shape-wise sparsity to reduce the dimension of the weight matrix in GEMM of ConvNet. We also learn
the depth-wise sparsity of ResNet to regularize the depth of the DNNs.
ConvNet: We use the network from Alex Krizhevsky et al. [1] as the baseline and implement it
using Caffe. All the configurations remain the same as the original implementation except that we
added a dropout layer with a ratio of 0.5 in the fully-connected layer to avoid over-fitting. ConvNet is
trained without data augmentation. Table 3 summarizes the results of three ConvNet networks. Here,
the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns.
Figure 5 shows their learned conv1 filters. In Table 3, SSL can reduce the size of the weight matrix
in ConvNet 2 by 50%, 70.7% and 36.1% for each convolutional layer and achieve good speedups
without accuracy drop. Surprisingly, without SSL, four conv1 filters of the baseline are actually
all-zeros as shown in Figure 5, demonstrating the great potential of filter sparsity. When SSL is
applied, half of the conv1 filters in ConvNet 2 can be zeroed out without accuracy drop.
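The row and column sparsity defined above, and the resulting reduction in weight-matrix size, can be computed as in the sketch below. The helper name and the small example matrix are hypothetical and do not reproduce any entry of Table 3.

```python
import numpy as np

def row_col_sparsity(Wm):
    """Fractions of all-zero rows and all-zero columns of a weight matrix."""
    row = float(np.mean(~np.any(Wm != 0, axis=1)))
    col = float(np.mean(~np.any(Wm != 0, axis=0)))
    return row, col

Wm = np.ones((4, 5))
Wm[[0, 2], :] = 0.0   # two all-zero rows (removed filters)
Wm[:, 4] = 0.0        # one all-zero column (removed shape fiber)

row_sp, col_sp = row_col_sparsity(Wm)
# After dropping zero rows and columns, the remaining matrix holds this fraction
# of the original entries.
remaining = (1.0 - row_sp) * (1.0 - col_sp)
size_reduction = 1.0 - remaining
```

Row and column sparsity multiply: removing half of the rows and a fifth of the columns leaves 40% of the original matrix, a 60% size reduction.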
On the other hand, in ConvNet 3, SSL achieves 1.0% (±0.16%) lower error with a model even smaller
than the baseline. In this scenario, SSL performs as a structure regularization to dynamically learn a
better network structure (including the number of filters and filter shapes) to reduce the error.
<<FIGURE>>
Figure 5: Learned conv1 filters in ConvNet 1 (top), ConvNet 2 (middle) and ConvNet 3 (bottom)
ResNet: To investigate the necessary depth of DNNs required by SSL, we use a 20-layer deep residual
network (ResNet-20) proposed in [5] as the baseline. The network has 19 convolutional layers and
1 fully-connected layer. Identity shortcuts are utilized to connect the feature maps with the same
dimension while 1×1 convolutional layers are chosen as shortcuts between the feature maps with
different dimensions. Batch normalization [21] is adopted after convolution and before activation.
We use the same data augmentation and training hyper-parameters as those in [5]. The final error of
the baseline is 8.82%. In SSL, the depth of ResNet-20 is regularized by depth-wise sparsity. Group Lasso
regularization is only enforced on the convolutional layers between each pair of shortcut endpoints,
excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers
<<FIGURE>>
Figure 6: Error vs. layer number after depth regularization by SSL. ResNet-# is the original ResNet
in [5] with # layers. SSL-ResNet-# is the depth-regularized ResNet by SSL with # layers, including
the last fully-connected layer. 32×32 indicates the convolutional layers with an output map size of 32×32, and so forth.
with all zero weights are removed and the net is finally fine-tuned with a base learning rate of 0.01.
Figure 6 plots the trend of the error vs. the number of layers under different strengths of depth
regularization. Compared with the original ResNet in [5], SSL learns a ResNet with 14 layers (SSL-
ResNet-14) that reaches a lower error than that of the baseline with 20 layers (ResNet-20);
SSL-ResNet-18 and ResNet-32 achieve an error of 7.40% and 7.51%, respectively. This result implies
that SSL can work as a depth regularization to improve classification accuracy. Note that SSL can
efficiently learn shallower DNNs without accuracy loss to reduce computation cost; however, it
does not mean the depth of the network is not important. The trend in Figure 6 shows that the test
error generally declines as more layers are preserved. A slight error rise of SSL-ResNet-20 from
SSL-ResNet-18 shows the suboptimal selection of the depth in the group of “32x32”.
4.3 AlexNet on ImageNet
To show the generalization of our method to large scale DNNs, we evaluate SSL using AlexNet with
ILSVRC 2012. CaffeNet [19], the replication of AlexNet [1] with minor changes, is used in our
experiment. All training images are rescaled to the size of 256×256. A 227×227 image is randomly
cropped from each scaled image and mirrored for data augmentation and only the center crop is
used for validation. The final top-1 validation error is 42.63%. In SSL, AlexNet is first trained with
structure regularization; when it converges, zero groups are removed to obtain a DNN with the new
structure; finally, the network is fine-tuned without SSL to regain the accuracy.
We first studied 2D-filter-wise and shape-wise sparsity by exploring the trade-offs between
computation complexity and classification accuracy. Figure 7(a) shows the 2D-filter sparsity (the ratio
between the removed 2D filters and total 2D filters) and the saved FLOP of 2D convolutions vs. the
validation error. In Figure 7(a), deeper layers generally have higher sparsity as the group size shrinks
<<FIGURE>>
Figure 7: (a) 2D-filter-wise sparsity and FLOP reduction vs. top-1 error. The vertical dashed line shows the
error of the original AlexNet; (b) the reconstruction error of the weight tensor vs. dimensionality. Principal
Component Analysis (PCA) is utilized to perform dimensionality reduction to exploit filter redundancy.
The eigenvectors corresponding to the largest eigenvalues are selected as the basis of the lower-dimensional
space. Dashed lines denote the results of the baselines and solid lines indicate those of AlexNet 5
in Table 4; (c) speedups of ℓ1-norm and SSL on various CPU and GPU platforms (in labels of the x-axis,
T# is the maximum number of physical threads in the Xeon CPU). AlexNet 1 and AlexNet 2 in Table 4
are used as test benches.
and the number of 2D filters grows. 2D-filter sparsity regularization can reduce the total FLOP by
30%–40% without accuracy loss or reduce the error of AlexNet by ∼1% down to 41.69% by retaining
the original number of parameters. Shape-wise sparsity also obtains similar results. In Table 4, for
example, AlexNet 5 achieves on average 1.4× layer-wise speedup on both CPU and GPU without
accuracy loss after shape regularization; the top-1 error can also be reduced down to 41.83% if
the parameters are retained. In Figure 7(a), the obtained DNN with the lowest error has a very low
sparsity, indicating that the number of parameters in a DNN is still important to maintain learning
capacity. In this case, SSL works as a regularization to add restriction of smoothness to the model in
order to avoid over-fitting. Figure 7(b) compares the results of dimensionality reduction of weight
tensors in the baseline and our SSL-regularized AlexNet. The results show that the smoothness restriction
enforces parameter searching in lower-dimensional space and enables lower rank approximation
of the DNNs. Therefore, SSL can work together with low rank approximation to achieve even higher
model compression.
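The reconstruction-error curve discussed above can be approximated with a truncated SVD. This is a simplified sketch: it works on an uncentered, synthetic weight matrix rather than the paper's PCA on trained filters, and the function name is an assumption.

```python
import numpy as np

def reconstruction_error(Wm, k):
    """Relative Frobenius error of the best rank-k approximation of Wm."""
    U, s, Vt = np.linalg.svd(Wm, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    return float(np.linalg.norm(Wm - approx) / np.linalg.norm(Wm))

rng = np.random.default_rng(0)
dense = rng.standard_normal((6, 8))                                   # no special structure
low_rank = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 8))  # exactly rank 2

errors_dense = [reconstruction_error(dense, k) for k in range(1, 7)]
error_low_rank_k2 = reconstruction_error(low_rank, 2)
```

A matrix whose rows are more redundant reaches a small error at a lower rank, which is the sense in which SSL-regularized filters are said to admit a lower rank approximation.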
Besides the above analyses, the computation efficiencies of structured sparsity and non-structured
sparsity are compared in Caffe using standard off-the-shelf libraries, i.e., Intel Math Kernel Library
on CPU and CUDA cuBLAS and cuSPARSE on GPU. We use SSL to learn an AlexNet with high
column-wise and row-wise sparsity as the representative of structured sparsity methods. ℓ1-norm is
selected as the representative of non-structured sparsity methods instead of connection pruning in
[7] because ℓ1-norm gets a higher sparsity on convolutional layers, as shown by the results of AlexNet 3 and
AlexNet 4 in Table 4. Speedups achieved by SSL are measured by subroutines of GEMM
where nonzero rows and columns in each weight matrix are concatenated in consecutive memory
space. Note that compared to GEMM, the overhead of concatenation can be ignored. To measure the
speedups of ℓ1-norm, sparse weight matrices are stored in the format of Compressed Sparse Row
(CSR) and computed by sparse-dense matrix multiplication subroutines.
Table 4 compares the obtained sparsity and speedups of ℓ1-norm and SSL on CPU (Intel Xeon)
and GPU (GeForce GTX TITAN Black) under approximately the same errors, e.g., with acceptable
or no accuracy loss. For a fair comparison, after ℓ1-norm regularization, the DNN is also
fine-tuned by disconnecting all zero-weighted connections so that 1.39% accuracy is recovered for
AlexNet 1. Our experiments show that the DNNs require a very high non-structured sparsity to achieve
a reasonable speedup (the speedups are even negative when the sparsity is low). SSL, however, can
always achieve positive speedups. With an acceptable accuracy loss, our SSL achieves on average
5.1× and 3.1× layer-wise acceleration on CPU and GPU, respectively. In contrast, ℓ1-norm achieves
on average only 3.0× and 0.9× layer-wise acceleration on CPU and GPU, respectively. We note
that at the same accuracy, our average speedup is indeed higher than that of [6] which adopts heavy
hardware customization to overcome the negative impact of non-structured sparsity. Figure 7(c)
shows the speedups of ℓ1-norm and SSL on various platforms, including both GPU (Quadro, Tesla
Table 4: Sparsity and speedup of AlexNet on ILSVRC 2012
<<TABLE>>
and Titan) and CPU (Intel Xeon E5-2630). SSL can achieve on average ∼3× speedup on GPU while
non-structured sparsity obtains no speedup on GPU platforms. On CPU platforms, both methods can
achieve good speedups and the benefit grows as the processors become weaker. Nonetheless, SSL
can always achieve on average ∼2× speedup compared to non-structured sparsity.
5 Conclusion
In this work, we have proposed a Structured Sparsity Learning (SSL) method to regularize filter,
channel, filter shape, and depth structures in deep neural networks (DNN). Our method can enforce
the DNN to dynamically learn more compact structures without accuracy loss. The structured
compactness of the DNN achieves significant speedups for the DNN evaluation both on CPU
and GPU with off-the-shelf libraries. Moreover, a variant of SSL can be performed as structure
regularization to improve classification accuracy of state-of-the-art DNNs.
Acknowledgments
This work was supported in part by NSF XPS-1337198 and NSF CCF-1615475. The authors thank
Drs. Sheng Li and Jongsoo Park for valuable feedback on this work.
References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. 2012.
[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2014.
[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint
arXiv:1409.4842, 2015.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
arXiv preprint arXiv:1512.03385, 2015.
[6] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional
neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient
neural network. In Advances in Neural Information Processing Systems, pages 1135–1143. 2015.
[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with
pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[9] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting
parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156.
2013.
[10] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure
within convolutional networks for efficient evaluation. In Advances in Neural Information Processing
Systems, pages 1269–1277. 2014.
[11] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with
low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[12] Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training
cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
[13] Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank
regularization. arXiv preprint arXiv:1511.06067, 2015.
[14] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of
the Royal Statistical Society. Series B (Statistical Methodology), 68(1):49–67, 2006.
[15] Seyoung Kim and Eric P Xing. Tree-guided group lasso for multi-task regression with structured sparsity.
In Proceedings of the 27th International Conference on Machine Learning, 2010.
[16] Jiashi Feng and Trevor Darrell. Learning the structure of deep convolutional networks. In The IEEE
International Conference on Computer Vision (ICCV), 2015.
[17] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint
arXiv:1505.00387, 2015.
[18] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and
Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[19] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio
Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv
preprint arXiv:1408.5093, 2014.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
MIXED PRECISION TRAINING
Sharan Narang*, Gregory Diamos, Erich Elsen†
Baidu Research
{sharan, gdiamos}@baidu.com
Paulius Micikevicius*, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston,
Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
NVIDIA
{pauliusm, alben, dagarcia, bginsburg, mhouston,
okuchaiev, gavenkatesh, skyw}@nvidia.com
ABSTRACT
Increasing the size of a neural network typically improves accuracy but also in-
creases the memory and compute requirements for training the model. We intro-
duce methodology for training deep neural networks using half-precision float-
ing point numbers, without losing model accuracy or having to modify hyper-
parameters. This nearly halves memory requirements and, on recent GPUs,
speeds up arithmetic. Weights, activations, and gradients are stored in IEEE half-
precision format. Since this format has a narrower range than single-precision we
propose three techniques for preventing the loss of critical information. Firstly,
we recommend maintaining a single-precision copy of weights that accumulates
the gradients after each optimizer step (this copy is rounded to half-precision for
the forward- and back-propagation). Secondly, we propose loss-scaling to pre-
serve gradient values with small magnitudes. Thirdly, we use half-precision arith-
metic that accumulates into single-precision outputs, which are converted to half-
precision before storing to memory. We demonstrate that the proposed methodology
works across a wide variety of tasks and modern large scale (exceeding 100
million parameters) model architectures, trained on large datasets.
1 INTRODUCTION
Deep Learning has enabled progress in many different applications, ranging from image recognition
(He et al., 2016a) to language modeling (Jozefowicz et al., 2016) to machine translation (Wu et al.,
2016) and speech recognition (Amodei et al., 2016). Two trends have been critical to these results
- increasingly large training data sets and increasingly complex models. For example, the neural
network used in Hannun et al. (2014) had 11 million parameters which grew to approximately 67
million for bidirectional RNNs and further to 116 million for the latest forward only Gated Recurrent
Unit (GRU) models in Amodei et al. (2016).
Larger models usually require more compute and memory resources to train. These requirements
can be lowered by using reduced precision representation and arithmetic. Performance (speed) of
any program, including neural network training and inference, is limited by one of three factors:
arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these
limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of
values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced
precision math. For example, half-precision math throughput in recent GPUs is 2x to 8x higher
than for single-precision. In addition to speed improvements, reduced precision formats also reduce
the amount of memory required for training.
Modern deep learning training systems use single-precision (FP32) format. In this paper, we address
the training with reduced precision while maintaining model accuracy. Specifically, we train various
neural networks using IEEE half-precision format (FP16). Since FP16 format has a narrower
dynamic range than FP32, we introduce three techniques to prevent model accuracy loss: maintain-
ing a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros,
and FP16 arithmetic with accumulation in FP32. Using these techniques we demonstrate that a
wide variety of network architectures and applications can be trained to match the accuracy of FP32
training. Experimental results include convolutional and recurrent network architectures, trained
for classification, regression, and generative tasks. Applications include image classification, image
generation, object detection, language modeling, machine translation, and speech recognition. The
proposed methodology requires no changes to models or training hyper-parameters.
2 RELATED WORK
There have been a number of publications on training Convolutional Neural Networks (CNNs) with
reduced precision. Courbariaux et al. (2015) proposed training with binary weights; all other tensors
and arithmetic were in full precision. Hubara et al. (2016a) extended that work to also binarize
the activations, but gradients were stored and computed in single precision. Hubara et al. (2016b)
considered quantization of weights and activations to 2, 4 and 6 bits; gradients were real numbers.
Rastegari et al. (2016) binarize all tensors, including the gradients. However, all of these approaches
lead to non-trivial loss of accuracy when larger CNN models are trained for the ILSVRC classification
task (Russakovsky et al., 2015). Zhou et al. (2016) quantize weights, activations, and gradients
to different bit counts to further improve result accuracy. This still incurs some accuracy loss and
requires a search over bit width configurations per network, which can be impractical for larger
models. Mishra et al. (2017) improve on the top-1 accuracy achieved by prior weight and activation
quantizations by doubling or tripling the width of layers in popular CNNs. However, the gradients are
still computed and stored in single precision, while quantized model accuracy is lower than that of
the widened baseline. Gupta et al. (2015) demonstrate that 16 bit fixed point representation can be
used to train CNNs on MNIST and CIFAR-10 datasets without accuracy loss. It is not clear how
this approach would work on the larger CNNs trained on large datasets or whether it would work for
Recurrent Neural Networks (RNNs).
There have also been several proposals to quantize RNN training. He et al. (2016c) train quantized
variants of the GRU (Cho et al., 2014) and Long Short Term Memory (LSTM) (Hochreiter and
Schmidhuber, 1997) cells to use fewer bits for weights and activations, albeit with a small loss in
accuracy. It is not clear whether their results hold for larger networks needed for larger datasets.
Hubara et al. (2016b) propose another approach to quantize RNNs without altering their structure.
Another approach to quantize RNNs is proposed in Ott et al. (2016). They evaluate binary, ternary
and exponential quantization for weights in various different RNN models trained for language
modelling and speech recognition. All of these approaches leave the gradients unmodified in single-
precision and therefore the computation cost during back propagation is unchanged.
The techniques proposed in this paper are different from the above approaches in three aspects.
First, all tensors and arithmetic for forward and backward passes use reduced precision, FP16 in
our case. Second, no hyper-parameters (such as layer width) are adjusted. Lastly, models trained
with these techniques do not incur accuracy loss when compared to single-precision baselines. We
demonstrate that this technique works across a variety of applications using state-of-the-art models
trained on large scale datasets.
3 IMPLEMENTATION
We introduce the key techniques for training with FP16 while still matching the model accuracy of
an FP32 training session: single-precision master weights and updates, loss-scaling, and accumulating
FP16 products into FP32. Results of training with these techniques are presented in Section 4.
3.1 FP32 MASTER COPY OF WEIGHTS
In mixed precision training, weights, activations and gradients are stored as FP16. In order to match
the accuracy of the FP32 networks, an FP32 master copy of weights is maintained and updated with
the weight gradient during the optimizer step. In each iteration an FP16 copy of the master weights is
used in the forward and backward pass, halving the storage and bandwidth needed by FP32 training.
Figure 1 illustrates this mixed precision training process.
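The flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear least-squares model, the toy data, and the name `mixed_precision_step` are hypothetical and chosen only to show where the FP16 copy and the FP32 master copy are used.

```python
import numpy as np

def mixed_precision_step(master_w, x, y, lr):
    """One SGD step on a toy linear model: FP16 compute, FP32 master weights."""
    # Round the FP32 master weights to FP16 for this iteration.
    w16 = master_w.astype(np.float16)
    x16 = x.astype(np.float16)
    y16 = y.astype(np.float16)

    # Forward and backward passes operate on FP16 tensors.
    err16 = x16 @ w16 - y16                         # residuals (FP16)
    grad16 = (x16.T @ err16) / np.float16(len(x))   # weight gradient (FP16)

    # The optimizer step is applied to the FP32 master copy.
    return master_w - np.float32(lr) * grad16.astype(np.float32)

# Toy data (assumption): 64 samples, 4 features, noiseless linear targets.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4)).astype(np.float32)
w_true = np.array([1.0, -2.0, 0.5, 3.0], dtype=np.float32)
y = x @ w_true

w = np.zeros(4, dtype=np.float32)
for _ in range(500):
    w = mixed_precision_step(w, x, y, lr=0.1)
print(w)  # close to w_true, up to FP16 rounding in the forward/backward pass
```

Only the update line touches FP32 state; everything the forward and backward passes read or write is FP16.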
While the need for FP32 master weights is not universal, there are two possible reasons why a
number of networks require it. One explanation is that updates (weight gradients multiplied by the
learning rate) become too small to be represented in FP16 - any value whose magnitude is smaller
than 2^{-24} becomes zero in FP16. We can see in Figure 2b that approximately 5% of weight gradient
values have exponents smaller than -24. These small valued gradients would become zero in the
optimizer when multiplied with the learning rate and adversely affect the model accuracy. Using a
single-precision copy for the updates allows us to overcome this problem and recover the accuracy.
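The underflow described above is easy to reproduce with NumPy's `float16` type. The gradient and learning-rate values below are assumptions picked so that their product is 2^{-26}; they are not taken from the experiments.

```python
import numpy as np

grad = np.float32(2.0 ** -20)   # small gradient, still representable in FP16
lr = np.float32(2.0 ** -6)      # hypothetical learning rate

update32 = lr * grad                           # 2^-26 is representable in FP32
update16 = np.float16(lr) * np.float16(grad)   # 2^-26 rounds to zero in FP16

smallest16 = np.float16(2.0 ** -24)            # smallest positive FP16 value

print(update32)    # nonzero
print(update16)    # 0.0
print(smallest16)  # nonzero
```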
Another explanation is that the ratio of the weight value to the weight update is very large. In
this case, even though the weight update is representable in FP16, it could still become zero when
addition operation right-shifts it to align the binary point with the weight. This can happen when
the magnitude of a normalized weight value is at least 2048 times larger than that of the weight update.
Since FP16 has 10 bits of mantissa, the implicit bit must be right-shifted by 11 or more positions to
potentially create a zero (in some cases rounding can recover the value). In cases where the ratio is
larger than 2048, the implicit bit would be right-shifted by 12 or more positions. This will cause the
weight update to become a zero which cannot be recovered. An even larger ratio will result in this
effect for de-normalized numbers. Again, this effect can be counteracted by computing the update
in FP32.
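A short NumPy sketch of this second effect, with values chosen for illustration only: a weight of 1.0 and an update of 2^{-12}, a ratio of 4096. The update is representable in FP16 on its own, yet adding it to the weight in FP16 changes nothing, while an FP32 accumulator retains every update.

```python
import numpy as np

update = 2.0 ** -12          # representable in FP16, but 4096x smaller than the weight

w16 = np.float16(1.0)        # weight kept in FP16
w32 = np.float32(1.0)        # FP32 master copy of the same weight
for _ in range(4096):
    w16 = np.float16(w16 + np.float16(update))   # update is rounded away each time
    w32 = np.float32(w32 + np.float32(update))   # update is accumulated exactly

print(w16)  # 1.0
print(w32)  # 2.0
```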
To illustrate the need for an FP32 master copy of weights, we use the Mandarin speech model
(described in more detail in Section 4.3) trained on a dataset comprising approximately 800 hours
of speech data for 20 epochs. As shown in Figure 2a, we match FP32 training results when updating an
FP32 master copy of weights after FP16 forward and backward passes, while updating FP16 weights
results in 80% relative accuracy loss.
Even though maintaining an additional copy of weights increases the memory requirements for the
weights by 50% compared with single precision training, impact on overall memory usage is much
smaller. For training, memory consumption is dominated by activations, due to larger batch sizes
and activations of each layer being saved for reuse in the back-propagation pass. Since activations
are also stored in half-precision format, the overall memory consumption for training deep neural
networks is roughly halved.
3.2 LOSS SCALING
FP16 exponent bias centers the range of normalized value exponents to [-14, 15] while gradient
values in practice tend to be dominated by small magnitudes (negative exponents). For example,
consider Figure 3 showing the histogram of activation gradient values, collected across all layers
during FP32 training of Multibox SSD detector network (Liu et al., 2015a). Note that much of
the FP16 representable range was left unused, while many values were below the minimum representable
range and became zeros. Scaling up the gradients will shift them to occupy more of the
representable range and preserve values that are otherwise lost to zeros. This particular network
diverges when gradients are not scaled, but scaling them by a factor of 8 (increasing the exponents
by 3) is sufficient to match the accuracy achieved with FP32 training. This suggests that activation
gradient values below 2^{-27} in magnitude were irrelevant to the training of this model, but values in
the [2^{-27}, 2^{-24}) range were important to preserve.
One efficient way to shift the gradient values into FP16-representable range is to scale the loss value
computed in the forward pass, prior to starting back-propagation. By the chain rule, back-propagation
ensures that all the gradient values are scaled by the same amount. This requires no extra operations
during back-propagation and keeps the relevant gradient values from becoming zeros. Weight gradients
must be unscaled before weight update to maintain the update magnitudes as in FP32 training. It
is simplest to perform this unscaling right after the backward pass but before gradient clipping or any
other gradient-related computations, ensuring that no hyper-parameters (such as gradient clipping
threshold, weight decay, etc.) have to be adjusted.
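The chain-rule argument can be checked on a two-factor toy gradient. The numbers are assumptions chosen so that the unscaled product is 2^{-26}, which FP16 cannot hold; the scaling factor of 8 matches the one quoted above for the SSD network.

```python
import numpy as np

S = np.float16(8.0)              # loss-scaling factor
dL_dy = np.float16(2.0 ** -20)   # upstream gradient of the unscaled loss
dy_dx = np.float16(2.0 ** -6)    # local derivative of one layer

# Unscaled FP16 back-propagation: the product 2^-26 underflows to zero.
unscaled = dL_dy * dy_dx

# Scaling the loss multiplies the upstream gradient by S, and by the
# chain rule every downstream gradient is multiplied by the same S.
scaled = (S * dL_dy) * dy_dx     # 2^-23, representable in FP16

# Unscale in FP32 before the weight update.
recovered = np.float32(scaled) / np.float32(S)

print(unscaled)    # 0.0
print(recovered)   # 2^-26, preserved
```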
There are several options to choose the loss scaling factor. The simplest one is to pick a constant
scaling factor. We trained a variety of networks with scaling factors ranging from 8 to 32K
(many networks did not require a scaling factor). A constant scaling factor can be chosen empirically
or, if gradient statistics are available, directly by choosing a factor so that its product with
the maximum absolute gradient value is below 65,504 (the maximum value representable in FP16).
There is no downside to choosing a large scaling factor as long as it does not cause overflow during
back-propagation - overflows will result in infinities and NaNs in the weight gradients which will
irreversibly damage the weights after an update. Note that overflows can be efficiently detected by
inspecting the computed weight gradients, for example, when weight gradient values are unscaled.
One option is to skip the weight update when an overflow is detected and simply move on to the
next iteration.
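The two ideas in this paragraph, picking a constant factor from gradient statistics and skipping the update on overflow, can be sketched as follows. The helper names `choose_loss_scale` and `unscale_and_update` are hypothetical, and restricting the factor to powers of two is an assumption made here for convenience, not a requirement stated in the text.

```python
import math
import numpy as np

FP16_MAX = 65504.0  # maximum finite value representable in FP16

def choose_loss_scale(max_abs_grad):
    """Largest power of two whose product with max_abs_grad stays below FP16_MAX."""
    scale = 2.0 ** math.floor(math.log2(FP16_MAX / max_abs_grad))
    # Guard against rounding in log2 and keep the product strictly below the limit.
    while scale * max_abs_grad >= FP16_MAX:
        scale /= 2.0
    return scale

def unscale_and_update(master_w, scaled_grad16, scale, lr):
    """Unscale FP16 gradients in FP32; skip the update if any value overflowed."""
    grad32 = scaled_grad16.astype(np.float32) / np.float32(scale)
    if not np.all(np.isfinite(grad32)):
        return master_w, False               # overflow detected: skip this iteration
    return master_w - np.float32(lr) * grad32, True

scale = choose_loss_scale(0.5)               # 65536.0 for a maximum gradient of 0.5

w = np.ones(2, dtype=np.float32)
ok_grad = np.array([8.0, -16.0], dtype=np.float16)    # finite scaled gradients
bad_grad = np.array([np.inf, 1.0], dtype=np.float16)  # overflowed during back-propagation

w_ok, applied_ok = unscale_and_update(w, ok_grad, scale=8.0, lr=0.5)
w_bad, applied_bad = unscale_and_update(w, bad_grad, scale=8.0, lr=0.5)
```

In the second call the weights are returned unchanged, which is the "skip and move on" behavior described above.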
<<FIGURE>>
Figure 2: Figure 2a shows the results of three experiments; baseline (FP32), pseudo FP16 with
FP32 master copy, pseudo FP16 without FP32 master copy. Figure 2b shows the histogram for the
exponents of weight gradients for Mandarin speech recognition training with FP32 weights. The
gradients are sampled every 4,000 iterations during training for all the layers in the model.
<<FIGURE>>
Figure 3: Histogram of activation gradient values during the training of Multibox SSD network.
Note that the bins on the x-axis cover varying ranges and there's a separate bin for zeros. For
example, 2% of the values are in the [2^{-34}, 2^{-32}) range, 2% of values are in the [2^{-24}, 2^{-23}) range,
and 67% of values are zero.
3.3 ARITHMETIC PRECISION
By and large neural network arithmetic falls into three categories: vector dot-products, reductions,
and point-wise operations. These categories benefit from different treatment when it comes to
reduced precision arithmetic. To maintain model accuracy, we found that some networks require that
FP16 vector dot-product accumulates the partial products into an FP32 value, which is converted
to FP16 before writing to memory. Without this accumulation in FP32, some FP16 models did not
match the accuracy of the baseline models. Whereas previous GPUs supported only FP16 multiply-
add operation, NVIDIA Volta GPUs introduce Tensor Cores that multiply FP16 input matrices and
accumulate products into either FP16 or FP32 outputs (NVIDIA, 2017).
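Why the accumulator precision matters can be seen with a dot product of two all-ones vectors. The loop below is a software emulation written for illustration; it does not model Tensor Core hardware, and the vector length of 4096 is an assumption chosen to make the effect visible.

```python
import numpy as np

a = np.ones(4096, dtype=np.float16)
b = np.ones(4096, dtype=np.float16)

acc16 = np.float16(0.0)   # FP16 accumulator
acc32 = np.float32(0.0)   # FP32 accumulator
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)               # partial sums rounded to FP16
    acc32 = acc32 + np.float32(x) * np.float32(y)   # partial sums kept in FP32

result32 = np.float16(acc32)  # convert to FP16 only when writing the output

print(acc16)     # 2048.0: adding 1 to 2048 is rounded away in FP16
print(result32)  # 4096.0
```

The FP16 accumulator stalls once the spacing between adjacent FP16 values exceeds the size of each partial product.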
Large reductions (sums across elements of a vector) should be carried out in FP32. Such reductions
mostly come up in batch-normalization layers when accumulating statistics and softmax layers.
Both of the layer types in our implementations still read and write FP16 tensors from memory,
performing the arithmetic in FP32. This did not slow down the training process since these layers
are memory-bandwidth limited and not sensitive to arithmetic speed.
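A softmax written in this style might look like the sketch below: FP16 tensors at the boundaries, FP32 arithmetic for the reduction. The function name `softmax_fp16_io` and the test logits are hypothetical.

```python
import numpy as np

def softmax_fp16_io(logits16):
    """Softmax that reads and writes FP16 but performs the reduction in FP32."""
    z = logits16.astype(np.float32)
    z = z - z.max()                     # standard stabilisation before exponentiation
    e = np.exp(z)
    p = e / e.sum(dtype=np.float32)     # the large reduction is carried out in FP32
    return p.astype(np.float16)         # convert once, when writing the result

logits = np.linspace(-4.0, 4.0, 1000).astype(np.float16)
probs = softmax_fp16_io(logits)
total = probs.astype(np.float32).sum()  # close to 1, up to FP16 rounding of the outputs
```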
Point-wise operations, such as non-linearities and element-wise matrix products, are memory-
bandwidth limited. Since arithmetic precision does not impact the speed of these operations, either
FP16 or FP32 math can be used.
4 RESULTS
We have run experiments for a variety of deep learning tasks covering a wide range of deep learning
models. We conducted the following experiments for each application:
• Baseline (FP32): Single-precision storage is used for activations, weights and gradients.
All arithmetic is also in FP32.
• Mixed Precision (MP): FP16 is used for storage and arithmetic. Weights, activations and
gradients are stored in FP16; an FP32 master copy of weights is used for updates.
Loss-scaling is used for some applications. Experiments with FP16 arithmetic used Tensor
Core operations with accumulation into FP32 for convolutions, fully-connected layers, and
matrix multiplies in recurrent layers.
The Baseline experiments were conducted on NVIDIA's Maxwell or Pascal GPU. Mixed Precision
experiments were conducted on Volta V100 that accumulates FP16 products into FP32. The mixed
precision speech recognition experiments (Section 4.3) were conducted using Maxwell GPUs using
FP16 storage only. This setup allows us to emulate the TensorCore operations on non-Volta hard-
ware. A number of networks were trained in this mode to confirm that resulting model accuracies
are equivalent to MP training run on Volta V100 GPUs. This is intuitive since MP arithmetic was
accumulating FP16 products into FP32 before converting the result to FP16 on a memory write.
4.1 CNNs FOR ILSVRC CLASSIFICATION
We trained several CNNs for the ILSVRC classification task (Russakovsky et al., 2015) using mixed
precision: Alexnet, VGG-D, GoogLeNet, Inception v2, Inception v3, and pre-activation Resnet-50.
In all of these cases we were able to match the top-1 accuracy of the baseline FP32 training session
using identical hyper-parameters. Networks were trained using Caffe (Jia et al., 2014) framework
modified to use Volta TensorOps, except for Resnet50 which used PyTorch (Paszke et al., 2017).
Training schedules were used from public repositories, when available (training schedule for VGG-D
has not been published). Top-1 accuracies on the ILSVRC validation set are shown in Table 1. Baseline
(FP32) accuracy in a few cases is different from published results due to single-crop testing and a
simpler data augmentation. Our data augmentation in Caffe included random horizontal flipping and
random cropping from 256x256 images; Resnet50 training in PyTorch used the full augmentation in
the training script from the PyTorch vision repository.
Table 1: ILSVRC12 classification top-1 accuracy.
<<TABLE>>
Loss-scaling technique was not required for successful mixed precision training of these networks.
While all tensors in the forward and backward passes were in FP16, a master copy of weights was
updated in FP32 as outlined in Section 3.1.
4.2 DETECTION CNNs
Object detection is a regression task, where bounding box coordinate values are predicted by the
network (compared to classification, where the predicted values are passed through a softmax layer
to convert them to probabilities). Object detectors also have a classification component, where prob-
abilities for an object type are predicted for each bounding box. We trained two popular detection
approaches: Faster-RCNN (Ren et al., 2015) and Multibox-SSD (Liu et al., 2015a). Both detectors
used VGG-16 network as the backbone. Models and training scripts were from public repositories
(Girshick; Liu). Mean average precision (mAP) was computed on Pascal VOC 2007 test set. Faster-
RCNN was trained on VOC 2007 training set, whereas SSD was trained on a union of VOC 2007
and 2012 data, which is the reason behind baseline mAP difference in Table 2.
Table 2: Detection network mean average precision.
<<TABLE>>
As can be seen in Table 2, the SSD detector failed to train in FP16 without loss-scaling. Because
small gradient values are lost to zeros, as described in Section 3.2, poor weights are learned and
training diverges. A loss-scaling factor of 8 recovers the relevant gradient values
and mixed-precision training matches FP32 mAP.
4.3 SPEECH RECOGNITION
We explore mixed precision training for speech data using the DeepSpeech 2 model for both English
and Mandarin datasets. The model used for training on the English dataset consists of two 2D con-
volution layers, three recurrent layers with GRU cells, 1 row convolution layer and Connectionist
temporal classification (CTC) cost layer (Graves et al., 2006). It has approximately 115 million
parameters. This model is trained on our internal dataset consisting of 6000 hours of English speech.
The Mandarin model has a similar architecture with a total of 215 million parameters. The Man-
darin model was trained on 2600 hours of our internal training set. For these models, we run the
Baseline and Pseudo FP16 experiments. All the models were trained for 20 epochs using Nesterov
Stochastic Gradient Descent (SGD). All hyper-parameters such as learning rate, annealing schedule
and momentum were the same for baseline and pseudo FP16 experiments. Table 3 shows the results
of these experiments on independent test sets.
Table 3: Character Error Rate (CER) using mixed precision training for speech recognition. English
results are reported on the WSJ 92 test set. Mandarin results are reported on our internal test set.
<<TABLE>>
Similar to classification and detection networks, mixed precision training works well for recurrent
neural networks trained on large scale speech datasets. These speech models are the largest models
trained using this technique. Also, the number of time-steps involved in training a speech model is
unusually large compared to other applications using recurrent layers. As shown in Table 3, Pseudo
FP16 results are roughly 5 to 10% better than the baseline. This suggests that the half-precision
storage format may act as a regularizer during training.
<<TABLE>>
Figure 4: English to French translation network training perplexity, 3x1024 LSTM model with
attention. Ref1, ref2 and ref3 represent three different FP32 training runs.
4.4 MACHINE TRANSLATION
For language translation we trained several variants of the model in TensorFlow tutorial for
English to French translation (Google). The model used word-vocabularies, 100K and 40K entries for
English and French, respectively. The networks we trained had 3 or 5 layers in the encoder and
decoder, each. In both cases a layer consisted of 1024 LSTM cells. SGD optimizer was used to
train on WMT15 dataset. There was a noticeable variation in accuracy of different training sessions
with the same settings. For example, see the three FP32 curves in Figure 4, which shows the 3-layer
model. Mixed-precision with loss-scaling matched the FP32 results, while no loss-scaling resulted
in a slight degradation in the results. The 5-layer model exhibited the same training behavior.
4.5 LANGUAGE MODELING
We trained an English language model, designated as big LSTM (Jozefowicz et al., 2016), on the 1
billion word dataset. The model consists of two layers of 8192 LSTM cells with projection to a
1024-dimensional embedding. This model was trained for 50 epochs using the Adagrad optimizer.
The vocabulary size is 793K words. During training, we use a sampled softmax layer with 8K
negative samples. Batch size aggregated over 4 GPUs is 1024. To match FP32 perplexity, training
this network with FP16 requires loss-scaling, as shown in Figure 5. Without loss scaling the training
perplexity curve for FP16 training diverges, compared with the FP32 training, after 300K iterations.
Scaling factor of 128 recovers all the relevant gradient values and the accuracy of FP16 training
matches the baseline run.
4.6 DCGAN RESULTS
Generative Adversarial Networks (GANs) combine regression and discrimination tasks during train-
ing. For image tasks, the generator network regresses pixel colors. In our case, the generator predicts
three channels of 8-bit color values each. The network was trained to generate 128x128 pixel im-
ages of faces, using DCGAN methodology (Radford et al., 2015) and CelebFaces dataset (Liu et al.,
2015b). The generator had 7 layers of fractionally-strided convolutions, 6 with leaky ReLU activations,
1 with tanh. The discriminator had 6 convolutions, and 2 fully-connected layers. All used
leaky ReLU activations except for the last layer, which used sigmoid. Batch normalization was applied
to all layers except the last fully-connected layer of the discriminator. Adam optimizer was
used to train for 100K iterations. A set of output images is shown in Figure 6. Note that we show a randomly
selected set of output images, whereas GAN publications typically show a curated set of outputs by
excluding poor examples. Unlike other networks covered in this paper, GANs do not have a widely-
accepted quantification of their result quality. Qualitatively the outputs of FP32 and mixed-precision
training appear comparable. This network did not require loss-scaling to match FP32 results.
<<FIGURE>>
Figure 5: bigLSTM training perplexity
<<FIGURE>>
Figure 6: An uncurated set of face images generated by DCGAN. FP32 training (left) and mixed-
precision training (right).
5 CONCLUSIONS AND FUTURE WORK
Mixed precision training is an important technique that allows us to reduce the memory consumption
as well as time spent in memory and arithmetic operations of deep neural networks. We have
demonstrated that many different deep learning models can be trained using this technique with no
loss in accuracy without any hyper-parameter tuning. For certain models with a large number of
small gradient values, we introduce the gradient scaling method to help them converge to the same
accuracy as FP32 baseline models.
DNN operations benchmarked with DeepBench on Volta GPU see 2-6x speedups compared to
FP32 implementations if they are limited by memory or arithmetic bandwidth. Speedups are lower
when operations are latency-limited. Full network training and inference speedups depend on library
and framework optimizations for mixed precision and are a focus of future work (experiments in this
paper were carried out with early versions of both libraries and frameworks).
We would also like to extend this work to include generative models like text-to-speech systems
and deep reinforcement learning applications. Furthermore, automating loss-scaling factor selection
would further simplify training with mixed precision. Loss-scaling factor could be dynamically
increased or decreased by inspecting the weight gradients for overflow, skipping weight updates
when an overflow is detected.
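One way such automation could look is sketched below. The specific policy (halve the factor on overflow, double it after a fixed number of overflow-free steps) and the names `DynamicLossScaler`, `init_scale`, and `growth_interval` are assumptions for illustration; the text above only states that the factor could be raised or lowered based on overflow checks.

```python
import numpy as np

class DynamicLossScaler:
    """Hypothetical policy: halve the scale on overflow, double it after a clean run."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads):
        """Inspect weight gradients; return True if the update should be applied."""
        overflow = not all(np.all(np.isfinite(g)) for g in grads)
        if overflow:
            # Back off and tell the caller to skip this weight update.
            self.scale = max(self.scale / 2.0, 1.0)
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # No overflow for a while: try a larger factor.
            self.scale *= 2.0
            self._clean_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
good = [np.ones(3, dtype=np.float16)]
bad = [np.array([np.inf], dtype=np.float16)]

applied_bad = scaler.update(bad)     # overflow: skip the update, scale drops to 4.0
scale_after_bad = scaler.scale
scaler.update(good)
applied_good = scaler.update(good)   # two clean steps: scale grows back to 8.0
scale_after_good = scaler.scale
```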
REFERENCES
D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski,
A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in english and
mandarin. In Proceedings of The 33rd International Conference on Machine Learning, pages
173-182, 2016.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio.
Learning phrase representations using rnn encoder-decoder for statistical machine translation.
arXiv preprint arXiv:1406.1078, 2014.
M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with
binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages
3123-3131. Curran Associates, Inc., 2015. URL
http://papers.nips.cc/paper/5647-binaryconnect-training-deep-neural-networks-with-binary-weights-during-propagations.pdf.
R. Girshick. Faster r-cnn github repository. https://github.com/rbgirshick/py-faster-rcnn.
Google. Tensorflow tutorial: Sequence-to-sequence models. URL
https://www.tensorflow.org/tutorials/seq2seq.
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification:
labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd
international conference on Machine learning, pages 369-376. ACM, 2006.
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical
precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15),
pages 1737-1746, 2015.
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh,
S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint
arXiv:1412.5567, 2014.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016a.
K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016b.
Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou. Effective quantization methods for
recurrent neural networks. arXiv preprint arXiv:1611.10176, 2016c.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735-1780, Nov.
1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL
http://dx.doi.org/10.1162/neco.1997.9.8.1735.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In
Advances in Neural Information Processing Systems, pages 4107-4115, 2016a.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks:
Training neural networks with low precision weights and activations. arXiv preprint
arXiv:1609.07061, 2016b.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of
JMLR Workshop and Conference Proceedings, pages 448-456. JMLR.org, 2015. URL
http://dblp.uni-trier.de/db/conf/icml/icml2015.html#IoffeS15.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell.
Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093,
2014.
R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language
modeling, 2016. URL https://arxiv.org/pdf/1602.02410.pdf.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,
editors, Advances in Neural Information Processing Systems 25, pages 1097-1105.
Curran Associates, Inc., 2012. URL
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
W. Liu. Ssd github repository. https://github.com/weiliu89/caffe/tree/ssd.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. Ssd: Single shot multibox detector.
CoRR, abs/1512.02325, 2015a. URL
http://dblp.uni-trier.de/db/journals/corr/corr1512.html#LiuAESR15.
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of
International Conference on Computer Vision (ICCV), 2015b.
A. Mishra, E. Nurvitadhi, J. Cook, and D. Marr. Wrpn: Wide reduced-precision networks. arXiv
preprint arXiv:1709.01134, 2017.
NVIDIA. Nvidia tesla v100 gpu architecture.
https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf,
2017.
J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio. Recurrent neural networks with limited numerical
precision. arXiv preprint arXiv:1608.06902, 2016.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga,
and A. Lerer. Automatic differentiation in pytorch. 2017.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional
generative adversarial networks. CoRR, abs/1511.06434, 2015. URL
http://dblp.uni-trier.de/db/journals/corr/corr1511.html#RadfordMC15.
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using
Binary Convolutional Neural Networks, pages 525-542. Springer International Publishing, Cham,
2016. ISBN 978-3-319-46493-0. doi: 10.1007/978-3-319-46493-0_32. URL
https://doi.org/10.1007/978-3-319-46493-0_32.
S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV), 115(3):211-252, 2015. doi:
10.1007/s11263-015-0816-y.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR),
2015. URL http://arxiv.org/abs/1409.4842.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture
for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2016.
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao,
K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human
and machine translation. arXiv preprint arXiv:1609.08144, 2016.
S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. Dorefa-net: Training low bitwidth convolutional
neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL
http://arxiv.org/abs/1606.06160.
<<END>> <<END>> <<END>>
<<START>> <<START>> <<START>>
Learning to Generalize
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING
MANFRED OPPER
Neural Computation Research Group Aston University Birmingham B4 7ET, United Kingdom
Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.
Introduction
Neural networks learn from examples. This statement is obviously true for the brain, but artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, also adapt their synaptic couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent of the problem. The parameters which control the interaction among the units can be changed during the learning phase, and these are often called synaptic couplings. After the learning phase, a network acquires some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule.
To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data. How well will the trained network be able to classify an input that it has not seen before?
This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: The network may not be sufficiently complex to learn the rule completely, or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples, together with the classification of these examples, is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data.
Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data.
A second important approach, which was introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, they are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide results which are too pessimistic and also too crude to reveal interesting behavior in the intermediate region of the learning curve.
In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables the exact calculation of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures.
At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles, and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feedforward neural networks.
Artificial Neural Networks
Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers.
Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multiplied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, <<FORMULA>>, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between -1 and +1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function.
Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be approximated arbitrarily closely when the number of hidden units is sufficiently large.
<<FIGURE>>
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.
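The computation of a single unit described above (weighted sum followed by an activation function) can be sketched in a few lines. This is only an illustration: the numerical weights and inputs are hypothetical and are not the values shown in Fig. 1a, and tanh is an assumed choice for the sigmoidal function, which the text does not specify.

```python
import math


def weighted_sum(weights, inputs):
    """Multiply every incoming value x_i by its weight w_i and add the results."""
    return sum(w * x for w, x in zip(weights, inputs))


def step(h):
    """Step activation: a hard decision between -1 and +1."""
    return 1.0 if h >= 0 else -1.0


def sigmoid(h):
    """Sigmoidal activation (tanh is an assumed choice): soft values in (-1, 1)."""
    return math.tanh(h)


def linear(h):
    """Linear activation, as used for fitting continuous functions."""
    return h


def unit_output(weights, inputs, activation=step):
    """Output of one computing unit for a given activation function."""
    return activation(weighted_sum(weights, inputs))


# Hypothetical example with three couplings.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 0.5]
for name, f in (("step", step), ("sigmoid", sigmoid), ("linear", linear)):
    print(name, unit_output(weights, inputs, f))
```

The three activation functions receive the same weighted sum; only the final mapping to the unit's output differs.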
The Perceptron
The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings <<FORMULA>>, and the output is simply
<<FORMULA>>
It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can give a nontrivial generalization performance for many learning problems and may be used as a first approach to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines.
<<FIGURE>>
FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.
To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights <<FORMULA>>, for <<FORMULA>>). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learned by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign, and we decrease them for the opposite sign. This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of a learning process in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classifies all of the examples correctly (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant.
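A minimal sketch of the update rule just described is given below. It assumes inputs with components +1 or -1, so that "increase the weight when input and correct output have the same sign, decrease it otherwise" becomes the update w_i <- w_i + eta * y * x_i. The names `train_perceptron`, `eta`, and `max_cycles`, the teacher vector used to produce learnable labels, and the convention that a zero sum counts as +1 are illustrative assumptions, not part of the original description.

```python
import random


def sign(h):
    """Step function; a sum of exactly zero is counted as +1 (an assumed convention)."""
    return 1 if h >= 0 else -1


def train_perceptron(patterns, labels, eta=1.0, max_cycles=1000):
    """Present the patterns in cycles; alter all couplings after every error."""
    n = len(patterns[0])
    w = [0.0] * n
    for _ in range(max_cycles):
        errors = 0
        for x, y in zip(patterns, labels):
            output = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if output != y:
                # Same sign of input and correct output: increase; else decrease.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:
            return w, True  # every example is classified correctly
    return w, False


# Learnable toy problem: labels come from a hypothetical teacher vector.
rng = random.Random(0)
N, m = 10, 30
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(m)]
labels = [sign(sum(t * x for t, x in zip(teacher, p))) for p in patterns]

w, converged = train_perceptron(patterns, labels)
print("converged:", converged)
```

Because the labels are produced by a perceptron, a solution exists, and the loop stops as soon as one complete cycle produces no error.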
It is often useful to obtain an intuition of a perceptron's classification performance by thinking in terms of a geometric picture. We may view the numerical values of the inputs as the coordinates of a point in some (usually) high-dimensional space. The case of two dimensions is shown in Fig. 2b. A corresponding point is also constructed for the couplings w_i. The arrow which points from the origin of the coordinate system to this latter point is called the weight vector or coupling vector. An application of linear algebra to the computation of the network shows that the line which is perpendicular to the coupling vector is the boundary between inputs belonging to the two different classes. Input points which are on the same side as the coupling vector are classified as +1 (the green region in Fig. 2b) and those on the other side as -1 (red region in Fig. 2b).
Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line in the previous two-dimensional example. We can still obtain an intuitive picture by projecting onto two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled red and blue) in a 200-dimensional input space are projected onto the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm), we obtain the view shown in Fig. 3b, in which red and green points are clearly separated and there is even a gap between the two clouds.
It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron. If this happens, we must attempt to determine the choice of the couplings which minimizes the number of errors on a given set of examples. Here, Rosenblatt's algorithm does not work, and the problem of finding the minimum is much more difficult from the algorithmic point of view. The training error, which is the number of errors made on the training set, is usually a nonsmooth function of the network couplings (i.e., it may have large variations for small changes of the couplings). Hence, apart from the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is usually a difficult task which could take a large amount of computer time. In practice, however, iterative approaches which are based on the minimization of other, smooth cost functions are used to train a neural network (Bishop, 1995).
As previously shown, perceptrons are only able to realize a very restricted type of classification rule, the so-called linearly separable ones. Hence, independently of the issue of finding the best algorithm to learn the rule, one may ask the following question: In how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning.
<<FIGURE>>
FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x1 and x2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron.
Capacity, VC Dimension, and Worst-Case Generalization
In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated, for any set of m input patterns, the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished:
Region in which m/N ≤ 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters).
Region in which m/N > 1: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), almost all mappings can be learned if the network is large enough. If the output labels for each of the m inputs are chosen randomly as +1 or -1 with equal probability, the probability of finding a nonrealizable mapping goes to zero exponentially when N goes to infinity at fixed ratio m/N.
Region in which m/N > 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases to zero rapidly, and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to <<FORMULA>>, where the function <<FORMULA>> vanishes for α < 2 and is positive for α > 2, with α = m/N). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size.
<<FIGURE>>
FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red).
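The fraction plotted in Fig. 4 can be computed directly. The article does not write the formula out; the sketch below assumes the standard form of Cover's counting result for m input patterns in general position, C(m, N) = 2 * sum_{k=0}^{N-1} binom(m-1, k), divided by the 2^m possible mappings. The function names are illustrative.

```python
from math import comb


def separable_mappings(m, n):
    """Number of linearly separable mappings of m patterns (general position), N = n."""
    return 2 * sum(comb(m - 1, k) for k in range(n))


def separable_fraction(m, n):
    """Fraction of the 2**m possible mappings that a perceptron can realize."""
    return separable_mappings(m, n) / 2 ** m


# Fraction as a function of m/N for the network sizes used in Fig. 4.
for n in (10, 20, 100):
    row = [round(separable_fraction(int(ratio * n), n), 4)
           for ratio in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
    print("N =", n, row)
```

The fraction equals 1 up to m = N and passes through exactly one half at m = 2N; the drop around m/N = 2 becomes steeper as N grows, which is the threshold behavior described in the text.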
Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained to learn the m examples of the training set?
To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single test pattern. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5, assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2.
<<FIGURE>>
FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible.
This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the type of classifier. It equals N for the perceptron. Vapnik and Chervonenkis were able to show that for any training set of size m larger than the VC dimension D_VC, the growth of the number of realizable mappings is bounded by an expression which grows much more slowly than 2^m (in fact, only like a polynomial in m).
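The polynomial growth mentioned here is usually stated as the Sauer-Shelah lemma. The article does not spell the bound out, so the form below is supplied as the standard statement rather than as the author's own expression; \Delta(m) is a symbol introduced here for the largest number of mappings the classifier can realize on any set of m inputs.

```latex
\Delta(m) \;\le\; \sum_{k=0}^{D_{VC}} \binom{m}{k}
\;\le\; \left( \frac{e\,m}{D_{VC}} \right)^{D_{VC}},
\qquad m \ge D_{VC} \ge 1 .
```

For m \le D_{VC} the definition of the VC dimension gives \Delta(m) = 2^m, so the count of realizable mappings changes from exponential to at most polynomial growth once m exceeds D_{VC}.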
They proved that a large difference between training error (i.e., the minimum percentage of errors that is made on the training set) and generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above D_VC. This theorem implies that perfect learning of the training set results in a small expected generalization error. The expected generalization error is bounded by a quantity which increases proportionally to D_VC and decreases (neglecting logarithmic corrections in m) inversely proportionally to m. Conversely, one can construct a worst-case distribution of input patterns for which a training set size larger than D_VC is also necessary for good generalization.
The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data.
The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error, as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates because of the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which are obtained for an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules.
<<FIGURE>>
FIGURE 6 As the complexity of the network varies (i.e., the number of hidden units, as shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik and Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability.
Typical Scenario: The Approach of Statistical Physics
When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation? As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher-student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern is between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together and decreases to zero when both coincide.
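The geometric statement can be checked numerically. The sketch below assumes inputs whose directions are uniformly distributed (isotropic Gaussian inputs are used as a stand-in), in which case the proportionality constant follows from the picture of Fig. 7: the disagreement region covers a fraction angle/pi of the circle. The function names and example vectors are hypothetical.

```python
import math
import random


def generalization_error(student, teacher):
    """Error of a student perceptron: angle between ST and TE divided by pi."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    overlap = max(-1.0, min(1.0, dot / (norm_s * norm_t)))  # guard against rounding
    return math.acos(overlap) / math.pi


def monte_carlo_error(student, teacher, trials=20000, seed=0):
    """Fraction of random isotropic inputs on which student and teacher disagree."""
    rng = random.Random(seed)
    n = len(student)
    wrong = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s_out = sum(s * xi for s, xi in zip(student, x)) >= 0
        t_out = sum(t * xi for t, xi in zip(teacher, x)) >= 0
        wrong += s_out != t_out
    return wrong / trials


student = [1.0, 0.0, 0.0]   # hypothetical coupling vectors
teacher = [1.0, 1.0, 0.0]   # 45 degrees away from the student
print(generalization_error(student, teacher))
print(monte_carlo_error(student, teacher))
```

The closed-form value and the sampled disagreement rate should agree up to sampling noise; identical vectors give an error of zero and perpendicular vectors give the random-guessing value 0.5.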
<<FIGURE>>
FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively.
In the limit when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from the teacher, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case.
The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then <<FORMULA>> would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets.
From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of their behavior on the training set, and multiplying it by the probability B that a student with generalization error ε gives the correct answer m times on independent drawings of the input patterns. Since A increases exponentially with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN. The ratio α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N → ∞, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms. The first one (which is often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε = 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig. 8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of <<FORMULA>> at some value of ε which by definition is the typical generalization error.
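The competition between the two terms can be made concrete with a small numerical sketch. The article leaves the expressions as placeholders, so the forms used here are assumptions: for a perceptron with continuous couplings of fixed length and isotropic inputs, the averaged (annealed) calculation is commonly written with the entropic term ln sin(pi ε) and the energetic term α ln(1 - ε). The averaged volume is only the qualitative approximation described above, not the proper treatment of the randomness. Function names and the grid search are illustrative.

```python
import math


def entropic_term(eps):
    """Assumed entropic term: log-volume per coupling of students with error eps."""
    return math.log(math.sin(math.pi * eps))


def energetic_term(eps, alpha):
    """Assumed energetic term: log-probability per coupling of m = alpha*N correct answers."""
    return alpha * math.log(1.0 - eps)


def typical_error(alpha, grid=20000):
    """Value of eps in (0, 0.5] that maximizes the sum of both terms (grid search)."""
    best_eps, best_val = 0.5, -math.inf
    for k in range(1, grid + 1):
        eps = 0.5 * k / grid
        val = entropic_term(eps) + energetic_term(eps, alpha)
        if val > best_val:
            best_eps, best_val = eps, val
    return best_eps


for alpha in (0.0, 1.0, 2.0, 5.0, 20.0):
    print(alpha, round(typical_error(alpha), 4))
```

Without examples (α = 0) only the entropic term acts and the maximum sits at the random-guessing value 0.5; as α grows the energetic term pulls the maximum toward small errors.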
<<FIGURE>>
FIGURE 8 Logarithm of the average volume of students that have learned m examples and give generalization error ε (green curve). The blue and red curves represent the energetic and entropic contributions, respectively.
The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter term dominates when the number of examples is large. It will later be shown that such a competition can lead to a rich and interesting behavior as the number of examples is varied. The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al., 1990) of a perceptron obtained by the statistical physics approach (treating the random sampling the proper way) is shown by the red curve of Fig. 9. In contrast to the worst-case predictions of the VC theory, it is possible to have some generalization ability below VC dimension or capacity. As we might have expected, the generalization error decreases monotonically, showing that the more that is learned, the more that is understood. Asymptotically, the error is proportional to N and inversely proportional to m, in agreement with the VC predictions. This may not be true for more complicated networks.
<<FIGURE>>
FIGURE 9 Learning curves for typical student perceptrons. α = m/N is the ratio between the number of examples and the number of couplings.
Query Learning
Soon after Gardner's pioneering work, it was realized that the approach of statistical physics is closely related to ideas in information theory and Bayesian statistics (Levin et al., 1989; Györgyi and Tishby, 1990; Opper and Haussler, 1991), for which the reduction of an initial uncertainty about the true state of a system (teacher) by observing data is a central topic of interest. The logarithm of the volume of relevant microstates as defined in the previous section is a direct measure for such uncertainty. The moderate progress in generalization ability displayed by the red learning curve of Fig. 9 can be understood by the fact that as learning progresses, less information about the teacher is gained from a new random example. Here, the information gain is defined as the reduction of the uncertainty when a new example is learned. The decrease in information gain is due to the increase in the generalization performance. This is plausible because inputs for which the majority of student networks give the correct answer are less informative than those for which a mistake is more likely. The situation changes if the student is free to ask the teacher questions, i.e., if the student can choose highly informative input patterns. For the simple perceptron a fruitful query strategy is to select a new input vector which is perpendicular to the current coupling vector of the student (Kinzel and Ruján, 1990). Such an input is a highly ambiguous pattern because small changes in the student couplings produce different classification answers. For more complicated networks it may be difficult to obtain similar ambiguous inputs by an explicit construction. A general algorithm has been proposed (Seung et al., 1992a) which uses the principle of maximal disagreement in a committee of several students as a selection process for training patterns. Using an appropriate randomized training strategy, different students are generated which all learn the same set of examples. Next, any new input vector is only accepted for training when the disagreement of its classification between the students is maximal. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast.
Bad Students and Good Students
Although the typical student perceptron has a smooth, monotonically decreasing learning curve, the possibility that some concrete learning algorithm may result in a set of student couplings which are untypical in the sense of our theory cannot be ruled out. For bad students, even nonmonotonic generalization behavior is possible. The problem of a concrete learning algorithm can be made to fit into the statistical physics framework if the algorithm minimizes a certain cost function. Treating the achieved values of the new cost function as a macroscopic constraint, the tools of statistical physics apply again.
As an example, it is convenient to consider a case in which the teacher and the student have different architectures. In one of the simplest examples one tries to learn a classification problem by interpreting it as a regression problem, i.e., a problem of fitting a continuous function through data points. To be specific, we study the situation in which the teacher network is still given by a perceptron which computes binary valued outputs of the form y = sign(Σ_i w_i x_i) = ±1, but as the student we choose a network with a linear transfer function (the yellow curve in Fig. 1a)
<<FORMULA>>
and try to fit this linear expression to the binary labels of the teacher. If the number of couplings is sufficiently large (larger than the number of examples) the linear function (unlike the sign) is perfectly able to fit arbitrary continuous output values. This linear fit is an attempt to explain the data in a more complicated way than necessary, and the couplings have to be finely tuned in order to achieve this goal. We find that the student trained in such a way does not generalize well (Opper and Kinzel, 1995). In order to compare the classifications of teacher and student on a new random input after training, we finally convert the student's output into a classification label by taking the sign of its output. As shown by the red curve of Fig. 10, after an initial improvement of performance the generalization error increases again to the random guessing value ε = 0.5 at α = 1. This phenomenon is called overfitting. For α > 1 (i.e., for more data than parameters), it is no longer possible to have a perfect linear fit through the data, but a fit with a minimal deviation from a linear function leads to the second part of the learning curve: ε decreases again and approaches 0 asymptotically for α → ∞. This shows that when enough data are available, the details of the training algorithm are less important.
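The overfitting peak of the linear student can be reproduced numerically. The following is a minimal sketch, not the statistical physics calculation of Opper and Kinzel (1995). It assumes Gaussian random inputs and a Gaussian random teacher, and it uses NumPy's least-squares routine as the linear fit (this returns the minimum-norm solution when there are fewer examples than couplings). The function names and the sizes n and trials are illustrative choices. The generalization error of the sign-converted student is computed from the angle between teacher and student coupling vectors, which for isotropic inputs equals angle/π.

```python
import numpy as np


def generalization_error(w_student, w_teacher):
    """Probability that the two sign outputs disagree on a random isotropic input.

    For isotropic inputs this equals (angle between the coupling vectors) / pi.
    """
    cos = w_student @ w_teacher / (
        np.linalg.norm(w_student) * np.linalg.norm(w_teacher)
    )
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)


def linear_student_error(alpha, n=150, trials=10, seed=0):
    """Average generalization error of a linear student fitted to binary labels.

    alpha is the number of examples per coupling (m / n); n, trials, and seed
    are illustrative assumptions, not values from the text.
    """
    rng = np.random.default_rng(seed)
    m = max(1, int(round(alpha * n)))
    errors = []
    for _ in range(trials):
        teacher = rng.standard_normal(n)
        x = rng.standard_normal((m, n))
        y = np.sign(x @ teacher)  # binary labels from the perceptron teacher
        # Least-squares linear fit; minimum-norm interpolation when m < n.
        student, *_ = np.linalg.lstsq(x, y, rcond=None)
        errors.append(generalization_error(student, teacher))
    return float(np.mean(errors))


if __name__ == "__main__":
    for alpha in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(alpha, round(linear_student_error(alpha), 3))
```

Running the loop at a few values of alpha should show the error first falling, rising again near alpha = 1, and then decaying for larger alpha, although at finite n the peak is rounded rather than reaching exactly 0.5.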
The dependence of the generalization performance on the complexity of the assumed data model is well-known. If a function class is used that is too complex, data values can be perfectly fitted but the predicted function will be very sensitive to the variations of the data sample, leading to very unreliable predictions on novel inputs. On the other hand, functions that are too simple make the best fit almost insensitive to the data, which prevents us from learning enough from them.
It is also possible to calculate the worst-case generalization ability of perceptron students learning from a perceptron teacher. The largest generalization error is obtained (Fig. 7) when the angle between the coupling vectors of teacher and student is maximized under the constraint that the student learns all examples perfectly. Although it may not be easy to construct a learning algorithm which performs such a maximization in practice, the resulting generalization error can be calculated using the statistical physics approach (Engel and Van den Broeck, 1993). The result is in agreement with the VC theory: There is no prediction better than random guessing below the capacity.
Although the previous algorithms led to a behavior which is worse than the typical one, we now examine the opposite case of an algorithm which does better. Since the generalization ability of a neural network is related to the fact that similar input vectors are mapped onto the same output, one can assume that such a property can be enhanced if the separating gap between the two classes is maximized, which defines a new cost function for an algorithm. This optimal margin perceptron can be practically realized and when applied to a set of data leads to the projection of Fig. 11. As a remarkable result, it can be seen that there is a relatively large fraction of patterns which are located at the gap. These points are called support vectors (SVs). In order to understand their importance for the generalization ability, we make the following gedankenexperiment and assume that all the points which lie outside the gap (the nonsupport vectors) are eliminated from the training set of examples.
From the two-dimensional projection of Fig. 11, we may conjecture that by running the maximal margin algorithm on the remaining examples (the SVs) we cannot create a larger gap between the points. Hence, the algorithm will converge to the same separating hyperplane as before. This intuitive picture is actually correct. If the SVs of a training set were known beforehand (unfortunately, they are only identified after running the algorithm), the margin classifier would have to be trained only on the SVs. It would automatically classify the rest of the training inputs correctly.
FIGURE 11 Learning with a margin classifier and m = 300 examples in an N = 150-dimensional space.
Bias/Variance trade-off
Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability.
The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinear separable rules can be learned, providing an interesting alternative to multilayer networks.
The Ising Perceptron
The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and -1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.
To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c = 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression of the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.
Learning with Errors
The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990).
A stochastic training algorithm can be realized by the Metropolis Monte Carlo method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures (T) we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples.
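The acceptance rule described above can be sketched for a student with binary couplings. This is an illustration under stated assumptions rather than the procedure analyzed in the cited work: the energy is taken to be the number of misclassified training examples, a move flips a single randomly chosen coupling, uphill moves are accepted with the standard Metropolis probability exp(-ΔE/T), and the inputs are Gaussian. The names training_errors and metropolis_ising are hypothetical.

```python
import numpy as np


def training_errors(w, x, y):
    """Number of training examples misclassified by the perceptron with couplings w."""
    return int(np.sum(np.sign(x @ w) != y))


def metropolis_ising(x, y, temperature, sweeps, rng):
    """Metropolis training of a perceptron with couplings restricted to +1/-1.

    Assumptions (not from the text): energy = training error count,
    single-coupling flips as moves, acceptance probability exp(-delta / T).
    """
    m, n = x.shape
    w = rng.choice([-1, 1], size=n)   # random initial student
    fields = x @ w                    # local fields for all examples
    energy = int(np.sum(np.sign(fields) != y))
    for _ in range(sweeps * n):
        j = rng.integers(n)
        # Flipping w[j] changes each field by -2 * w[j] * x[:, j].
        new_fields = fields - 2 * w[j] * x[:, j]
        new_energy = int(np.sum(np.sign(new_fields) != y))
        delta = new_energy - energy
        # Downhill and neutral moves are always accepted; uphill moves
        # are accepted with a probability that grows with temperature.
        if delta <= 0 or rng.random() < np.exp(-delta / temperature):
            w[j] = -w[j]
            fields, energy = new_fields, new_energy
    return w, energy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 51, 40
    teacher = rng.choice([-1, 1], size=n)
    x = rng.standard_normal((m, n))
    y = np.sign(x @ teacher)
    w, energy = metropolis_ising(x, y, temperature=0.3, sweeps=20, rng=rng)
    print("remaining training errors:", energy)
```

Raising the temperature makes uphill moves more likely, which helps the search leave local minima but leaves a larger residual training error.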
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state
<<FIGURE>>
FIGURE 12 Logarithm of the number of relevant Ising students for different values of α.
which has no resemblance to the (unknown) teacher (i.e., with ε = 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will always converge to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.
More Sophisticated Computations Are Needed for Multilayer Networks
As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.
Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units; that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs; that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction.
Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one
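The two fixed output functions can be written down directly. The sketch below assumes a tree architecture in which the couplings and the input are stored as arrays of shape (K, N/K), one row per hidden unit, and that each hidden unit is a sign perceptron on its own block of inputs; the function names are illustrative. For the committee machine an odd number of hidden units K is assumed so that the majority vote cannot tie.

```python
import numpy as np


def hidden_outputs(w, x):
    """Outputs (+1/-1) of the hidden units of a tree network.

    w and x have shape (K, N // K): row k holds the couplings and the
    inputs seen by hidden unit k only (non-overlapping receptive fields).
    """
    return np.sign(np.sum(w * x, axis=1))


def committee_output(w, x):
    """Majority vote of the hidden units (K assumed odd to avoid ties)."""
    return int(np.sign(np.sum(hidden_outputs(w, x))))


def parity_output(w, x):
    """Product of the hidden outputs: -1 for an odd number of negative units."""
    return int(np.prod(hidden_outputs(w, x)))


if __name__ == "__main__":
    w = np.ones((3, 2))
    x = np.array([[1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]])
    # Hidden outputs are (+1, -1, -1): the majority is negative,
    # and the number of negative units is even.
    print(committee_output(w, x), parity_output(w, x))
```

Because the parity output is a product of signs, reversing the sign of all couplings leaves it unchanged whenever K is even, whereas the committee output is reversed.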
<<TABLE>>
complexity parameter. In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.
Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.
Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to -1 acts as exactly the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.
Outlook
The worst-case approach of the VC theory and the typical case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples.
Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.
The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research.
Acknowledgments
I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.
References Cited
AMARI, S., and MURATA, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
BARKAI, E., HANSEL, D., and KANTER, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
CARNEVALI, P., and PATARNELLO, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
ENGEL, A., and VAN DEN BROECK, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
GARDNER, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
GARDNER, E., and DERRIDA, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
GYÖRGYI, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
GYÖRGYI, G., and TISHBY, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
HANSEL, D., MATO, G., and MEUNIER, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
KINZEL, W., and RUJÁN, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
LEVIN, E., TISHBY, N., and SOLLA, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
MÉZARD, M., PARISI, G., and VIRASORO, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
MONASSON, R., and ZECCHINA, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
OPPER, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
OPPER, M., and HAUSSLER, D. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
OPPER, M., and KINZEL, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
SAAD, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
SCHWARZE, H., and HERTZ, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
SCHWARZE, H., and HERTZ, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
SEUNG, H. S., SOMPOLINSKY, H., and TISHBY, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
SEUNG, H. S., OPPER, M., and SOMPOLINSKY, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
SOMPOLINSKY, H., TISHBY, N., and SEUNG, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
URBANCZIK, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
VALLET, F., CAILTON, J., and REFREGIER, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
VAPNIK, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
VAPNIK, V. N., and CHERVONENKIS, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.
General References
ARBIB, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
HERTZ, J. A., KROGH, A., and PALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
WATKIN, T. L. H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.
<<END>> <<END>> <<END>>