
   Neural Ordinary Differential Equations

Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud
      University of Toronto, Vector Institute
   {rtqichen, rubanova, jessebett, duvenaud}@cs.toronto.edu

                                   Abstract
We introduce a new family of deep neural network models. Instead of specifying a
discrete sequence of hidden layers, we parameterize the derivative of the hidden
state using a neural network. The output of the network is computed using a black-
box differential equation solver. These continuous-depth models have constant
memory cost, adapt their evaluation strategy to each input, and can explicitly trade
numerical precision for speed. We demonstrate these properties in continuous-depth
residual networks and continuous-time latent variable models. We also construct
continuous normalizing flows, a generative model that can train by maximum
likelihood, without partitioning or ordering the data dimensions. For training, we
show how to scalably backpropagate through any ODE solver, without access to its
internal operations. This allows end-to-end training of ODEs within larger models.

                               1   Introduction
                                                                                                         
Models such as residual networks, recurrent neural network decoders, and normalizing flows build 
complicated transformations by composing a sequence of transformations to a hidden state:    

                          <<FORMULA>>           (1)          
                                                                        
where t ∈ {0 . . . T} and h_t ∈ R^D. These iterative updates can be seen as an Euler discretization of a
continuous transformation (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018).                    
What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous     
dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:

                          <<FORMULA>>           (2)

Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this
ODE initial value problem at some time T. This value can be computed by a black-box differential
equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the
solution with the desired accuracy. Figure 1 contrasts these two approaches.
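
As a rough illustration of the Euler view described above, the following sketch treats a residual update h ← h + Δt·f(h) as one Euler step of dh/dt = f(h) and increases the number of "layers" over a fixed time span. The dynamics f(h) = −h, the layer counts, and the function names are assumptions made for the example, chosen because the exact solution h(T) = h(0)·exp(−T) is known.

```python
import math

def f(h):
    # Hypothetical hidden-state dynamics, chosen so the exact solution is known.
    return -h

def residual_stack(h0, T, num_layers):
    """Apply num_layers residual updates h <- h + dt * f(h), i.e. Euler steps of size dt."""
    dt = T / num_layers
    h = h0
    for _ in range(num_layers):
        h = h + dt * f(h)
    return h

h0, T = 1.0, 1.0
exact = h0 * math.exp(-T)
# Error of the discrete stack against the continuous-depth limit.
errors = [abs(residual_stack(h0, T, n) - exact) for n in (4, 16, 64, 256)]
```

Adding layers while shrinking the step drives the discrete stack towards the ODE solution, which is the limit the continuous model works with directly.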
Defining and evaluating models using ODE solvers has several benefits:
Memory efficiency In Section 2, we show how to compute gradients of a scalar-valued loss with
respect to all inputs of any ODE solver, without backpropagating through the operations of the solver.
Not storing any intermediate quantities of the forward pass allows us to train our models with constant
memory cost as a function of depth, a major bottleneck of training deep models.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Adaptive computation Euler’s method is perhaps the simplest method for solving ODEs. There
have since been more than 120 years of development of efficient and accurate ODE solvers (Runge,
1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth
of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to
achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with
problem complexity. After training, accuracy can be reduced for real-time or low-power applications.
Scalable and invertible normalizing flows An unexpected side-benefit of continuous transforma-
tions is that the change of variables formula becomes easier to compute. In Section 4, we derive
this result and use it to construct a new class of invertible density models that avoids the single-unit
bottleneck of normalizing flows, and can be trained directly by maximum likelihood.
Continuous time-series models Unlike recurrent neural networks, which require discretizing
observation and emission intervals, continuously-defined dynamics can naturally incorporate data
which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.
2    Reverse-mode automatic differentiation of ODE solutions
The main technical difficulty in training continuous-depth networks is performing reverse-mode
differentiation (also known as backpropagation) through the ODE solver. Differentiating through
the operations of the forward pass is straightforward, but incurs a high memory cost and introduces
additional numerical error.
We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity
method (Pontryagin et al., 1962). This approach computes gradients by solving a second, aug-
mented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly
with problem size, has low memory cost, and explicitly controls numerical error.
Consider optimizing a scalar-valued loss function L(), whose input is the result of an ODE solver:
                                      
                                    <<FORMULA>>       (3)               

To optimize L, we require gradients with respect to θ. The first step is to determine how the gradient
of the loss depends on the hidden state z(t) at each instant. This quantity is called the adjoint a(t) = ∂L/∂z(t).
Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:

                                    <<FORMULA>>       (4)

We can compute ∂L/∂z(t0) by another call to an ODE solver. This solver must run backwards, starting from the initial
value of ∂L/∂z(t1). One complication is that solving this ODE requires knowing the value of z(t) along its entire
trajectory. However, we can simply recompute z(t) backwards in time together with the adjoint, starting from its final
value z(t1).

If the loss depends directly on the state at multiple observation times, the adjoint state must be
updated in the direction of the partial derivative of the loss with respect to each observation.

Computing the gradients with respect to the parameters θ requires evaluating a third integral,
which depends on both z(t) and a(t):

                                  <<FORMULA>>         (5)

The vector-Jacobian products <<FORMULA>> and <<FORMULA>> in (4) and (5) can be efficiently evaluated by
automatic differentiation, at a time cost similar to that of evaluating f. All integrals for solving z, a,
and <<FORMULA>> can be computed in a single call to an ODE solver, which concatenates the original state, the
adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the
necessary dynamics, and call an ODE solver to compute all gradients at once.

                                 <<ALGORITHM>>
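
The single augmented backward solve can be sketched on a scalar problem where the answer is known in closed form. The sketch assumes the standard adjoint relations, da/dt = −a·∂f/∂z and a parameter gradient accumulated as dg/dt = −a·∂f/∂θ while integrating from t1 back to t0. The dynamics f(z, θ) = θz, the loss L = ½·z(t1)², the fixed-step RK4 solver, and all names are illustrative choices, not the paper's implementation.

```python
import math

def rk4_step(func, t, y, dt):
    """One classical Runge-Kutta step for y' = func(t, y), with y a list of floats."""
    k1 = func(t, y)
    k2 = func(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = func(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = func(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def odesolve(func, y0, t0, t1, steps=200):
    """Fixed-step solver; integrates forwards (t1 > t0) or backwards (t1 < t0)."""
    dt = (t1 - t0) / steps
    t, y = t0, list(y0)
    for _ in range(steps):
        y = rk4_step(func, t, y, dt)
        t += dt
    return y

theta, z0, t0, t1 = 0.7, 1.5, 0.0, 1.0

# Forward pass: dz/dt = theta * z, loss L = 0.5 * z(t1)^2.
(z1,) = odesolve(lambda t, y: [theta * y[0]], [z0], t0, t1)

def augmented_dynamics(t, state):
    z, a, _ = state
    df_dz = theta       # partial derivative of f with respect to z
    df_dtheta = z       # partial derivative of f with respect to theta
    return [theta * z, -a * df_dz, -a * df_dtheta]

# Backward pass: start from [z(t1), dL/dz(t1), 0] and integrate from t1 to t0.
_, dL_dz0, dL_dtheta = odesolve(augmented_dynamics, [z1, z1, 0.0], t1, t0)
```

For this problem z(t1) = z0·exp(θ(t1 − t0)), so the results can be compared with dL/dz0 = z0·exp(2θ(t1 − t0)) and dL/dθ = z0²·(t1 − t0)·exp(2θ(t1 − t0)). No intermediate forward states are stored: z is recomputed backwards alongside the adjoint.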

Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends
on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate
solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint
must be adjusted in the direction of the corresponding partial derivative ∂L/∂z(ti ).
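
A minimal sketch of this segment-by-segment procedure follows, assuming scalar dynamics dz/dt = θz and a loss L = Σ_i z(t_i), so that each ∂L/∂z(t_i) equals 1. Because ∂f/∂z = θ does not depend on z here, the state does not need to be recomputed during the backward solves; the observation times and helper names are hypothetical.

```python
import math

def solve_adjoint_segment(a, theta, t_from, t_to, steps=100):
    """Integrate da/dt = -a * (df/dz), with df/dz = theta, from t_from to t_to using RK4."""
    dt = (t_to - t_from) / steps
    rhs = lambda a_: -a_ * theta
    for _ in range(steps):
        k1 = rhs(a)
        k2 = rhs(a + dt / 2 * k1)
        k3 = rhs(a + dt / 2 * k2)
        k4 = rhs(a + dt * k3)
        a += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return a

theta, z0 = 0.3, 2.0
obs_times = [0.5, 1.2, 2.0]             # hypothetical observation times, with t0 = 0
dL_dz_obs = [1.0 for _ in obs_times]    # dL/dz(t_i) for L = sum_i z(t_i)

a = 0.0
times = [0.0] + obs_times
for i in range(len(obs_times), 0, -1):
    a += dL_dz_obs[i - 1]                                        # adjust adjoint at t_i
    a = solve_adjoint_segment(a, theta, times[i], times[i - 1])  # solve back to t_{i-1}
dL_dz0 = a
```

Since z(t_i) = z0·exp(θ·t_i), the result should match Σ_i exp(θ·t_i).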
The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of
Algorithm 1 including derivatives w.r.t. t0 and t1 can be found in Appendix C. Detailed derivations
are provided in Appendix B. Appendix D provides Python code which computes all derivatives for
scipy.integrate.odeint by extending the autograd automatic differentiation package. This
code also supports all higher-order derivatives. We have since released a PyTorch (Paszke et al.,
2017) implementation, including GPU-based implementations of several standard ODE solvers at
github.com/rtqichen/torchdiffeq.

                   3    Replacing residual networks with ODEs for supervised learning

In this section, we experimentally investigate the training of neural ODEs for supervised learning.
Software To solve ODE initial value problems numerically, we use the implicit Adams method
implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being
an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires
solving a nonlinear optimization problem at every step. This setup makes direct backpropagation
through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd
framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden
state dynamics and their derivatives on the GPU using TensorFlow, which were then called from the
Fortran ODE solvers, which were called from Python autograd code.

Model Architectures We experiment with a small residual network which downsamples the
input twice then applies 6 standard residual blocks (He et al., 2016b), which are replaced by an ODESolve
module in the ODE-Net variant. We also test a network with the same architecture but where gradients are
backpropagated directly through a Runge-Kutta integrator, referred to as RK-Net. Table 1 shows test error,
number of parameters, and memory cost. L denotes the number of layers in the ResNet, and L̃ is the number
of function evaluations that the ODE solver requests in a single forward pass, which can be interpreted
as an implicit number of layers. We find that ODE-Nets and RK-Nets can achieve around the same
performance as the ResNet.
Error Control in ODE-Nets ODE solvers can approximately ensure that the output is within a
given tolerance of the true solution. Changing this tolerance changes the behavior of the network.
We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is
proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a
trade-off between accuracy and computational cost. One could train with high accuracy, but switch to
a lower accuracy at test time.
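
The tolerance/cost trade-off can be illustrated with a toy adaptive solver. The sketch below is not the implicit Adams method used in the experiments; it is a simple embedded Euler/Heun pair with a textbook step-size rule, applied to a hypothetical scalar problem y' = y·cos(t) whose exact solution exp(sin(t)) is known.

```python
import math

def adaptive_heun(f, y0, t0, t1, tol):
    """Adaptive Euler/Heun pair for scalar y' = f(t, y).

    Returns the estimate of y(t1) and the number of function evaluations (NFE).
    """
    t, y, dt, nfe = t0, y0, (t1 - t0) / 10, 0
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)
        k1 = f(t, y)
        k2 = f(t + dt, y + dt * k1)
        nfe += 2
        err = abs(dt * (k2 - k1) / 2)      # |Heun - Euler|, a local error estimate
        if err <= tol:                     # accept the step, keeping the Heun value
            t, y = t + dt, y + dt * (k1 + k2) / 2
        # Grow or shrink the step from the error estimate, clipped to [0.2, 5].
        scale = 0.9 * math.sqrt(tol / err) if err > 0 else 5.0
        dt *= min(5.0, max(0.2, scale))
    return y, nfe

dynamics = lambda t, y: y * math.cos(t)    # exact solution: y(t) = exp(sin(t))
exact = math.exp(math.sin(5.0))
results = {}
for tol in (1e-2, 1e-4, 1e-6):
    y, nfe = adaptive_heun(dynamics, 1.0, 0.0, 5.0, tol)
    results[tol] = (abs(y - exact), nfe)   # (final error, NFE) per tolerance
```

Tightening the tolerance raises the NFE, and hence the cost of a forward call, in exchange for a more accurate output. This is the knob that could be loosened at test time.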
         Figure 3: Statistics of a trained ODE-Net. (NFE = number of function evaluations.)
Figure 3c shows a surprising result: the number of evaluations in the backward pass is roughly
half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory
efficient, but also more computationally efficient than directly backpropagating through the integrator,
because the latter approach will need to backprop through each function evaluation in the forward
pass.
Network Depth It’s not clear how to define the ‘depth’ of an ODE solution. A related quantity is
the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver
and dependent on the initial state or input. Figure 3d shows that the number of function evaluations
increases throughout training, presumably adapting to increasing complexity of the model.

                     4    Continuous Normalizing Flows

The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and
the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to
compute exact changes in probability if samples are transformed through a bijective function f :

                      <<FORMULA>>                             (6)

An example is the planar normalizing flow (Rezende and Mohamed, 2015):

                     <<FORMULA>>                             (7)

Generally, the main bottleneck to using the change of variables formula is computing the determinant
of the Jacobian ∂f/∂z, which has a cubic cost in either the dimension of z, or the number
of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow
layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).
Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the
computation of the change in normalizing constant:
Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable
with probability p(z(t)) dependent on time. Let dz/dt = f(z(t), t) be a differential equation describing
a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z
and continuous in t, then the change in log probability also follows a differential equation,

                                 <<FORMULA>>                 (8)

Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation.
Also unlike standard finite flows, the differential equation f does not need to be bijective, since if
uniqueness is satisfied, then the entire transformation is automatically bijective.
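
The relationship between the trace and the log determinant can be checked numerically on linear dynamics dz/dt = Az, assuming the theorem's rate of change is ∂ log p(z(t))/∂t = −tr(∂f/∂z). The sketch integrates that rate along the flow and compares it with the log determinant of the flow-map Jacobian, obtained by integrating dJ/dt = AJ from the identity. The matrix A, the time span, and the RK4 helper are illustrative assumptions.

```python
import numpy as np

A = np.array([[0.3, -1.0],
              [1.0, -0.5]])       # hypothetical linear dynamics dz/dt = A z
T, steps = 2.0, 2000
dt = T / steps

def rk4(rhs, y, dt):
    """One classical Runge-Kutta step for an autonomous system y' = rhs(y)."""
    k1 = rhs(y)
    k2 = rhs(y + dt / 2 * k1)
    k3 = rhs(y + dt / 2 * k2)
    k4 = rhs(y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

delta_logp = 0.0      # continuous route: integrate -tr(df/dz) = -tr(A)
J = np.eye(2)         # discrete route: Jacobian of the map z(0) -> z(T)
for _ in range(steps):
    delta_logp = rk4(lambda lp: -np.trace(A), delta_logp, dt)
    J = rk4(lambda M: A @ M, J, dt)

_, logabsdet = np.linalg.slogdet(J)
```

The finite change of variables gives log p(z(T)) − log p(z(0)) = −log|det J|, so `delta_logp` and `-logabsdet` should agree up to solver error. Only the first route avoids forming a determinant.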
As an example application of the instantaneous change of variables, we can examine the continuous
analog of the planar flow, and its change in normalization constant:

                               <<FORMULA>>                            (9)

Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving
this combined ODE.
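
A sketch of such a combined solve is given below, assuming planar dynamics of the form dz/dt = u·h(wᵀz + b) with h = tanh, for which −tr(∂f/∂z) = −uᵀ∂h/∂z = −(u·w)(1 − tanh²(wᵀz + b)). The parameter values, the standard normal base density, and the fixed-step RK4 integrator are assumptions for illustration. The result is checked against the finite change of variables using a finite-difference Jacobian of the flow map.

```python
import numpy as np

u = np.array([0.8, -0.4])     # hypothetical planar-flow parameters
w = np.array([1.0, 0.5])
b = 0.1

def combined_dynamics(state):
    """state = [z_1, z_2, log p]; returns d(state)/dt for a tanh planar flow."""
    z = state[:2]
    act = np.tanh(w @ z + b)
    dz = u * act                              # dz/dt = u * h(w^T z + b)
    dlogp = -(u @ w) * (1.0 - act ** 2)       # -u^T dh/dz = -tr(df/dz)
    return np.concatenate([dz, [dlogp]])

def integrate(state, T=1.0, steps=200):
    """Fixed-step RK4 over [0, T] for the combined state."""
    dt = T / steps
    for _ in range(steps):
        k1 = combined_dynamics(state)
        k2 = combined_dynamics(state + dt / 2 * k1)
        k3 = combined_dynamics(state + dt / 2 * k2)
        k4 = combined_dynamics(state + dt * k3)
        state = state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

z0 = np.array([0.3, -0.7])
logp0 = -0.5 * z0 @ z0 - np.log(2 * np.pi)    # standard normal density in 2-D
final = integrate(np.concatenate([z0, [logp0]]))
zT, logpT = final[:2], final[2]

# Check: log p(z(T)) = log p(z(0)) - log|det dz(T)/dz(0)|, Jacobian by central differences.
eps = 1e-5
flow = lambda z: integrate(np.concatenate([z, [0.0]]))[:2]
J = np.column_stack([(flow(z0 + eps * e) - flow(z0 - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
logp_check = logp0 - np.linalg.slogdet(J)[1]
```

The sample z(T) and its log density come out of the same solve, and `logpT` should agree with `logp_check` up to discretization error.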
Using multiple hidden units with linear cost While det is not a linear function, the trace function
is, which implies tr(Σ_n J_n) = Σ_n tr(J_n). Thus if our dynamics is given by a sum of functions then
the differential equation for the log density is also a sum:


                              <<FORMULA>>                              (10)
 
This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in
the number of hidden units M . Evaluating such ‘wide’ flow layers using standard normalizing flows
costs O(M^3), meaning that standard NF architectures use many layers of only a single hidden unit.
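
The linearity of the trace can be checked directly. The sketch below assumes dynamics of the form f(z) = Σ_n u_n·tanh(w_nᵀz + b_n), so that each unit contributes a rank-one Jacobian J_n = (1 − tanh²(w_nᵀz + b_n))·u_n w_nᵀ. The sizes and random parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 64                       # state dimension and number of hidden units
U = rng.normal(size=(M, D))        # u_n, one row per hidden unit
W = rng.normal(size=(M, D))        # w_n, one row per hidden unit
b = rng.normal(size=M)
z = rng.normal(size=D)

slope = 1.0 - np.tanh(W @ z + b) ** 2           # tanh'(w_n^T z + b_n), shape (M,)

# Cost linear in M: sum of per-unit traces, tr(J_n) = slope_n * (u_n . w_n).
trace_per_unit = np.sum(slope * np.sum(U * W, axis=1))

# Reference: build the full D x D Jacobian, sum_n slope_n * outer(u_n, w_n), then take its trace.
full_jacobian = (U * slope[:, None]).T @ W
trace_full = np.trace(full_jacobian)
```

Both routes give the same number, but the first never forms a matrix, which is what keeps the log-density update linear in M.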
Time-dependent dynamics We can specify the parameters of a flow as a function of t, making the
differential equation f (z(t), t) change with t. This parameterization is a kind of hypernetwork
(Ha et al., 2016). We also introduce a gating mechanism for each hidden unit, 

                              <<FORMULA>>

where σn (t) ∈ (0, 1) is a neural network that learns when the dynamic fn (z) should be applied. We
call these models continuous normalizing flows (CNF).

4.1    Experiments with Continuous Normalizing Flows

We first compare continuous and discrete planar flows at learning to sample from a known distribution.
We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with
K = M layers, and sometimes much more expressive.
Density matching We configure the CNF as described above, and train for 10,000 iterations
using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using
RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we
minimize KL(q(x) ‖ p(x)) as the loss function where q is the flow model and the target density p(·)
can be evaluated. Figure 4 shows that CNF generally achieves lower loss.
Maximum Likelihood Training A useful property of continuous-time normalizing flows is that
we can compute the reverse transformation for about the same cost as the forward pass, which cannot
be said for normalizing flows. This lets us train the flow on a density estimation task by performing
maximum likelihood estimation, which maximizes E_{p(x)}[log q(x)] where q(·) is computed using
the appropriate change of variables theorem, then afterwards reverse the CNF to generate random
samples from q(x).
For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 5
shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the
transformed distribution after a small amount of time which shows the locations of the initial planar
flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that
the particles can be evenly spread into circles. While the CNF transformations are smooth and
interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting
the two moons dataset in Figure 5b.

               5  A generative latent function time-series model

Applying neural networks to irregularly-sampled data such as medical records, network traffic, or
neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the
latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-
defined latent variables. Missing data can be addressed using generative time-series models (Álvarez
and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data
imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of
an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).
We present a continuous-time, generative approach to modeling time series. Our model represents
each time series by a latent trajectory. Each trajectory is determined from a local initial state, z_{t_0}, and
a global set of latent dynamics shared across all time series. Given observation times t_0, t_1, . . . , t_N
and an initial state z_{t_0}, an ODE solver produces z_{t_1}, . . . , z_{t_N}, which describe the latent state at each
observation. We define this generative model formally through a sampling procedure:
                             <<FORMULA>>                                    (11)
                             <<FORMULA>>                                    (12)
                             <<FORMULA>>                                    (13)
Function f is a time-invariant function that takes the value z at the current time step and outputs the
gradient: ∂z(t)/∂t = f(z(t), θ_f). We parametrize this function using a neural net. Because f is time-invariant,
given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating
this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
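
The sampling procedure can be sketched as follows: draw an initial latent state, solve the ODE to each observation time, and decode each latent state with noise. The standard normal prior, the random (untrained) network weights, the linear decoder `W_dec`, the Gaussian observation noise, and the fixed-step RK4 solver are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden, obs_dim = 4, 20, 2                 # illustrative sizes
W1 = rng.normal(size=(hidden, latent_dim)) * 0.2
W2 = rng.normal(size=(latent_dim, hidden)) * 0.2
W_dec = rng.normal(size=(obs_dim, latent_dim))         # hypothetical linear decoder

def f(z):
    """Time-invariant latent dynamics: a one-hidden-layer network."""
    return W2 @ np.tanh(W1 @ z)

def ode_solve(z, t_from, t_to, max_dt=0.01):
    """Fixed-step RK4 from t_from to t_to (in either direction)."""
    steps = max(1, int(np.ceil(abs(t_to - t_from) / max_dt)))
    dt = (t_to - t_from) / steps
    for _ in range(steps):
        k1 = f(z)
        k2 = f(z + dt / 2 * k1)
        k3 = f(z + dt / 2 * k2)
        k4 = f(z + dt * k3)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

def sample_series(times, noise_std=0.1):
    z = rng.normal(size=latent_dim)                    # initial state from the prior
    latents = [z]
    for t_prev, t_next in zip(times[:-1], times[1:]):
        z = ode_solve(z, t_prev, t_next)               # latent state at each observation
        latents.append(z)
    latents = np.stack(latents)
    noise = noise_std * rng.normal(size=(len(times), obs_dim))
    return latents, latents @ W_dec.T + noise

times = np.array([0.0, 0.13, 0.4, 0.45, 1.1, 2.0])     # irregular observation times
latents, observations = sample_series(times)
```

Because f does not depend on t, solving backwards from any later latent state, for example `ode_solve(latents[2], times[2], times[0])`, recovers the earlier part of the same trajectory up to solver error.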
Training and Prediction We can train this latent-variable model as a variational autoen-
coder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our
recognition net is an RNN, which consumes the data sequentially backwards in time, and outputs
q_φ(z_0 | x_1, x_2, . . . , x_N). A detailed algorithm can be found in Appendix E. Using ODEs as a
generative model allows us to make predictions for arbitrary time points t_1, ..., t_M on a continuous
timeline.
Poisson Process likelihoods The fact that an observation oc-
curred often tells us something about the latent state. For ex-
ample, a patient may be more likely to take a medical test if           
they are sick. The rate of events can be parameterized by a
function of the latent state: p(event at time t| z(t)) = λ(z(t)).
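Given this rate function, the likelihood of a set of independent
observation times in the interval [t_start, t_end] is given by an
inhomogeneous Poisson process (Palm, 1943):

\log p(t_1, \ldots, t_N \mid t_{\text{start}}, t_{\text{end}}) = \sum_{i=1}^{N} \log \lambda(z(t_i)) - \int_{t_{\text{start}}}^{t_{\text{end}}} \lambda(z(t)) \, dt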

We can parameterize λ(·) using another neural network. Con-
veniently, we can evaluate both the latent trajectory and the
Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate
learned by such a model on a toy dataset.
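
One way to obtain both quantities from a single pass of a solver is to augment the latent state with the cumulative rate Λ(t), whose derivative is λ(z(t)). The sketch uses the standard inhomogeneous Poisson log-likelihood, Σ_i log λ(z(t_i)) − ∫ λ(z(t)) dt. The dynamics dz/dt = −z, the rate λ(z) = z², the event times, and the RK4 helper are hypothetical, chosen so the result can be verified analytically.

```python
import math

def augmented_rhs(state):
    """state = [z, Lambda]; dz/dt = -z (illustrative), dLambda/dt = lambda(z) = z^2."""
    z, _ = state
    return [-z, z * z]

def solve(state, t_from, t_to, steps=200):
    """Fixed-step RK4 for the augmented state."""
    dt = (t_to - t_from) / steps
    for _ in range(steps):
        k1 = augmented_rhs(state)
        k2 = augmented_rhs([s + dt / 2 * k for s, k in zip(state, k1)])
        k3 = augmented_rhs([s + dt / 2 * k for s, k in zip(state, k2)])
        k4 = augmented_rhs([s + dt * k for s, k in zip(state, k3)])
        state = [s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
    return state

z0, t_start, t_end = 2.0, 0.0, 3.0
event_times = [0.2, 0.9, 1.0, 2.4]             # hypothetical observation times

state, t_prev, sum_log_rate = [z0, 0.0], t_start, 0.0
for t in event_times:
    state = solve(state, t_prev, t)
    sum_log_rate += math.log(state[0] ** 2)    # log lambda(z(t_i)) at each event
    t_prev = t
state = solve(state, t_prev, t_end)            # finish accumulating the integrated rate
log_likelihood = sum_log_rate - state[1]
```

Here z(t) = 2·exp(−t), so the integrated rate is 2·(1 − exp(−6)) and each log-rate is log 4 − 2·t_i, which gives a closed-form value to compare against.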
A Poisson process likelihood on observation
times can be combined with a data likelihood to
jointly model all observations and the times at
which they were made.

5.1   Time-series Latent ODE Experiments 

We investigate the ability of the latent ODE
model to fit and extrapolate time series. The
recognition network is an RNN with 25 hidden
units. We use a 4-dimensional latent space. We
parameterize the dynamics function f with a
one-hidden-layer network with 20 hidden units.
The decoder computing p(x_{t_i} | z_{t_i}) is another
neural network with one hidden layer with 20                       
hidden units. Our baseline was a recurrent neu-                   
ral net with 25 hidden units trained to minimize                  
negative Gaussian log-likelihood. We trained a                     
second version of this RNN whose inputs were
concatenated with the time difference to the next
observation, to help the RNN handle irregular observations.
Bi-directional spiral dataset We generated a dataset of 1000 2-dimensional spirals, each
starting at a different point, sampled at 100 equally-spaced timesteps. The dataset contains
two types of spirals: half are clockwise while the other half counter-clockwise. To make the
task more realistic, we add Gaussian noise to the observations.

[Figure 8 caption, partially recovered] ... neural network. (b): Reconstructions and extrapolations
by a latent neural ODE. Blue curve shows model prediction. Red shows extrapolation. (c) A projection
of inferred 4-dimensional latent ODE trajectories onto their first two dimensions. Color indicates
the direction of the corresponding trajectory. The model has learned latent dynamics which ...

[Figure caption fragment] ... progression through time, starting at purple and ending at red. Note
that the trajectories on the left are counter-clockwise, while the trajectories on the right are clockwise.
Time series with irregular time points To generate irregular timestamps, we randomly sample
points from each trajectory without replacement (n = {30, 50, 100}). We report predictive root-
mean-squared error (RMSE) on 100 time points extending beyond those that were used for training.
Table 2 shows that the latent ODE has substantially lower predictive RMSE.
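
The subsampling and the error metric can be sketched as below. The spiral parametrization and the noise level are placeholders rather than the generator used in the experiments. Only the 100 equally-spaced timesteps, the sampling without replacement with n ∈ {30, 50, 100}, and the use of RMSE come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D spiral: radius grows linearly with angle.
t = np.linspace(0.0, 6 * np.pi, 100)                  # 100 equally-spaced timesteps
clockwise = True
angle = -t if clockwise else t
radius = 0.5 + 0.1 * t
trajectory = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
noisy = trajectory + 0.05 * rng.normal(size=trajectory.shape)   # Gaussian observation noise

def subsample(n):
    """Pick n of the 100 points without replacement, kept in time order."""
    idx = np.sort(rng.choice(len(t), size=n, replace=False))
    return t[idx], noisy[idx]

def rmse(pred, target):
    """Root-mean-squared error over all time points and dimensions."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

subsets = {n: subsample(n) for n in (30, 50, 100)}
```

Each entry of `subsets` holds irregular timestamps and the matching noisy observations; a model's predictions at held-out times would be scored with `rmse`.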

We observed that reconstructions and extrapolations are consistent with the ground truth
regardless of number of observed points and despite the noise.
Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimen-
sions of the latent space. The trajectories form two separate clusters of trajectories, one decoding to
clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change
smoothly as a function of the initial point z(t0 ), switching from a clockwise to a counter-clockwise
spiral.

                        6    Scope and Limitations

Minibatching The use of mini-batches is less straightforward than for standard neural networks.
One can still batch together evaluations through the ODE solver by concatenating the states of each
batch element together, creating a combined ODE with dimension D × K. In some cases, controlling
error on all batch elements together might require evaluating the combined system K times more
often than if each system was solved individually. However, in practice the number of evaluations did
not increase substantially when using minibatches.
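
The concatenation can be sketched as follows: K states of dimension D are flattened into one vector, and the combined dynamics apply the same network to each block. The dynamics, sizes, and fixed-step RK4 solver are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 8, 3                                   # batch size and state dimension
W = rng.normal(size=(D, D)) * 0.5

def f(z):
    """Per-example dynamics on a D-dimensional state."""
    return np.tanh(W @ z)

def f_combined(flat_state):
    """Dynamics of the combined system: apply f to each of the K blocks of size D."""
    batch = flat_state.reshape(K, D)
    return np.tanh(batch @ W.T).reshape(-1)

def rk4_solve(rhs, y, T=1.0, steps=50):
    dt = T / steps
    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs(y + dt / 2 * k1)
        k3 = rhs(y + dt / 2 * k2)
        k4 = rhs(y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

z0 = rng.normal(size=(K, D))
combined = rk4_solve(f_combined, z0.reshape(-1)).reshape(K, D)   # one solve of size D*K
separate = np.stack([rk4_solve(f, z) for z in z0])               # K independent solves
```

With a fixed step size the two routes coincide. With an adaptive solver the combined system shares a single step-size sequence, so the hardest batch element determines how often the whole batch is evaluated.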
Uniqueness When do continuous dynamics have a unique solution? Picard’s existence theo-
rem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is
unique if the differential equation is uniformly Lipschitz continuous in z and continuous in t. This
theorem holds for our model if the neural network has finite weights and uses Lipschitz nonlinearities,
such as tanh or relu.
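
For a concrete sense of the Lipschitz condition, consider a hypothetical dynamics network f(z) = W2·tanh(W1·z). Since tanh has slope at most 1, the product of the spectral norms of W1 and W2 bounds the Lipschitz constant of f. The sketch compares that bound with empirical difference quotients on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(20, 4))
W2 = rng.normal(size=(4, 20))

def f(z):
    """Dynamics network with finite weights and a 1-Lipschitz nonlinearity (tanh)."""
    return W2 @ np.tanh(W1 @ z)

# ||f(x) - f(y)|| <= ||W2||_2 * ||W1||_2 * ||x - y|| because |tanh'| <= 1.
lipschitz_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)

pairs = rng.normal(size=(1000, 2, 4))
ratios = [np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y) for x, y in pairs]
max_ratio = max(ratios)
```

The bound is finite whenever the weights are finite, which is the condition the theorem needs. It is generally loose, so `max_ratio` is usually well below it.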
Setting tolerances Our framework allows the user to trade off speed for precision, but requires
the user to choose an error tolerance on both the forward and reverse passes during training. For
sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation
experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading
performance.
Reconstructing forward trajectories Reconstructing the state trajectory by running the dynamics
backwards can introduce extra numerical error if the reconstructed trajectory diverges from the
original. This problem can be addressed by checkpointing: storing intermediate values of z on the
forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We
did not find this to be a practical problem, and we informally checked that reversing many layers of
continuous normalizing flows with default tolerances recovered the initial states.
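
An informal check of this kind can be sketched by integrating forwards and then backwards and measuring how far the reconstructed initial state drifts. The dynamics, weights, time span, and fixed-step RK4 solver below are assumptions; the step count stands in for the solver tolerance.

```python
import numpy as np

W = np.array([[-0.1, -1.0],
              [1.0, -0.1]])        # illustrative weights: a slowly contracting rotation

def f(z):
    return np.tanh(W @ z)

def rk4_solve(z, t_from, t_to, steps):
    """Fixed-step RK4; t_to < t_from runs the dynamics backwards in time."""
    dt = (t_to - t_from) / steps
    for _ in range(steps):
        k1 = f(z)
        k2 = f(z + dt / 2 * k1)
        k3 = f(z + dt / 2 * k2)
        k4 = f(z + dt * k3)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

z0 = np.array([0.5, -1.0])
reconstruction_error = {}
for steps in (5, 50, 500):
    zT = rk4_solve(z0, 0.0, 5.0, steps)            # forward pass
    z0_back = rk4_solve(zT, 5.0, 0.0, steps)       # reconstruct by running backwards
    reconstruction_error[steps] = float(np.linalg.norm(z0_back - z0))
```

Coarse solves reconstruct the initial state poorly, while accurate ones recover it closely. Checkpointing would bound the drift further by restarting the backward solve from stored forward states.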
                        7    Related Work

The use of the adjoint method for training continuous-time neural networks was previously pro-
posed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The
interpretation of residual networks (He et al., 2016a) as approximate ODE solvers spurred research
into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al.,
2017). We demonstrate these same properties in more generality by directly using an ODE solver.
Adaptive computation One can adapt computation time by training secondary neural networks
to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al.,
2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training
and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied,
computationally cheap, and generalizable rules for adapting the amount of computation.
Constant memory backprop through reversibility Recent work developed reversible versions
of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives
the same constant memory advantage as our approach. However, these methods require restricted
architectures, which partition the hidden units. Our approach does not have these restrictions.
Learning differential equations Much recent work has proposed learning differential equations
from data. One can train feed-forward or recurrent neural networks to approximate a differential
equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applica-
tions such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting
Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit
differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and
interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic
variational inference to recover the solution of a given stochastic differential equation.
Differentiating through ODE solvers The dolfin library (Farrell et al., 2013) implements adjoint
computation for general ODE and PDE solutions, but only by backpropagating through the individual
operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient
estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity
analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is
linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint
method to train bespoke latent dynamic models.
In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained
end-to-end with any other differentiable model components. While use of vector-Jacobian products
for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson
et al., In Press, 2018), we highlight the potential of a general integration of black-box ODE solvers
into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling.
8    Conclusion
We investigated the use of black-box ODE solvers as a model component, developing new models
for time-series modeling, supervised learning, and density estimation. These models are evaluated
adaptively, and allow explicit control of the tradeoff between computation speed and accuracy.
Finally, we derived an instantaneous version of the change of variables formula, and developed
continuous-time normalizing flows, which can scale to large layer sizes.
9    Acknowledgements
We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya,
Hossein Soleimani, Eldad Haber, Ken Caluwaerts, Daniel Flam-Shepherd, and Harry Braviner for
feedback. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful
discussions. We also thank Yuval Frommer for pointing out an unsupported claim about parameter
efficiency.
References
Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output
   Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.
Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks.
   In International Conference on Machine Learning, pages 136–145, 2017.
Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, 2013.
Joel A E Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi – A
   software framework for nonlinear optimization and optimal control. Mathematical Programming
   Computation, In Press, 2018.
Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind.
   Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18
   (153):1–153, 2018.
Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
   normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betan-
   court. The Stan math library: Reverse-mode automatic differentiation in c++. arXiv preprint
   arXiv:1509.07164, 2015.
Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible
   architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.
Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks
   from dynamical systems view. In International Conference on Learning Representations, 2018.
   URL https://openreview.net/forum?id=SyJS-OgR-.
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural
   networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.
   URL https://doi.org/10.1038/s41598-018-24271-9.
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun.
   Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st
   Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning
   Research, pages 301–318. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/
   v56/Choi16.html.
Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw-
   Hill Education, 1955.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
   estimation. arXiv preprint arXiv:1410.8516, 2014.
Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song.
   Recurrent marked temporal point processes: Embedding event history to vector. In International
   Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.
Patrick Farrell, David Ham, Simon Funke, and Marie Rognes. Automated derivation of the adjoint of
   high-level transient finite element programs. SIAM Journal on Scientific Computing, 2013.
Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and
   Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint,
   2017.
J. Futoma, S. Hariharan, and K. Heller. Learning to Detect Sepsis with a Multitask Gaussian Process
   RNN Classifier. ArXiv e-prints, 2017.
Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network:
   Backpropagation without storing activations. In Advances in Neural Information Processing
   Systems, pages 2211–2221, 2017.
Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
   arXiv:1603.08983, 2016.
David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34
   (1):014004, 2017.
E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems.
   Springer, 1987.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
   recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
   pages 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
   networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
   6a overview of mini-batch gradient descent, 2012.
Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in
   recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
   arXiv:1412.6980, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference
   on Learning Representations, 2014.
Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.
   Improved variational inference with inverse autoregressive flow. In Advances in Neural Information
   Processing Systems, pages 4743–4751, 2016.
W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
   Mathematik und Physik, 46:435–453, 1901.
Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation.
   In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21–28. CMU,
   Pittsburgh, Pa: Morgan Kaufmann, 1988.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
   document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Yang Li. Time-dependent representation for neural event sequence prediction. arXiv preprint
   arXiv:1708.00065, 2017.
Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with
   RNNs: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning
   for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253–
   270. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/v56/Lipton16.html.
Z. Long, Y. Lu, X. Ma, and B. Dong. PDE-Net: Learning PDEs from Data. ArXiv e-prints, 2017.
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks:
   Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121,
   2017.
Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Reverse-mode differentiation of
   native Python. In ICML workshop on Automatic Machine Learning, 2015.
Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating
   multivariate point process. In Advances in Neural Information Processing Systems, pages 6757–
   6767, 2017.
Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for
   ODE based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.
Conny Palm. Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics, 1943.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
   Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
   PyTorch. 2017.
Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE
  Transactions on Neural networks, 6(5):1212–1228, 1995.
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathemat-
   ical theory of optimal processes. 1962.
M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial
   differential equations. Journal of Computational Physics, pages 125–141, 2018.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data-
   driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236, 2018a.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for
   time-dependent and nonlinear partial differential equations. SIAM Journal on Scientific Computing,
  40(1):A172–A198, 2018b.
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate
   inference in deep generative models. In Proceedings of the 31st International Conference on
  Machine Learning, pages 1278–1286, 2014.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv
   preprint arXiv:1505.05770, 2015.
C. Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
  167–178, 1895.
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations.
   arXiv preprint arXiv:1804.04272, 2018.
T. Ryder, A. Golightly, A. S. McGough, and D. Prangle. Black-box Variational Inference for
   Stochastic Differential Equations. ArXiv e-prints, 2018.
Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta
   means. In Advances in Neural Information Processing Systems 25, 2014.
Peter Schulam and Suchi Saria. What-if reasoning with counterfactual Gaussian processes. arXiv
   preprint arXiv:1703.10651, 2017.
Hossein Soleimani, James Hensman, and Suchi Saria. Scalable joint models for reliable uncertainty-
   aware event prediction. IEEE transactions on pattern analysis and machine intelligence, 2017a.
Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatment-response models for coun-
   terfactual reasoning with continuous-time, continuous-valued interventions. arXiv preprint
   arXiv:1704.02038, 2017b.
Jos Stam. Stable fluids. In Proceedings of the 26th annual conference on Computer graphics and
   interactive techniques, pages 121–128. ACM Press/Addison-Wesley Publishing Co., 1999.
Paul Stapor, Fabian Froehlich, and Jan Hasenauer. Optimization and uncertainty analysis of ODE
   models using second order adjoint sensitivity analysis. bioRxiv, page 272005, 2018.
Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow.
   arXiv preprint arXiv:1611.09630, 2016.
Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latent-space physics: Towards learning the
   temporal evolution of fluid flow. arXiv preprint arXiv:1802.10123, 2018.
Hong Zhang and Adrian Sandu. Fatode: a library for forward, adjoint, and tangent linear integration
   of ODEs. SIAM Journal on Scientific Computing, 36(5):C504–C523, 2014.
                                           

         Appendix A   Proof of the Instantaneous Change of Variables Theorem

Theorem (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability
p(z(t)) dependent on time. Let dz/dt = f (z(t), t) be a differential equation describing a continuous-in-time
transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the
change in log probability also follows a differential equation:

                                 <<FORMULA>>

Proof. To prove this theorem, we take the infinitesimal limit of finite changes of log p(z(t)) through time. First
we denote the transformation of z over an ε change in time as
                                 <<FORMULA>>                                           (14)
We assume that f is Lipschitz continuous in z(t) and continuous in t, so every initial value problem has a unique
solution by Picard’s existence theorem. We also assume z(t) is bounded. These conditions imply that f, Tε, and
∂Tε/∂z are all bounded. In the following, we use these conditions to exchange limits and products.

                                 <<FORMULA>>

We can write the differential equation <<FORMULA>> using the discrete change of variables formula, and the
definition of the derivative:

                                 <<FORMULA>>                                                                    (15)

                                 <<FORMULA>>                                                                    (16)

                                 <<FORMULA>>                                           (by L’Hôpital’s rule)    (17)

                                 <<FORMULA>>                                                                    (18)

                                 <<FORMULA>>                                                                    (19)
    
                                 <<FORMULA>>                                                                    (20)

The derivative of the determinant can be expressed using Jacobi’s formula, which gives

                                 <<FORMULA>>                                                                    (21)

                                 <<FORMULA>>                                                                    (22)
    
                                 <<FORMULA>>                                                                    (23)


Substituting Tε with its Taylor series expansion and taking the limit, we complete the proof.

                                 <<FORMULA>>                                                                    (24)

                                 <<FORMULA>>                                                                    (25)

                                 <<FORMULA>>                                                                    (26)
    
                                 <<FORMULA>>                                                                    (27)


                                          A.1     Special Cases

Planar CNF. Let f(z) = u h(wᵀz + b), then ∂f/∂z = u (∂h/∂z)ᵀ. Since the trace of an outer product is the inner
product, we have

                                <<FORMULA>>                                                                     (28)

This is the parameterization we use in all of our experiments.
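The trace identity above can be checked numerically. The following is a minimal sketch, assuming h = tanh and arbitrary illustrative values for u, w, b and z (none of these values come from the experiments). It builds the full Jacobian u (∂h/∂z)ᵀ and compares its trace with the inner product uᵀ ∂h/∂z.

```python
import math

def planar_jacobian(u, w, b, z):
    """Jacobian of f(z) = u * tanh(w.z + b); entry (i, j) is u[i] * dh/dz[j]."""
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b
    dh_dz = [(1.0 - math.tanh(pre) ** 2) * wj for wj in w]  # gradient of h w.r.t. z
    return [[ui * dj for dj in dh_dz] for ui in u], dh_dz

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

# Hypothetical parameter values chosen only for illustration.
u = [0.5, -1.0, 2.0]
w = [1.0, 0.3, -0.7]
b = 0.1
z = [0.2, -0.4, 1.5]

jac, dh_dz = planar_jacobian(u, w, b, z)
trace_full = trace(jac)                                  # O(D^2) work
trace_cheap = sum(ui * di for ui, di in zip(u, dh_dz))   # O(D) work
```

Both quantities agree up to floating-point error, so the change in log density can be evaluated from the inner product alone, without forming the D × D Jacobian.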
Hamiltonian CNF. The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits
the data into two equal partitions and is a volume-preserving transformation, implying that
∂ log p(z(t))/∂t = 0. We can verify this. Let

                               <<FORMULA>>                                 (29)

Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volume-preserving flow.
A.2     Connection to Fokker-Planck and Liouville PDEs
The Fokker-Planck equation is a well-known partial differential equation (PDE) that describes the probability
density function of a stochastic differential equation as it changes with time. We relate the instantaneous change
of variables to the special case of Fokker-Planck with zero diffusion, the Liouville equation.
As with the instantaneous change of variables, let z(t) ∈ R^D evolve through time following dz(t)/dt = f(z(t), t).
Then the Liouville equation describes the change in density at z, a fixed point in space, as a PDE,

                              <<FORMULA>>                                    (30)

However, (30) cannot be easily used, as it requires the partial derivative ∂p(z, t)/∂z, which is typically
approximated using finite differences. This type of PDE has its own literature on efficient and accurate
simulation (Stam, 1999).
Instead of evaluating p(·, t) at a fixed point, if we follow the trajectory of a particle z(t), we obtain

                              <<FORMULA>>

(In the expression above, the first term is the partial derivative through the first argument, z(t), and the
second term is the partial derivative through the second argument, t.)

                              <<FORMULA>>                                      (31)

We arrive at the instantaneous change of variables by taking the log,

                              <<FORMULA>>                                      (32)

While still a PDE, (32) can be combined with z(t) to form an ODE of size D + 1,

                              <<FORMULA>>                                       (33)

Compared to the Fokker-Planck and Liouville equations, the instantaneous change of variables is of more
practical use, as it can be numerically solved much more easily, requiring only an extra state of size D for
following the trajectory of z(t). In contrast, an approach based on a finite difference approximation of the
Liouville equation would require a grid size that is exponential in D.
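A minimal sketch of the combined ODE of size D + 1, here with D = 1. It assumes toy linear dynamics dz/dt = A z, for which tr(∂f/∂z) = A, and a standard normal density at t = 0. The fixed-step RK4 integrator and all names are illustrative choices, not the solver used in the experiments. The state z and its log density are integrated together and compared with the closed-form density.

```python
import math

def rk4_step(f, state, t, h):
    """One classical fourth-order Runge-Kutta step for a list-valued state."""
    def axpy(s, k, c):
        return [si + c * ki for si, ki in zip(s, k)]
    k1 = f(state, t)
    k2 = f(axpy(state, k1, h / 2), t + h / 2)
    k3 = f(axpy(state, k2, h / 2), t + h / 2)
    k4 = f(axpy(state, k3, h), t + h)
    return [s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

A = 0.7  # assumed linear dynamics dz/dt = A * z, so tr(df/dz) = A

def augmented_dynamics(state, t):
    z, logp = state
    return [A * z, -A]  # [dz/dt, d log p / dt = -tr(df/dz)]

def gaussian_logpdf(x, var=1.0):
    return -0.5 * (math.log(2 * math.pi * var) + x * x / var)

z0, t1, steps = 0.3, 1.0, 100
state = [z0, gaussian_logpdf(z0)]   # start from z(0) ~ N(0, 1)
h = t1 / steps
for i in range(steps):
    state = rk4_step(augmented_dynamics, state, i * h, h)
z1, logp1 = state

# Closed form for this toy case: z(t1) ~ N(0, exp(2 * A * t1)).
logp_exact = gaussian_logpdf(z1, var=math.exp(2 * A * t1))
```

Only one trajectory and one extra scalar are integrated, whereas a grid-based solution of the Liouville equation would have to discretize all of R^D.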
Appendix B   A Modern Proof of the Adjoint Method
We present an alternative proof of the adjoint method (Pontryagin et al., 1962) that is short and easy to follow.
B.1        Continuous Backpropagation

Let z(t) follow the differential equation dz(t)/dt = f(z(t), t, θ), where θ are the parameters. We will prove that if
we define an adjoint state

                                                                <<FORMULA>>                                                            (34)

then it follows the differential equation

                                                               <<FORMULA>>                                                             (35)

For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors.
The adjoint state is the gradient with respect to the hidden state at a specified time t. In standard neural networks,
the gradient of a hidden layer ht depends on the gradient from the next layer ht+1 by chain rule

                                                                 <<FORMULA>>                                                            (36)

With a continuous hidden state, we can write the transformation after an ε change in time as

                                                                 <<FORMULA>>                                                            (37)
 
                                                                 <<FORMULA>>                                                            (38)

The proof of (35) follows from the definition of derivative:

              <<FORMULA>>                                                                                                               (39)

              <<FORMULA>>                                                                                     (by Eq 38)                (40)

              <<FORMULA>>                                                                        (Taylor series around z(t))            (41)

              <<FORMULA>>                                                                                                               (42)

              <<FORMULA>>                                                                                                               (43)

             <<FORMULA>>                                                                                                                (44)
  
             <<FORMULA>>                                                                                                                (45)

We pointed out the similarity between the adjoint method and backpropagation (eq. 38). As in backpropagation,
the ODE for the adjoint state needs to be solved backwards in time. We specify the constraint at the last time
point, which is simply the gradient of the loss with respect to the state at the last time point, and can then
obtain the gradients with respect to the hidden state at any time, including the initial value.

                                 <<FORMULA>>                   (46)

Here we assumed that the loss function L depends only on the last time point tN. If L also depends on
intermediate time points t1, t2, . . . , tN−1, we can repeat the adjoint step for each of the intervals [tN−1, tN],
[tN−2, tN−1], . . . in backward order and sum up the obtained gradients.
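The analogy between layer-wise backpropagation (36) and the continuous adjoint can be illustrated on a toy problem. The sketch below assumes scalar linear dynamics f(z, t, θ) = θz and the loss L = z(T); both are illustrative choices. It applies the chain rule through n Euler (residual) layers and compares the result with the closed-form solution of the adjoint ODE da/dt = −a ∂f/∂z, which for this example is a(0) = exp(θT).

```python
import math

def euler_backward(theta, T, n, grad_out=1.0):
    """Chain rule through n Euler layers z_{k+1} = z_k + eps * theta * z_k.

    Each layer has Jacobian d z_{k+1} / d z_k = 1 + eps * theta. For these
    linear dynamics the Jacobian does not depend on z_k, so no forward
    activations need to be stored.
    """
    eps = T / n
    a = grad_out                      # a_n = dL/dz_n
    for _ in range(n):
        a = a * (1.0 + eps * theta)   # a_k = a_{k+1} * d z_{k+1} / d z_k
    return a

theta, T = -0.8, 2.0
a0_coarse = euler_backward(theta, T, n=10)
a0_fine = euler_backward(theta, T, n=100000)
a0_continuous = math.exp(theta * T)   # adjoint ODE solved backwards from a(T) = 1
```

As the number of layers grows and the step ε shrinks, the discrete gradient approaches the continuous adjoint value.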
B.2        Gradients wrt. θ and t
We can generalize (35) to obtain gradients with respect to θ (a constant with respect to t) and the initial and
end times, t0 and tN. We view θ and t as states with constant differential equations and write

                                  <<FORMULA>>                                                 (47)

We can then combine these with z to form an augmented state with corresponding differential equation and
adjoint state,

                                    <<FORMULA>>                 (48)

Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the
previous section still hold as this is a special case of a time-variant ODE. The Jacobian of f has the form

                                      <<FORMULA>>                                (49)

where each 0 is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain

                                    <<FORMULA>>                                  (50)

The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain
the total gradient with respect to the parameters, by integrating over the full interval and setting aθ (tN ) = 0.

                                       <<FORMULA>>                               (51)

Finally, we also get gradients with respect to t0 and tN , the start and end of the integration interval.

                                       <<FORMULA>>                               (52)

Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver.
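The pieces above can be put together on a toy problem. The following is a minimal sketch, assuming scalar dynamics f(z, t, θ) = θz, the loss L = z(t1), and a fixed-step RK4 integrator; all of these are illustrative choices, not the implementation used in the experiments. The augmented state [z, a, aθ] is integrated backwards from t1 to t0. The gradient with respect to t0 is obtained from the identity dL/dt0 = −a(t0) f(z(t0), t0, θ), which holds because z(t0) is held fixed.

```python
import math

def rk4(f, y, t0, t1, n):
    """Integrate dy/dt = f(y, t) from t0 to t1 (t1 < t0 is allowed) with n RK4 steps."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(y, t)
        k2 = f([yi + h / 2 * ki for yi, ki in zip(y, k1)], t + h / 2)
        k3 = f([yi + h / 2 * ki for yi, ki in zip(y, k2)], t + h / 2)
        k4 = f([yi + h * ki for yi, ki in zip(y, k3)], t + h)
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def adjoint_gradients(theta, z0, t0, t1, n=200):
    """Gradients of L = z(t1) for the assumed toy dynamics f(z, t, theta) = theta * z."""
    f = lambda z: theta * z
    (z1,) = rk4(lambda y, t: [f(y[0])], [z0], t0, t1, n)   # forward solve
    a1 = 1.0                                                # dL/dz(t1) for L = z(t1)
    dL_dt1 = a1 * f(z1)

    def aug(y, t):
        z, a, a_theta = y
        # dz/dt = f,  da/dt = -a df/dz,  da_theta/dt = -a df/dtheta
        return [theta * z, -a * theta, -a * z]

    # Backward solve from t1 to t0, starting with a_theta(t1) = 0.
    z0_rec, a0, a_theta0 = rk4(aug, [z1, a1, 0.0], t1, t0, n)
    dL_dt0 = -a0 * f(z0_rec)
    return {"z0": a0, "theta": a_theta0, "t0": dL_dt0, "t1": dL_dt1}

theta, z0, t0, t1 = 0.5, 2.0, 0.0, 1.5
grads = adjoint_gradients(theta, z0, t0, t1)

# Closed-form gradients of z(t1) = z0 * exp(theta * (t1 - t0)) for comparison.
growth = math.exp(theta * (t1 - t0))
exact = {"z0": growth, "theta": z0 * (t1 - t0) * growth,
         "t0": -theta * z0 * growth, "t1": theta * z0 * growth}
```

The backward solve recomputes z(t) alongside the adjoint, so no intermediate states from the forward pass are stored.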

            Appendix C              Full Adjoint sensitivities algorithm

This more detailed version of Algorithm 1 includes gradients with respect to the start and end times of integration.
Algorithm 2 Complete reverse-mode derivative of an ODE initial value problem

Input: dynamics parameters θ, start time t0 , stop time t1 , final state z(t1 ), loss gradient ∂L/∂z(t1 )

                  <<ALGORITHM>>

Note that we’ve overloaded t to be both a part of the state and the (dummy) independent variable. The
distinction is clear given context, so we keep t as the independent variable for consistency with the rest of the
text.

                     Appendix D                Autograd Implementation

                        <<ALGORITHM>>

                     Appendix E               Algorithm for training the latent ODE model

To obtain the latent representation zt0 , we traverse the sequence using an RNN and obtain the parameters of the
distribution q(zt0 |{xti , ti }i , θenc ). The algorithm follows a standard VAE algorithm with an RNN variational
posterior and an ODESolve model:
                                       <<ALGORITHM>>
                                        <<FORMULA>>                         (53)
                                       <<ALGORITHM>>

                     Appendix F               Extra Figures

                                       <<FIGURE>>

<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>> 

              Learning differential equations that are easy to solve

                      Jacob Kelly∗                                Jesse Bettencourt∗
         University of Toronto, Vector Institute         University of Toronto, Vector Institute
             jkelly@cs.toronto.edu                          jessebett@cs.toronto.edu
             Matthew James Johnson                               David Duvenaud
               Google Brain                                 University of Toronto, Vector Institute

            mattjj@google.com                           duvenaud@cs.toronto.edu

                              Abstract


Differential equations parameterized by neural networks become expensive to solve
numerically as training progresses. We propose a remedy that encourages learned
dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate
for the time cost of standard numerical solvers, using higher-order derivatives
of solution trajectories. These derivatives are efficient to compute with Taylor-
mode automatic differentiation. Optimizing this additional objective trades model
performance against the time cost of solving the learned dynamics. We demonstrate
our approach by training substantially faster, while nearly as accurate, models in
supervised classification, density estimation, and time-series modelling tasks.

                    1       Introduction

Differential equations describe a system’s behavior by specifying its instantaneous dynamics. 
Historically, differential equations have been derived from theory, such as Newtonian mechanics, 
Maxwell’s equations, or epidemiological models of infectious disease, with parameters inferred 
from observations. Solutions to these equations usually cannot be expressed in closed-form, 
requiring numerical approximation. Recently, ordinary differential equations parameterized by 
millions of learned parameters, called neural ODEs, have been fit for latent time series models, 
density models, or as a replacement for very deep neural networks (Rubanova et al., 2019; Grathwohl
et al., 2019; Chen et al., 2018). These models are not constrained to match a theoretical
model, and sometimes substantially different dynamics can give nearly indistinguishable predictions.
This raises the possibility that we can find nearly equivalent models that are substantially easier
and faster to solve. Yet standard training methods have no way to penalize the complexity of the 
dynamics being learned.                                                  

                          <<FIGURE>>

Equal Contribution. Code available at: github.com/jacobjinkelly/easy-neural-ode

How can we learn dynamics that are faster to solve numerically without substantially changing their
predictions? Many of the computational advantages of a continuous-time formulation come from
using adaptive solvers, and most of the time cost of these solvers comes from repeatedly evaluating
the dynamics function, which in our settings is a moderately-sized neural network. So, we’d like to
reduce the number of function evaluations (NFE) required for these solvers to reach a given error
tolerance. Ideally, we would add a term penalizing the NFE to the training objective, and let a
gradient-based optimizer trade off between solver cost and predictive performance. But because NFE
is integer-valued, we need to find a differentiable surrogate.
The NFE taken by an adaptive solver depends on how far it can extrapolate the trajectory forward
without introducing too much error. For example, for a standard adaptive-step Runge-Kutta solver
with order m, the step size is approximately inversely proportional to the norm of the local mth total
derivative of the solution trajectory with respect to time. That is, a larger mth derivative leads to a
smaller step size and thus more function evaluations. Thus, we propose to minimize the norm of this
total derivative during training, as a way to control the time required to solve the learned dynamics.
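The link between higher derivatives and NFE can be seen with a small adaptive solver. The sketch below is an illustrative embedded Euler/Heun pair with a textbook step-size controller, not one of the solvers used in our experiments. The toy dynamics, with solution z(t) = sin(ωt), are likewise an assumption. Increasing ω increases the second derivative of the trajectory, which forces smaller accepted steps and therefore more function evaluations.

```python
import math

def adaptive_heun(f, z, t0, t1, tol=1e-4, h=0.1):
    """Integrate dz/dt = f(z, t) with an embedded Euler/Heun pair; return (z, nfe)."""
    t, nfe = t0, 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)
        k1 = f(z, t)
        k2 = f(z + h * k1, t + h)
        nfe += 2
        # Difference between the Heun (order 2) and Euler (order 1) updates.
        err = max(abs(0.5 * h * (k2 - k1)), 1e-16)
        if err <= tol:                       # accept the higher-order update
            z, t = z + 0.5 * h * (k1 + k2), t + h
        # Step-size controller with growth and shrink limits.
        h *= min(5.0, max(0.2, 0.9 * math.sqrt(tol / err)))
    return z, nfe

def make_dynamics(omega):
    # Assumed toy dynamics whose solution from z(0) = 0 is z(t) = sin(omega * t).
    return lambda z, t: omega * math.cos(omega * t)

z_smooth, nfe_smooth = adaptive_heun(make_dynamics(1.0), 0.0, 0.0, 1.0)
z_wiggly, nfe_wiggly = adaptive_heun(make_dynamics(10.0), 0.0, 0.0, 1.0)
```

Both runs meet the same tolerance, but the trajectory with the larger second derivative needs many more evaluations of the dynamics.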
In this paper, we investigate the effect of this speed regularization in various models and solvers.
We examine the relationship between the solver order and the regularization order, and characterize
the tradeoff between speed and performance. In most instances, we find that solver speed can be
approximately doubled without a substantial increase in training loss. We also provide an extension
to the JAX program transformation framework that provides Taylor-mode automatic differentiation,
which is asymptotically more efficient for computing the required total derivatives than standard
nested gradients.
Our work compares against and generalizes that of Finlay et al. (2020), who proposed regularizing
dynamics in the FFJORD density estimation model, and showed that it stabilized dynamics enough
in that setting to allow the use of fixed-step solvers during training.
2    Background
An ordinary differential equation (ODE) specifies the instantaneous change of a vector-valued state
<<FORMULA>>, computing the state at a later time:

                                    <<FORMULA>>

is called an initial value problem (IVP). For example, f could describe the equations of motion for a
particle, or the transmission and recovery rates for a virus across a population. Usually, the required
integral has no analytic solution, and must be approximated numerically.
Adaptive-step Runge-Kutta ODE Solvers Runge-Kutta methods (Runge, 1895; Kutta, 1901)
approximate the solution trajectories of ODEs through a series of small steps, starting at time t0 .
At each step, they choose a step size h, and fit a local approximation to the solution, ẑ(t), using
several evaluations of f. When h is sufficiently small, the numerical error of an mth-order method
is bounded by ‖ẑ(t + h) − z(t + h)‖ ≤ c h^(m+1) for some constant c (Hairer et al., 1993). So, for an
mth-order method, the local error grows approximately in proportion to the size of the mth coefficient
in the Taylor expansion of the true solution. All else being equal, controlling this coefficient for all
dimensions of z(t) will allow larger steps to be taken without surpassing the error tolerance.
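The h^(m+1) scaling of the local error can be checked empirically. The following is a minimal sketch, assuming the test problem dz/dt = z with z(0) = 1 (so the exact solution after one step is exp(h)) and two fixed-step methods, Euler (m = 1) and classical RK4 (m = 4). Halving the step size once gives an estimate of the exponent.

```python
import math

def euler_step(f, z, t, h):
    return z + h * f(z, t)

def rk4_step(f, z, t, h):
    k1 = f(z, t)
    k2 = f(z + h / 2 * k1, t + h / 2)
    k3 = f(z + h / 2 * k2, t + h / 2)
    k4 = f(z + h * k3, t + h)
    return z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def local_error_exponent(step, h=0.1):
    """Estimate p in 'local error ~ c * h**p' by halving the step size once."""
    f = lambda z, t: z                     # assumed test problem dz/dt = z, z(0) = 1
    err = lambda s: abs(step(f, 1.0, 0.0, s) - math.exp(s))
    return math.log2(err(h) / err(h / 2))

p_euler = local_error_exponent(euler_step)   # expected close to m + 1 = 2
p_rk4 = local_error_exponent(rk4_step)       # expected close to m + 1 = 5
```

The estimated exponents are close to 2 and 5 respectively, matching the bound for first-order and fourth-order methods.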
Neural Ordinary Differential Equations The dynamics function f can be a moderately-sized
neural network, and its parameters θ trained by gradient descent. Solving the resulting IVP is
analogous to evaluating a very deep residual network in which the number of layers corresponds
to the number of function evaluations of the solver (Chang et al., 2017; Ruthotto & Haber, 2018;
Chen et al., 2018). Solving such continuous-depth models using adaptive numerical solvers has
several computational advantages over standard discrete-depth network architectures. However, this
approach is often slower than using a fixed-depth network, due to an inability to control the number
of steps required by an adaptive-step solver.

                        3   Regularizing Higher-Order Derivatives for Speed

The ability of Runge-Kutta methods to take large and accurate steps is limited by the Kth-order
Taylor coefficients of the solution trajectory. We would like these coefficients to be small. Specifically,
we propose to regularize the squared norm of the Kth-order total derivatives of the state with respect
to time, integrated along the entire solution trajectory:

                                     <<FORMULA>>                                  (1)

where ‖·‖₂² is the squared ℓ2 norm, and the dependence on the dynamics parameters θ is implicit
through the solution z(t) integrating dz(t)/dt = f(z(t), t, θ).

During training, we weigh this regularization term by a hyperparameter λ and add it to our original loss 
to get our regularized objective:

                                     <<FORMULA>>                                   (2)

What kind of solutions are allowed when R_K = 0? For K = 0,

                                     <<FORMULA>>

we have ‖z(t)‖₂² = 0, so the only possible solution is z(t) = 0. For K = 1, we have ‖f(z(t), t)‖₂² = 0,
so all solutions are constant, flat trajectories. For K = 2, solutions are straight-line trajectories.
Higher values of K shrink higher derivatives, but don’t penalize lower-order dynamics. For instance, a
quadratic trajectory will have R_3 = 0. Setting the Kth order dynamics to exactly zero everywhere
automatically makes all higher orders zero as well. Figure 1 shows that regularizing R_3 on a toy 1D
neural ODE reduces NFE.

                                    <<FIGURE>>
                                                                                     
Which orders should we regularize? We propose matching the order of the regularizer to that of the
solver being used. We conjecture that regularizing dynamics of lower orders than that of the solver
restricts the model unnecessarily, and that letting the lower orders remain unregularized should not
increase NFE very much. Figure 2 shows empirically which orders of Runge-Kutta solvers can
efficiently solve which orders of toy polynomial trajectories. We test these conjectures on real
models and datasets in section 6.2.

The solution trajectory and our regularization term can be computed in a single call to an ODE solver
by augmenting the system with the integrand in eq. (1).
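A minimal sketch of this augmentation for K = 2, assuming toy scalar dynamics f(z, t) = −z and a fixed-step RK4 integrator (both are illustrative choices). The second total derivative is written out by hand as (∂f/∂z) f + ∂f/∂t, and its squared norm is appended to the state, so a single solve returns both z(t1) and the regularizer R_2.

```python
import math

def rk4(f, y, t0, t1, n):
    """Integrate dy/dt = f(y, t) from t0 to t1 with n fixed RK4 steps."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = f(y, t)
        k2 = f([yi + h / 2 * ki for yi, ki in zip(y, k1)], t + h / 2)
        k3 = f([yi + h / 2 * ki for yi, ki in zip(y, k2)], t + h / 2)
        k4 = f([yi + h * ki for yi, ki in zip(y, k3)], t + h)
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def f(z, t):
    return -z  # assumed toy dynamics

def second_total_derivative(z, t):
    # d^2 z/dt^2 = (df/dz) * f + df/dt; here df/dz = -1 and df/dt = 0.
    return (-1.0) * f(z, t)

def augmented(y, t):
    z, r = y
    # The second component accumulates the integrand of the regularizer.
    return [f(z, t), second_total_derivative(z, t) ** 2]

z0, T = 2.0, 1.0
z1, r2 = rk4(augmented, [z0, 0.0], 0.0, T, n=200)

# Closed form for this toy case: z(t) = z0 * exp(-t), so R_2 = z0^2 (1 - exp(-2T)) / 2.
r2_exact = z0 ** 2 * (1.0 - math.exp(-2.0 * T)) / 2.0
```

In a learned model the total derivative would come from automatic differentiation rather than a hand-written expression.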

            4   Efficient Higher Order Differentiation with Taylor Mode

The number of terms in higher-order forward derivatives grows exponentially in K, becoming
prohibitively expensive for K = 5, and causing substantial slowdowns even for K = 2 and K = 3.
Luckily, there exists a generalization of forward-mode automatic differentiation (AD), known as
Taylor mode, which can compute the total derivative exactly for a cost of only O(K²). We found
that this asymptotic improvement reduced wall-clock time by an order of magnitude, even for K as
low as 3.
First-order forward-mode AD Standard forward-mode AD computes, for a function f(x) and
an input perturbation vector v, the product (∂f/∂x) v. This Jacobian-vector product, or JVP, can be
computed efficiently without explicitly instantiating the Jacobian. This implicit computation of JVPs
is straightforward whenever f is a composition of operations for which implicit JVP rules are
known.
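A minimal sketch of first-order forward mode using dual numbers. The `Dual` class, the `jvp` helper and the example function are hypothetical names introduced for illustration; this is not the JAX implementation. Each primitive (addition, multiplication, sine) carries its own JVP rule, and composing them yields (∂f/∂x) v without ever forming the Jacobian.

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the tangents.
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)
    __rmul__ = __mul__

def sin(x):
    # JVP rule for sine: d sin(x) = cos(x) dx.
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)

def jvp(f, x, v):
    """Return f(x) and (df/dx) v for a scalar-valued f of several inputs."""
    out = f(*[Dual(xi, vi) for xi, vi in zip(x, v)])
    return out.primal, out.tangent

def f(x1, x2):
    return x1 * x2 + sin(x1)

value, directional = jvp(f, x=[1.0, 2.0], v=[0.5, -1.0])

# Hand-derived check: (x2 + cos(x1)) * v1 + x1 * v2 at the same point.
expected = (2.0 + math.cos(1.0)) * 0.5 + 1.0 * (-1.0)
```

One forward pass gives both the function value and the directional derivative along v.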
Higher-order Jacobian-vector products Forward-mode AD can be generalized to higher orders
to compute Kth-order Jacobians contracted K times against the perturbation vector: (∂^K f/∂x^K) v^{⊗K}.
Similarly, this can also be computed without representing any Jacobian matrices explicitly.

A naïve approach to higher-order forward mode is to recursively apply first-order forward mode.
Specifically, nesting JVPs K times gives the right answer: <<FORMULA>> but
causes an unnecessary exponential slowdown, costing O(exp(K)). This is because expressions that
appear in lower derivatives also appear in higher derivatives, but the work to compute them is not
shared across orders.
Taylor Mode Taylor-mode AD generalizes first-order forward mode to compute the first K derivatives
exactly with a time cost of only O(K^2) or O(K log K), depending on the operations involved. Instead
of providing rules for propagating perturbation vectors, one provides rules for propagating truncated
Taylor series. Some example rules are shown in table 1. For more details see the Appendix and
Griewank & Walther (2008, Chapter 12). We provide an open source implementation of Taylor mode AD
in the JAX Python library (Bradbury et al., 2018).

Function              Taylor propagation rule
<<y = z + cw>>        <<y_{[k]} = z_{[k]} + c w_{[k]}>>
<<y = z * w>>         <<y_{[k]} = \sum_{j=0}^{k} z_{[j]} w_{[k-j]}>>
<<y = z / w>>         <<y_{[k]} = \frac{1}{w_0} \left[ z_{[k]} - \sum_{j=0}^{k-1} y_{[j]} w_{[k-j]} \right]>>
<<y = \exp(z)>>       <<\tilde{y}_{[k]} = \sum_{j=1}^{k} y_{[k-j]} \tilde{z}_{[j]}>>
<<s = \sin(z)>>       <<\tilde{s}_{[k]} = \sum_{j=1}^{k} \tilde{z}_{[j]} c_{[k-j]}>>
<<c = \cos(z)>>       <<\tilde{c}_{[k]} = \sum_{j=1}^{k} -\tilde{z}_{[j]} s_{[k-j]}>>

Table 1: Rules for propagating Taylor polynomial coefficients through standard functions. These rules
generalize standard first-order derivatives. Notation: <<z_{[i]} = \frac{1}{i!} z_i>> and
<<\tilde{z}_{[i]} = \frac{i}{i!} z_i>>.
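
As an illustration of the rules in Table 1, the following dependency-free sketch propagates lists of normalized Taylor coefficients through addition, multiplication, division and exp. The function names are hypothetical and this is not the JAX implementation. The quotient rule is written in its standard recurrence form, obtained from z = y w, in which the sum runs over the already-computed output coefficients of y.

```python
from math import exp


def add_scaled(z, w, c):
    """y = z + c*w: coefficients combine linearly."""
    return [zk + c * wk for zk, wk in zip(z, w)]


def mul(z, w):
    """y = z*w: y[k] = sum_{j=0}^{k} z[j] * w[k-j]."""
    return [sum(z[j] * w[k - j] for j in range(k + 1)) for k in range(len(z))]


def div(z, w):
    """y = z/w: y[k] = (z[k] - sum_{j=0}^{k-1} y[j] * w[k-j]) / w[0]."""
    y = []
    for k in range(len(z)):
        y.append((z[k] - sum(y[j] * w[k - j] for j in range(k))) / w[0])
    return y


def exp_series(z):
    """y = exp(z): k*y[k] = sum_{j=1}^{k} y[k-j] * (j*z[j])."""
    y = [exp(z[0])]
    for k in range(1, len(z)):
        y.append(sum(y[k - j] * j * z[j] for j in range(1, k + 1)) / k)
    return y


# x(t) = t has normalized coefficients [0, 1, 0, 0, 0].
print(exp_series([0.0, 1.0, 0.0, 0.0, 0.0]))           # coefficients of exp(t)
print(div([1.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]))  # coefficients of 1/(1-t)
```

Each rule costs O(K^2) multiplications for K coefficients, because the k-th output coefficient reuses all lower-order coefficients instead of recomputing them.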

                     5     Experiments
                                                                                    
We consider three different tasks in which continuous-depth or continuous time models might have
computational advantages over standard discrete-depth models: supervised learning, continuous
generative modeling of time-series (Rubanova et al., 2019), and density estimation using continuous
normalizing flows (Grathwohl et al., 2019). Unless specified otherwise, we use the standard dopri5
Runge-Kutta 4(5) solver (Dormand & Prince, 1980; Shampine, 1986).

                <<FIGURE>>

Figure 3: Number of function evaluations (NFE) and training error during training. Speed
regularization (solid) decreases the NFE throughout training without substantially changing the
training error.

5.1   Supervised Learning

We construct a model for MNIST classification: it takes in as input a flattened MNIST image and
integrates it through dynamics given by a simple MLP, then applies a linear classification layer. In
fig. 3 we compare the NFE and training error of a model with and without regularizing R3.
                                                                  
5.2   Continuous Generative Time Series Models

As in Rubanova et al. (2019), we use the Latent ODE architecture for modelling trajectories of ICU
patients using the PhysioNet Challenge 2012 dataset (Silva et al., 2012). This variational autoencoder
architecture uses an RNN recognition network, and models the state dynamics using an ODE in a latent
space. In the supervised learning setting described in the previous section only the final state affects
model predictions. In contrast, time-series models’ predictions also depend on the value of the
trajectory at all intermediate times when observations were made. So, we might expect speed
regularization to be ineffective due to these extra constraints on the dynamics. However, fig. 4 shows
that, without changing their overall shape, the latent dynamics can be adjusted to reduce their NFE by
a factor of 3.

Figure 4: Regularizing dynamics in a latent ODE modeling PhysioNet clinical data. Shown is a
representative 2-dimensional slice of the 20-dimensional dynamics. We reduce average NFE from 281 to
90 while only incurring an 8% increase in loss.
                                                   
5.3                       Density Estimation with Continuous Normalizing Flows

Our third task is unsupervised density estimation, using a scalable variant of continuous normalizing
flows called FFJORD (Grathwohl et al., 2019). We fit the MINIBOONE tabular dataset from
Papamakarios et al. (2017) and the MNIST image dataset (LeCun et al., 2010). We use the respective
single-flow architectures from Grathwohl et al. (2019).
Grathwohl et al. (2019) noted that the NFE required to numerically integrate their dynamics could
become prohibitively expensive throughout training. Table 2 shows that we can reduce NFE by 38%
for only a 0.6% increase in log-likelihood measured in bits/dim.
How to train your Neural ODE We compare against the approach of Finlay et al. (2020), who
design two regularization terms specifically for stabilizing the dynamics of FFJORD models:
       
                        <<FORMULA>>

The first term is designed to encourage straight-line paths, and the second, stochastic, term is designed
to reduce overfitting. Finlay et al. (2020) used fixed-step solvers during training for some datasets.
We compare these two regularization approaches, training with each of adaptive and fixed-step solvers
and evaluating with an adaptive solver, in section 6.3.
6                        Analysis and Discussion
6.1                       Trading off function evaluations for loss
What does the trade-off between accuracy and speed look like? Ideally, we could reduce the solver
time a lot without substantially reducing model performance. Indeed, this is demonstrated in all three
settings we explored. Figure 5 shows that generally, model performance starts getting substantially
worse only after a 50% reduction in solver time when controlling R2.

               <<FIGURE>>

Figure 5: Tuning the regularization of R2 trades off between training loss and solver speed in three
different applications of neural ODEs. Horizontal axes show average number of function evaluations,
and vertical axes show unregularized training loss, both at the end of training.

6.2 Order of regularization vs. order of solver

Which order of total derivatives should we regularize for a particular solver? As mentioned earlier,
we conjecture that the best choice would be to match the order of the solver being used. Regularizing
too low an order might needlessly constrain the dynamics and make it harder to fit the data, while
regularizing too high an order might leave the dynamics difficult to solve for a lower-order solver.
However, we also expect that optimizing higher-order derivatives might be challenging, since these
higher derivatives can change quickly even for small changes to the dynamics parameters.
Figures 6 and 7 investigate this question on the task of MNIST classification. Figure 6 compares the
effectiveness of regularizing different orders when using a solver of a particular order. For a 2nd
order solver, regularizing K = 2 produces a strictly better trade-off between performance and speed,
as expected. For higher-order solvers, including ones with adaptive order, we found that regularizing
orders above K = 3 gave little benefit.

               <<FIGURE>>

Figure 7 investigates the relationship between RK and the quantity it is meant to be a surrogate
for: NFE. We observe a clear monotonic relationship between the two, for all orders of solver and
regularization.

               6.3          Do we reduce training time?

Our approach produces models that are fastest to evaluate at test time. However, when we train
with adaptive solvers we do not improve overall training time, due to the additional expense of
computing our regularizer. Training with a fixed-grid solver is faster, but can be unstable if dynamics
are unregularized. Finlay et al. (2020)’s regularization and ours allow us to use fixed grid solvers and
reduce training time. However, ours is 2.4× slower than Finlay et al. (2020) for FFJORD because
their regularization re-uses terms already computed in the FFJORD training objective. For objectives
where these cannot be re-used, like MNIST classification, our method is 1.7× slower, but achieves
better test-time NFE.

               6.4       Are we making the solver overconfident?

Because we optimize dynamics in a way specifically designed to make the solver take longer steps,
we might fear that we are “adversarially attacking” our solver, making it overconfident in its ability
to extrapolate. Figure 8c shows that this is not the case for MNIST classification.

               6.5       Does speed regularization overfit?

Finlay et al. (2020) motivated one of their regularization terms by the possibility of overfitting: having
faster dynamics only for the examples in the training set, but still slow on the test set. However, they
did not check whether overfitting was occurring. In fig. 8b we confirm that our regularized dynamics
have nearly identical average solve time on a held-out test set, on MNIST classification.

                             7        Related Work

Although the field of numerical ODE solvers is extremely mature, as far as we know, there has
been almost no work specifically on tuning differential equations to be faster to solve. The closest
related work is Grathwohl et al. (2019) who mention attempting to use weight decay and spectral
normalization to reduce NFE, and of course Finlay et al. (2020), who, among other contributions,
introduced the use of fixed-step solvers for stable training.

                                 <<FIGURE>>

Figure 8: Figure 8c: We observe that the actual solver error is about equally well-calibrated for
regularized dynamics as random dynamics, indicating that regularization does not make the solver
overconfident. Figure 8b: There is negligible overfitting of solver speed. ??: Speed regularization
does not usefully improve generalization. For large λ, our method reduces overfitting, but increases
overall test error due to under-fitting.

Stabilizing dynamics Simard et al. (1991) regularized the dynamics of discrete-time recurrent
neural networks to improve their stability, by constraining the norm of the Jacobian of the dynamics
function in the direction of its largest eigenvalue. However, this approach has an O(D^3) time cost.
De Brouwer et al. (2019) introduced a parameterization of neural ODEs analogous to instantaneous
Gated Recurrent Unit (GRU) recurrent neural network architectures in order to stabilize training
dynamics. Dupont et al. (2019) provided theoretical arguments that adding extra dimensions to the
state of a neural ODE should make training easier, and showed that this helped reduce NFE during
training.
Gradually increasing depth Chang et al. (2017) noted the connection between residual networks
and ODEs, and took advantage of this connection to gradually make resnets deeper during training,
in order to save time. One can view the increase in NFE while training neural ODEs as an automatic, but
uncontrolled, version of their method. Their results suggest we might benefit from introducing a
speed regularization schedule that gradually tapers off during training.
Gradient Regularization Novak et al. (2018); Drucker & LeCun (1992) regularized the gradients
of neural networks to improve generalization.
Table 2: Density Estimation on MNIST using FFJORD. For adaptive solvers, indicated by ∞ Steps,
our approach is slowest to train, but requires the fewest NFE once trained. For fixed-step solvers our
approach achieves lower bits/dim and NFE when comparing across fixed-grid solvers using the same
number of steps. Fixed step solvers that diverged due to instability are indicated by NaN bits/dim.
  
                        8    Scope

The initial speedups obtained in this paper are not yet enough to make neural ODEs competitive with
standard fixed-depth architectures in terms of speed for standard supervised learning. However, there
are many applications where continuous-depth architectures provide a unique advantage. Besides
density models such as FFJORD and time series models, continuous-depth architectures have been
applied in solving mean-field games (Ruthotto et al., 2019), image segmentation (Pinckaers & Litjens,
2019), image super-resolution (Scao, 2020), and molecular simulations (Wang et al., 2020). These
applications, which already use continuous-time models, could benefit from the speed regularization
proposed in this paper.
While we investigated only ODEs in this paper, this approach could presumably be extended straight-
forwardly to neural stochastic differential equations fit by adaptive solvers (Li et al., 2020) and other
flavors of parametric differential equations fit by gradient descent (Rackauckas et al., 2019).

                      9    Limitations

Hyperparameters The hyperparameter λ needs to be chosen to balance speed and training loss.
On the other hand, neural ODEs don’t require choosing the outer number of layers, which needs to
be chosen separately for each stack of layers in standard architectures.
One also needs to choose solver order and tolerances, and these can substantially affect solver speed.
We did not investigate loosening tolerances, or modifying other parameters of the solver. The default
tolerance of 1.4e-8 for both atol and rtol behaved well in all our experiments.
One also needs to choose K. Higher K seems to generally work better, but is slower per step at
training time. In principle, if one can express their utility explicitly in terms of training loss and NFE,
it may be possible to tune λ automatically during training using the predictable relationship between
RK and NFE shown in fig. 7.
Slower overall training Although speed regularization reduces the overall NFE during training, it
makes each step more expensive. In our density estimation experiments (table 2), the overall effect
was about 70% slower training, compared to no regularization, when using adaptive solvers.
However, test-time evaluation is much faster, since there is no slowdown per step.
10     Conclusions
This paper is an initial attempt at controlling the integration time of differential equations by regular-
izing their dynamics. This is an almost unexplored problem, and there are almost certainly better
quantities to optimize than the ones examined in this paper.
Based on these initial experiments, we propose three practical takeaways:
       1. Across all tasks, tuning the regularization usually gave at least a 2x speedup without
          substantially hurting model performance.
       2. Overall training time with speed regularization is in general about 30% to 50% slower with
          adaptive solvers.
       3. For standard solvers, regularizing orders higher than R2 or R3 provided little additional
          benefit.
Future work It may be possible to adapt solver architectures to take advantage of flexibility in
choosing the dynamics. Standard solver design has focused on robustly and accurately solving a
given set of differential equations. However, in a learning setting, we could consider simply rejecting
some kinds of dynamics as being too difficult to solve, analogous to other kinds of constraints we put
on models to encourage statistical regularization.
                                                   
                        Acknowledgements

We thank Barak Perlmutter, Ken Jackson, Ricky T.Q. Chen, Will Grathwohl, Chris Finlay, and
Chris Rackauckas for feedback and helpful discussions. Resources used in preparing this research
were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and
companies sponsoring the Vector Institute.

                        References

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne,
   S. JAX: composable transformations of Python+NumPy programs, 2018. URL
   http://github.com/google/jax.
Chang, B., Meng, L., Haber, E., Tung, F., and Begert, D. Multi-level residual networks from
   dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential
   equations. In Advances in neural information processing systems, pp. 6571–6583, 2018.
De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. GRU-ODE-Bayes: Continuous modeling of
   sporadically-observed time series. In Advances in Neural Information Processing Systems, pp.
  7377–7388, 2019.
Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of
   computational and applied mathematics, 6(1):19–26, 1980.
Drucker, H. and LeCun, Y. Improving generalization performance using double backpropagation.
  IEEE Trans. Neural Networks, 3(6):991–997, 1992. doi: 10.1109/72.165600. URL
  https://doi.org/10.1109/72.165600.
Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Advances in Neural Information
  Processing Systems, pp. 3134–3144, 2019.
Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. M. How to train your neural ODE.
   arXiv preprint arXiv:2002.02798, 2020.
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form
   continuous dynamics for scalable reversible generative models. International Conference on
  Learning Representations, 2019.
Griewank, A. and Walther, A. Evaluating derivatives. 2008.
Hairer, E., Norsett, S., and Wanner, G. Solving Ordinary Differential Equations I: Nonstiff Problems,
  volume 8. 1993. doi: 10.1007/978-3-540-78862-1.
Kutta, W. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
  Mathematik und Physik, 46:435–453, 1901.
LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online].
  Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Li, X., Chen, R. T. Q., Wong, T.-K. L., and Duvenaud, D. Scalable gradients for stochastic differential
   equations. In Artificial Intelligence and Statistics, 2020.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and
   generalization in neural networks: an empirical study. In 6th International Conference on Learning
  Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
  Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation.
  Advances in Neural Information Processing Systems, 2017.
Pinckaers, H. and Litjens, G. Neural ordinary differential equations for semantic segmentation of
   individual colon glands. arXiv preprint arXiv:1910.10470, 2019.
Rackauckas, C., Innes, M., Ma, Y., Bettencourt, J., White, L., and Dixit, V. Diffeqflux.jl-a Julia
   library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent ordinary differential equations for irregularly-
   sampled time series. In Advances in Neural Information Processing Systems, pp. 5321–5331,
   2019.
Runge, C. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
  167–178, 1895.
Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations. Journal
   of Mathematical Imaging and Vision, pp. 1–13, 2018.
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and Fung, S. W. A machine learning framework for
   solving high-dimensional mean field game and mean field control problems. CoRR, abs/1912.01825,
   2019. URL http://arxiv.org/abs/1912.01825.
Scao, T. L. Neural differential equations for single image super-resolution. arXiv preprint
   arXiv:2005.00865, 2020.
Shampine, L. F. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):
  135–150, 1986. ISSN 00255718, 10886842. URL http://www.jstor.org/stable/2008219.
Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. Predicting in-hospital mortality of
   ICU patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in
  Cardiology, pp. 245–248, 2012.
Simard, P., Raysz, J. P., and Victorri, B. Shaping the state space landscape in recurrent networks. In
  Advances in neural information processing systems, pp. 105–112, 1991.
Wang, W., Axelrod, S., and Gómez-Bombarelli, R. Differentiable molecular simulations for control
   and learning. arXiv preprint arXiv:2003.00868, 2020.

                  Appendix A            Taylor-mode Automatic Differentiation

                              A.1    Taylor Polynomials

To clarify the relationship between the presentation in Chapter 13 of Griewank & Walther (2008) and
our results we give the distinction between the Taylor coefficients and derivative coefficients, also
known, unhelpfully, as Tensor coefficients.

For a sufficiently smooth vector valued function f : R^n → R^m and the polynomial

                         <<x(t) = x_{[0]} + x_{[1]} t + x_{[2]} t^2 + x_{[3]} t^3 + \cdots + x_{[d]} t^d \in \mathbb{R}^n>>                 (5)

we are interested in the d-truncated Taylor expansion

                          <<y(t) = f(x(t)) + O(t^{d+1})>>                                                                      (6)

                          <<\equiv y_{[0]} + y_{[1]} t + y_{[2]} t^2 + y_{[3]} t^3 + \cdots + y_{[d]} t^d \in \mathbb{R}^m>>              (7)

with the notation that <<FORMULA>> is the Taylor coefficient, which is the normalized derivative coefficient.

The Taylor coefficients of the expansion, y_{[j]}, are smooth functions of the i ≤ j coefficients x_{[i]},

                                       <<FORMULA>>                                                                        (8)

                                       <<FORMULA>>                                                                        (9)

                                       <<FORMULA>>                                                                       (10)

                                       <<FORMULA>>                                                                       (11)

These, as given in Griewank & Walther (2008), are written in terms of the normalized Taylor
coefficients. This obscures their direct relationship with the derivatives, which we make explicit.
Consider the polynomial eq. (5) with Taylor coefficients expanded so their normalization is clear.
Further, let’s use suggestive notation that these coefficients correspond to the higher derivatives of 
x with respect to t, making x(t) a Taylor polynomial. That is <<FORMULA>>.
                                       <<FORMULA>>                                                                       (12)

                                       <<FORMULA>>                                                                       (13)

                                       <<FORMULA>>                                                                       (14)

Again, we are interested in the polynomial eq. (7), but with the normalization terms explicit

                                       <<FORMULA>>                                                                       (15)

Now we can expand the expressions for the Taylor coefficients y_{[i]} to expressions for derivative
coefficients y_i = i! y_{[i]}.

The coefficients of the Taylor expansion, y_j, are smooth functions of the i ≤ j coefficients x_i,

                                       <<FORMULA>>                                                                       (16)

                                       <<FORMULA>>                                                                       (17)

                                       <<FORMULA>>                                                                       (18)

                                       <<FORMULA>>                                                                       (19)

                                       <<FORMULA>>                                                                       (20)

                                       <<FORMULA>>                                                                       (21)

Therefore, eqs. (16), (17), (19) and (21) show that the derivative coefficients y_i are exactly the ith
order higher derivatives of the composition f(x(t)) with respect to t. The key insight of this exercise
is that by writing the derivative coefficients explicitly we reveal that the expressions for the terms,
eqs. (16) to (18) and (20), involve terms previously computed for lower order terms.
In general, it will be useful to consider that the derivative coefficient y_k is a function of all lower
order input derivatives

                                             <<y_k = y_k(x_0, \ldots, x_k)>>.                                  (22)

We provide the API to compute this in JAX by indexing the kth output of jet

                                      <<y_k = \mathrm{jet}(f, x_0, (x_1, \ldots, x_k))[k]>>.
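
A minimal usage sketch of this API follows, assuming the `jet` function exposed in `jax.experimental.jet`, which takes a tuple of primal inputs and a tuple of per-input coefficient sequences and returns the output value together with a list of output coefficients. The choice of `jnp.sin` and the input x(t) = x0 + t are illustrative assumptions. With derivative coefficients (1, 0, 0) for the input, the outputs should match the first three derivatives of sin at x0.

```python
import math

import jax.numpy as jnp
from jax.experimental.jet import jet

x0 = 0.3
# x(t) = x0 + t: first derivative 1, higher derivatives 0.
x_derivs = (1.0, 0.0, 0.0)

# y0 is f(x0); (y1, y2, y3) are the derivative coefficients of f(x(t)) at t = 0.
y0, (y1, y2, y3) = jet(jnp.sin, (x0,), (x_derivs,))

# Closed-form derivatives of sin for comparison.
expected = (math.cos(x0), -math.sin(x0), -math.cos(x0))
print(float(y1), float(y2), float(y3))
print(expected)
```

In this calling convention the kth output coefficient is the kth entry of the returned list, counting from one, rather than an index into a single flat result.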

                  A.2    Relationship with Differential Equations

                           A.2.1    Autonomous Form

We can transform the initial value problem

                                     <<FORMULA>>                                (23)

into an autonomous dynamical system by augmenting the system to include the independent variable
with trivial dynamics (Hairer et al., 1993):

                               <<FORMULA>>                              (24)

We do this for notational convenience; it also makes clear that derivatives with respect to t are
meant in the "total" sense. This alleviates the potential ambiguity of ∂t f (x(t), t), which could mean
either the derivative with respect to the second argument or the derivative that also passes through
x(t) by the chain rule <<FORMULA>>.
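
The augmentation can be sketched in a few lines of plain Python. The wrapper name `make_autonomous` and the example dynamics are hypothetical; the only assumption taken from the text is that time is appended to the state with derivative 1.

```python
def make_autonomous(f):
    """Wrap f(x, t) as g(state), where state = [*x, t] and dt/dt = 1."""
    def g(state):
        *x, t = state
        return [*f(x, t), 1.0]
    return g


def f(x, t):
    # Hypothetical time-dependent dynamics: dx/dt = t * x (elementwise).
    return [t * xi for xi in x]


g = make_autonomous(f)
print(g([2.0, 3.0, 0.5]))  # state derivative followed by the trivial time derivative
```

Once the dynamics are written this way, every derivative with respect to t is a total derivative of g along the augmented trajectory.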

            A.2.2    Taylor Coefficients for ODE Solution with jet

Recall that jet gives us the coefficients for y_i as a function of f and the coefficients x_j for j ≤ i. We
can use jet and the relationship x_{k+1} = y_k to recursively compute the coefficients of the solution
polynomial.

                     Algorithm 1 Taylor Coefficients for ODE Solution by Recursive Jet

                                    <<ALGORITHM>>
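
The recursion x_{k+1} = y_k can be sketched without JAX by letting a hand-written series rule stand in for jet. Everything below is an illustrative assumption rather than the authors' implementation: the dynamics dz/dt = z^2, the helper names, and the use of the product rule on normalized coefficients. For z(0) = 1 the exact solution is 1/(1 - t), whose kth derivative at t = 0 is k!.

```python
from math import factorial


def mul_series(a, b):
    """Product rule on normalized Taylor coefficients: y[k] = sum_j a[j] * b[k-j]."""
    n = min(len(a), len(b))
    return [sum(a[j] * b[k - j] for j in range(k + 1)) for k in range(n)]


def ode_derivative_coeffs(f_series, x0, K):
    """Return [x_0, ..., x_K], the derivatives at t = 0 of the solution of dx/dt = f(x).

    f_series maps normalized input coefficients to normalized output coefficients
    and plays the role that jet plays in the text.
    """
    xs = [x0]
    for k in range(K):
        normalized = [x / factorial(i) for i, x in enumerate(xs)]
        y = f_series(normalized)        # normalized coefficients of f(x(t)) up to order k
        y_k = factorial(k) * y[k]       # convert to the derivative coefficient y_k
        xs.append(y_k)                  # x_{k+1} = y_k
    return xs


def square(z):
    # Hypothetical dynamics f(z) = z * z.
    return mul_series(z, z)


print(ode_derivative_coeffs(square, 1.0, 4))
```

Each pass recomputes the lower-order outputs, which keeps the sketch short; an implementation concerned with cost would reuse them.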

                        A.3    Regularizing Taylor Terms

Computing the Taylor coefficients for the ODE solution as in algorithm 1 will give a local approx-
imation to the ODE solution. If infinitely many Taylor coefficients could be computed this would
give the exact solution. The order of the final Taylor coefficient, determining the truncation of the
polynomial, gives the order of the approximation.
If the higher order Taylor coefficients of the solution are large, then truncation will result in a local
approximation that quickly diverges from the solution. However, if the higher Taylor coefficients are
small then the local approximation will remain close to the solution. This motivates our regularization
method. The effect of our regularizer on the Taylor expansion of a solution to a neural ODE can be
seen in fig. 9.
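
A small numerical sketch of this motivation, under the assumption of a linear toy ODE dz/dt = a z with z(0) = 1 (not one of the models in the paper): the kth derivative of the solution at t = 0 is a^k, so a larger a means larger higher-order coefficients and a worse truncated approximation for the same step size.

```python
from math import exp, factorial


def taylor_approx(derivs, h):
    """Evaluate the truncated Taylor polynomial with derivative coefficients derivs at step h."""
    return sum(d * h**k / factorial(k) for k, d in enumerate(derivs))


def truncation_error(a, order, h):
    """Error of the order-`order` Taylor approximation to exp(a * h)."""
    derivs = [a**k for k in range(order + 1)]
    return abs(exp(a * h) - taylor_approx(derivs, h))


h = 0.5
print(truncation_error(0.5, 3, h))  # small higher derivatives
print(truncation_error(4.0, 3, h))  # large higher derivatives
```

For the same order and step size, the dynamics with small higher derivatives are tracked far more closely, which is the behaviour the regularizer is meant to encourage.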

                  Appendix B         Experimental Details

Experiments were conducted using GPU-based ODE solvers. Training gradients were computed
using the adjoint method, in which the trajectory is reconstructed backwards in time during
backpropagation to save memory. As in Finlay et al. (2020), we normalize our regularization term in eq. (1) by
the dimension of the vector-valued trajectory z(t) so that we may choose λ free of scaling by the
dimension of the problem.

               B.1    Efficient computation of the gradient of regularization term

To optimize our regularized objective, we must compute its gradient. We use the adjoint method
as described in Chen et al. (2018) to differentiate through the solution to the ODE. In particular, to
optimize our model we only need to compute the gradient of the regularization term. The adjoint
method gives the gradient of the ODE solution as a solution to an augmented ODE.

                               <<FIGURE>>

Figure 9: Left: The dynamics and a trajectory of a neural ODE trained on a toy supervised learning
problem. The dynamics are poorly approximated by a 6th-order local Taylor series, and a solve with a
5th-order Runge-Kutta solver requires 92 NFE. Right: Regularizing the 6th-order derivatives of
trajectories gives dynamics that are easier to solve numerically, requiring only 68 NFE.

                     B.2   Supervised Learning

The dynamics function f : R^d × R → R^d is given by an MLP as follows

                                            <<z1 = σ(x)>>
                                        <<h1 = W1 [z1 ; t] + b1>>
                                            <<z2 = σ(h1 )>>
                                        <<y = W2 [z2 ; t] + b2>>

where <<[·; ·]>> denotes concatenation of a scalar onto a column vector. The parameters are
<<W_1 ∈ R^{h×d}>>, <<b_1 ∈ R^h>>, <<W_2 ∈ R^{d×h}>>, and <<b_2 ∈ R^d>>. Here we use 100 hidden units,
i.e. <<h = 100>>. We have <<d = 784>>, the dimension of an MNIST image.
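
The four equations above can be written as a dependency-free sketch. Two points are assumptions rather than statements from the text: the nonlinearity σ is not specified here, so tanh is used as a placeholder, and the weight matrices are given one extra column (shapes h×(d+1) and d×(h+1)) so that they can multiply the vectors with t appended. The helper names and the initialization scale are also hypothetical.

```python
import math
import random


def init_params(d=784, h=100, seed=0):
    """Small random weights; W1 and W2 carry one extra column for the appended t."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.01) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(h, d + 1), "b1": [0.0] * h,
        "W2": matrix(d, h + 1), "b2": [0.0] * d,
    }


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def dynamics(params, x, t, sigma=math.tanh):
    """Evaluate the MLP dynamics f(x, t) for a flattened state x and scalar time t."""
    z1 = [sigma(xi) for xi in x]                            # z1 = sigma(x)
    h1 = affine(params["W1"], params["b1"], z1 + [t])       # h1 = W1 [z1; t] + b1
    z2 = [sigma(hi) for hi in h1]                           # z2 = sigma(h1)
    return affine(params["W2"], params["b2"], z2 + [t])     # y = W2 [z2; t] + b2


params = init_params(d=4, h=3)
print(dynamics(params, [0.1, 0.2, 0.3, 0.4], 0.5))
```

The output has the same dimension as the input state, as required for it to serve as the right-hand side of an ODE.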
We train with a batch size of 100 for 160 epochs. We use the standard training set of 60,000 images,
and the standard test set of 10,000 images as a validation/test set. We optimize our model using SGD
with momentum with β = 0.9. Our learning rate schedule is 1e-1 for the first 60 epochs, 1e-2 until
epoch 100, 1e-3 until epoch 140, and 1e-4 for the final 20 epochs.
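
The learning rate schedule just described can be expressed as a piecewise-constant function. This sketch assumes 0-indexed epochs and that each boundary epoch starts the next rate; the function name is hypothetical.

```python
def learning_rate(epoch):
    """Piecewise-constant learning rate for 0-indexed epochs 0..159."""
    if epoch < 60:
        return 1e-1
    if epoch < 100:
        return 1e-2
    if epoch < 140:
        return 1e-3
    return 1e-4


schedule = [learning_rate(e) for e in range(160)]
print(schedule[0], schedule[60], schedule[100], schedule[140])
```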
B.3   Continuous Generative Modelling of Time-Series
The PhysioNet dataset consists of observations of 41 distinct traits over a time period of 48 hours.
We remove the parameters “Age”, “Gender”, “Height”, and “ICUType” as these attributes do not vary
in time. We also quantize the measurements for each attribute by the hour by averaging multiple
measurements within the same hour. This leaves 49 unique time stamps (the extra time stamp is for
observations at exactly the endpoint of the 48 hour observation period). We report all our losses on
this quantized data. We performed this rather coarse quantization for computational reasons having
to do with our particular implementation of this model. The validation split was obtained by taking
a random split of 20% of the trajectories from the full dataset. In total there are 8000 trajectories.
Code is included for processing the dataset, and links to downloading the data may be found in the
code for Rubanova et al. (2019). All other experimental details may be found in the main body and
appendices of Rubanova et al. (2019).
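
The hourly quantization can be sketched as follows. This is not the processing code referred to above: the function name is hypothetical, times are assumed to be given in hours from the start of the stay, and measurements are assigned to an hour by rounding down, which is one reading consistent with an extra time stamp for observations at exactly 48 hours.

```python
from collections import defaultdict
from math import floor


def quantize_hourly(times, values):
    """Average the measurements of one attribute that fall within the same hour."""
    buckets = defaultdict(list)
    for t, v in zip(times, values):
        buckets[floor(t)].append(v)
    return {hour: sum(vs) / len(vs) for hour, vs in sorted(buckets.items())}


# Two readings in hour 0 are averaged; the reading at exactly 48.0 gets its own stamp.
print(quantize_hourly([0.2, 0.7, 3.5, 48.0], [1.0, 3.0, 5.0, 7.0]))
```

With times restricted to the interval from 0 to 48 hours, this binning produces at most 49 distinct time stamps.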

                     B.4   Continuous Normalizing Flows

For the model trained on the MINIBOONE tabular dataset from Papamakarios et al. (2017), we used
the same architecture as in Table 4 in the appendix of Grathwohl et al. (2019). We chose the number
of epochs and a learning rate schedule based on manual tuning on the validation set, in contrast
to Grathwohl et al. (2019) who tuned these automatically using early stopping and an automatic
heuristic for the learning rate decay using evaluation on a validation set. In particular, we trained for
500 epochs with a learning rate of 1e-3 for the first 300 epochs, 1e-4 until epoch 425, and 1e-5
for the remaining 75 epochs. The number of epochs and learning rate schedule was determined by
evaluating the model on the validation set every 10 epochs, and decaying the learning rate by a factor
of 10 once the loss on the validation set stopped improving for several evaluations, with the goal of
matching or improving upon the log-likelihood reported in Grathwohl et al. (2019). The data was
obtained as made available from Papamakarios et al. (2017), which was already processed and split
into train/validation/test. In particular, the training set has 29556 examples, the validation set has
3284 examples, and the test set has 3648 examples, which consist of 43 features.
It is important to note that we implemented a single-flow model for the MNIST dataset, while the
original comparison in Finlay et al. (2020) was on a multi-flow model. This accounts for the discrepancy
between our bits/dim and NFE and those reported in Finlay et al. (2020).
All other experimental details are as in Grathwohl et al. (2019).

                              B.5   Hardware

MNIST Supervised learning, Physionet Time-series, and MNIST FFJORD experiments were trained
and evaluated on an NVIDIA Tesla P100 GPU. Tabular data FFJORD experiments were evaluated on
an NVIDIA Tesla P100 GPU but trained on an NVIDIA Tesla T4 GPU. All experiments except for MNIST
FFJORD were trained with double precision for purposes of reproducibility.

                     Appendix C         Additional Results

                               C.1   Overfitting of NFE


                                    <<FIGURE>>

                Figure 10: The difference in NFE is tracked by the variance of NFE.

In fig. 10 we note that there is a striking correspondence in the variance of NFE across individual
examples (in both the train set (dark red) and test set (light red)) and the absolute difference in NFE
between examples in the training set and test set. This suggests that any difference in the average
NFE between training examples and test examples is explained by noise in the estimate of the true
average NFE. It is also interesting that speed regularization does not have a monotonic relationship
with the variance of NFE. How this interacts with the correspondence between the NFE for a particular
example and the difficulty the model has in classifying it correctly remains a matter of speculation.

                     C.2         Trading off function evaluations with a surrogate loss

In fig. 11 and fig. 12 we confirm that our method offers a suitable tradeoff not only on the loss being
optimized, but also on the potentially non-differentiable loss which we truly care about. On MNIST,
we get a similar Pareto curve when plotting classification error as opposed to cross-entropy loss, and
on the time-series modelling task we get a similar Pareto curve on MSE loss as
compared to IWAE loss. The Pareto curves are plotted for R3 and R2, respectively.

                                                   <<FIGURE>>

                                         Figure 11: MNIST Classification                                                                                                 
                                         
                                                   <<FIGURE>>

                                         Figure 12: Physionet Time-Series

                                    C.3         Wall-clock Time

We include additional tables with wall-clock time and training with fixed grid solvers in table 3 and
table 4.


                           Appendix D          Comparison to How to Train Your Neural ODE

The terms from Finlay et al. (2020) are

                                    <<FORMULA>>

and an estimate of
                                    <<FORMULA>>

                       Table 3: Classification on MNIST

                                    <<TABLE>>

These are combined with a weighted average and integrated along the solution trajectory.
These terms are motivated by the expansion

                                    <<FORMULA>>

Namely, eq. (3) regularizes the first total derivative of the solution, f (z(t), t), along the trajectory, and
eq. (4) regularizes a stochastic estimate of the Frobenius norm of the spatial derivative, ∇z f (z(t), t),
along the solution trajectory.
In contrast, R2 regularizes the norm of the second total derivative directly. In particular, this takes
into account the ∂f/∂t term. In other words, this accounts for the explicit dependence of f on time,
while eq. (3) and eq. (4) capture only the implicit dependence on time through z(t).
Even in the case of an autonomous system, that is, where ∂f/∂t is identically 0 and the dynamics f only
depend implicitly on time, these terms still differ. Namely, R2 integrates the following along the
solution trajectory:

                                       <<FORMULA>>

while Finlay et al. (2020) penalize the respective norms of the matrix ∇z f (z(t), t) and vector
f (z(t), t) separately.
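To make the distinction concrete, the sketch below evaluates both kinds of terms on a toy autonomous linear system f(z) = Az. It assumes that the R2 integrand is the squared norm of the second total derivative, which by the chain rule is (∇z f) f + ∂f/∂t; the helper name `second_total_derivative` and the choice of A are ours and serve only as an illustration.

```python
import numpy as np


def second_total_derivative(jac, f_val, df_dt=None):
    """d^2 z / dt^2 = (grad_z f) f + df/dt, by the chain rule."""
    out = jac @ f_val
    if df_dt is not None:
        out = out + df_dt
    return out


# Toy autonomous dynamics f(z) = A z with a nilpotent A (a shear flow).
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
z = np.array([1.0, 2.0])
f_val = A @ z                       # velocity at z, here [2, 0]

# Assumed R2 integrand: squared norm of the second total derivative.
r2_integrand = float(np.sum(second_total_derivative(A, f_val) ** 2))

# Terms penalized separately in Finlay et al. (2020).
kinetic = float(np.sum(f_val ** 2))         # ||f||^2
jacobian_frobenius = float(np.sum(A ** 2))  # ||grad_z f||_F^2
```

For this shear flow every particle moves in a straight line at constant velocity, so the second total derivative vanishes and `r2_integrand` is zero, while `kinetic` and `jacobian_frobenius` are both strictly positive. The two regularizers therefore do not penalize the same dynamics, even in the autonomous case.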

                     Table 4: Density Estimation on Tabular Data (MINIBOONE)

                                       <<TABLE>>

<<END>> <<END>> <<END>>


<<START> <<START>> <<START>>


          How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization

                Chris Finlay 1 Jörn-Henrik Jacobsen 2 Levon Nurbekyan 3 Adam M Oberman 1
                                                                                                                        
                    Abstract

         Training neural ODEs on large datasets has not
         been tractable due to the necessity of allowing
         the adaptive numerical ODE solver to refine its
         step size to very small values. In practice this
         leads to dynamics equivalent to many hundreds
         or even thousands of layers. In this paper, we
         overcome this apparent difficulty by introducing
         a theoretically-grounded combination of both op-
         timal transport and stability regularizations which
         encourage neural ODEs to prefer simpler dynam-
         ics out of all the dynamics that solve a problem
         well. Simpler dynamics lead to faster conver-
         gence and to fewer discretizations of the solver,
         considerably decreasing wall-clock time without
         loss in performance. Our approach allows us to
         train neural ODE-based generative models to the
         same performance as the unregularized dynamics,                 
         with significant reductions in training time. This
         brings neural ODEs closer to practical relevance             
         in large-scale applications.

                                       <<FIGURE>>   

                        Figure 1. Optimal transport map and a generic normalizing flow.

                                                                      Indeed, it was observed that there is a striking similarity
 1. Introduction                                                      between ResNets and the numerical solution of ordinary
                                                                        differential equations (E, 2017; Haber & Ruthotto, 2017;
Recent research has bridged dynamical systems, a                     Ruthotto & Haber, 2018; Chen et al., 2018; 2019). In these
workhorse of mathematical modeling, with neural networks,            works, deep networks are interpreted as discretizations of
the de facto function approximator for high dimensional data.        an underlying dynamical system, where time indexes the
The great promise of this pairing is that the vast mathemat-         “depth” of the network and the parameters of the discretized
ical machinery stemming from dynamical systems can be                dynamics are learned. An alternate viewpoint was taken by
leveraged for modelling high dimensional problems in a               neural ODEs (Chen et al., 2018), where the dynamics of
dimension-independent fashion.                                       the neural network are approximated by an adaptive ODE
Connections between neural networks and ordinary differ-             solver on the fly. This latter approach is quite compelling
ential equations (ODEs) were almost immediately noted                as it does not require specifying the number of layers of the
after residual networks (He et al., 2016) were first proposed.       network beforehand. Furthermore, it allows the learning of
                                                                     homeomorphisms without any structural constraints on the
                                                                     function computed by the residual block.
                                                                     Neural ODEs have shown great promise in the physical sciences 
                                                                     (Köhler et al., 2019), in modeling irregular time series
                                                                     (Rubanova et al., 2019), mean field games (Ruthotto et al.,
                                                                     2019), continuous-time modeling (Yildiz et al., 2019; Kanaa
                                                                     et al., 2019), and for generative modeling through normaliz-
                                                                     ing flows with free-form Jacobians (Grathwohl et al., 2019).


Recent work has even adapted neural ODEs to the stochas-        based on (ODE) which abstain from a priori fixing step-size.
tic setting (Li et al., 2020). Despite these successes, some    Chen et al.’s method is a continuous-time generalization of
hurdles still remain. In particular, although neural ODEs are   residual networks, where the dynamics are generated by an
memory efficient, they can take a prohibitively long time to    adaptive ODE solver that chooses step-size on-the-fly.
train, which is arguably one of the main stumbling blocks
                                                                Because of their adaptive nature, neural ODEs can be more
towards their widespread adoption.
                                                                flexible than ResNets in certain scenarios, such as when
In this work we reduce the training time of neural ODEs         trading between model speed and accuracy. Moreover given
by regularizing the learned dynamics, complementing other       a fixed network depth, the memory footprint of neural ODEs
recent approaches to this end such as augmented neural          is orders of magnitude smaller than a standard ResNet dur-
ODEs (Dupont et al., 2019). Without further constraints on      ing training. They therefore show great potential on a host
their dynamics, high dimensional neural ODEs may learn          of applications, including generative modeling and density
dynamics which minimize an objective function, but which        estimation. An apparent drawback of neural ODEs is their
generate irregular solution trajectories. See for example       long training time: although a learned function f (· ; θ) may
Figure 1b, where an unregularized flow exhibits undesirable     generate a map that solves a problem particularly well, the
properties due to unnecessarily fluctuating dynamics. As        computational cost of numerically integrating (ODE) may
a solution, we propose two theoretically motivated regular-     be so prohibitive that it is not tractable in practice. In this
ization terms arising from an optimal transport viewpoint       paper we demonstrate this need not be so: with proper reg-
of the learned map, which encourage well-behaved dynam-         ularization, it is possible to learn f (· ; θ) so that (ODE) is
ics (see Figure 1a). We empirically demonstrate that proper     easily and quickly solved.
regularization leads to significant speed-up in training time
without loss in performance, thus bringing neural ODEs          2.1. FFJORD
closer to deployment on large-scale datasets. Our methods
are validated on the problem of generative modelling and        In density estimation and generative modeling, we wish
density estimation, as an example of where neural ODEs          to estimate an unknown data distribution p(x) from which
have shown impressive results, but could easily be applied      we have drawn N samples. Maximum likelihood seeks to
elsewhere.                                                      approximate p(x) with a parameterized distribution pθ (x)
                                                                by minimizing the Kullback-Leibler divergence between the
In summary, our proposed regularized neural ODE (RN-            two, or equivalently minimizing
ODE) achieves the same performance as the baseline, while
reducing the wall-clock training time by many hours or even                                    
days.                                                                               <<FORMULA>>              (1)
                                                                                             
2. Neural ODEs & Continuous normalizing                         Continuous normalizing flows (Grathwohl et al., 2019; Chen
    flows                                                       et al., 2018) parameterize pθ (x) using a vector field f :
                                                                R^d × R ↦ R^d as follows. Let z(x, T ) be the solution map
Neural ODEs simplify the design of deep neural networks         given by running the dynamics (ODE) for fixed time T .
by formulating the forward pass of a deep network as the        Suppose we are given a known distribution q at final time T ,
solution of an ordinary differential equation. Initial work     such as the normal distribution. Change of variables tells us
along these lines was motivated by the similarity of the eval-  that the distribution pθ (x) may be evaluated through
uation of one layer of a ResNet and the Euler discretization
of an ODE. Suppose the block in the t-th layer of a ResNet          <<\log p_\theta(x) = \log q(z(x, T)) + \log \det |\nabla z(x, T)|>>          (2)
is given by the function f (x, t; θ), where θ are the block’s
parameters. Then the evaluation of this layer of the ResNet     Evaluating the log determinant of the Jacobian is difficult.
is simply x_{t+1} = x_t + f (x_t, t; θ). Now, instead consider      Grathwohl et al. (2019) exploit the following identity from
the following ODE                                               fluid mechanics (Villani, 2003, p 114)

                <<FORMULA>>                        (ODE)                 <<\frac{\partial}{\partial t} \log \det |\nabla z(x, t)| = \operatorname{div}(f)(z(x, t), t)>>       (3)

The Euler discretization of this ODE with step-size <<τ>> is        where <<div(·)>> is the divergence operator, <<\operatorname{div}(f)(x) =
<<z_{t+1} = z_t + \tau f(z_t, t; \theta)>>, which is nearly identical to the      \sum_i \partial_{x_i} f_i(x)>>. By the fundamental theorem of calculus, we
forward evaluation of the ResNet’s layer (setting step-size          1
                                                                       In the normalizing flow literature divergence is typically writ-
<<τ = 1>> gives equality). Armed with this insight, Chen et al.     ten explicitly as the trace of the Jacobian, however we use div (·)
(2018) suggested a method for training neural networks          which is more common elsewhere.

                                                         <<FIGURE>>
Figure 2. Log-likelihood (measured in bits/dim) on the validation set as a function of wall-clock time. Rolling average of three hours, with
90% confidence intervals.
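The correspondence between a ResNet layer and an Euler step of (ODE), described in Section 2, can be checked numerically. The following is a minimal sketch: the residual block `f` is a hypothetical single tanh layer chosen only for illustration, and the names `resnet_forward` and `euler_solve` are ours.

```python
import numpy as np


def f(z, t, theta):
    """Hypothetical residual block: one tanh layer (illustration only)."""
    W, b = theta
    return np.tanh(W @ z + b)


def resnet_forward(x, thetas):
    """ResNet update x_{t+1} = x_t + f(x_t, t; theta_t)."""
    for t, theta in enumerate(thetas):
        x = x + f(x, t, theta)
    return x


def euler_solve(z, thetas, tau=1.0):
    """Explicit Euler update z_{t+1} = z_t + tau * f(z_t, t; theta_t)."""
    for t, theta in enumerate(thetas):
        z = z + tau * f(z, t, theta)
    return z


rng = np.random.default_rng(0)
thetas = [(0.1 * rng.standard_normal((3, 3)), np.zeros(3)) for _ in range(4)]
x0 = rng.standard_normal(3)

# With step-size tau = 1 the two forward passes coincide.
same = np.allclose(resnet_forward(x0, thetas), euler_solve(x0, thetas, tau=1.0))
```

With `tau = 1.0` the two functions perform the same arithmetic, so `same` is true; any other step-size gives a different discretization of the same vector field.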

may then rewrite (2) in integral form                                    From this simple motivating example, the need for regular-
                                                                         ity of the vector field is apparent. Without placing demands
                                                                         on the vector field f , it is entirely possible that the learned
 <<\log p_\theta(x) = \log q(z(x, T)) + \int_0^T \operatorname{div}(f)(z(x, s), s)\,ds>>
                                                                         dynamics will be poorly conditioned. This is not just a theo-
                                                               (4)       retical exercise: because the dynamics must be solved with
Remark 2.1 (Divergence trace estimate). In (Grathwohl                    a numerical integrator, poorly conditioned dynamics will
et al., 2019), the divergence is estimated using an unbiased             lead to difficulties during numerical integration of (ODE).
Monte-Carlo trace estimate (Hutchinson, 1990; Avron &                    Indeed, later we present results demonstrating a clear corre-
Toledo, 2011),                                                           lation between the number of time steps an adaptive solver
                                                                         takes to solve (ODE), and the regularity of f .       
             <<FORMULA>>             (5)                                 How can the regularity of the vector field be measured? One
                                                                         motivating approach is to measure the force experienced by
                                                                         a particle z(t) under the dynamics generated by the vector
By using the substitution (4), the task of maximizing log-               field f , which is given by the total derivative of f with
likelihood shifts from choosing pθ to minimize (1), to learn-            respect to time
ing the flow generated by a vector field f . This results in a
normalizing flow with a free-form Jacobian and reversible                      
dynamics, and was named FFJORD by Grathwohl et al. (2019).                                   <<FORMULA>>                      (6)
                                                                                      
2.2. The need for regularity                                                                  <<FORMULA>>                  (7)

The vector field learned through FFJORD that maximizes                  Well conditioned flows will place constant, or nearly con-
the log-likelihood is not unique, and raises troubling prob-             stant, force on particles as they travel. Thus, in this work we
lems related to the regularity of the flow. For a simple                 propose regularizing the dynamics with two penalty terms,
example, refer to Figure 1, where we plot two normaliz-                  one term regularizing f and the other ∇ f . The first penalty,
ing flows, both mapping a toy one-dimensional distribution               presented in Section 3, is a measure of the distance travelled
to the unit Gaussian, and where both maximize the log-                   under the flow f , and can alternatively be interpreted as the
likelihood of exactly the same sample of particles. Figure               kinetic energy of the flow. This penalty term is based on
1a presents a “regular” flow, where particles travel in straight         numerical methods in optimal transport, and encourages
lines with constant speed. In contrast, Figure 1b                        particles to travel in straight lines with constant speed. The
shows a flow that still maximizes the log-likelihood, but                second penalty term, discussed in Section 4, performs regu-
that has undesirable properties, such as rapidly varying local           larization on the Jacobian of the vector field. Taken together
trajectories and non-constant speed.                                     the two terms ensure that the force experienced by a particle

under the flow is constant or nearly so.                         3.1. Linking normalizing flows to optimal transport

These two regularizers will promote dynamics that follow         Now suppose we wish to minimize (9a), with q(z) a unit
numerically easy-to-integrate paths, thus greatly speeding       normal distribution, and p(x) a data distribution, unknown
up training time.                                                to us, but from which we have drawn N samples, and which
                                                                 we model as a discrete distribution of Dirac masses. Enforc-
3. Optimal transport maps &                                      ing the initial condition is trivial because we have sampled
                                                                 from p directly. The continuity equation (9b) need not be
   Benamou-Brenier
                                                                 enforced because we are tracking a finite number of sam-
There is a remarkable similarity between density estimation      pled particles. However the final time condition ρT = q
using continuous time normalizing flows, and the calcula-        cannot be implemented directly, since we do not have di-
tion of the optimal transport map between two densities          rect control on the form ρT (z) takes. Instead, introduce
using the Benamou-Brenier formulation (Benamou & Bre-            a Kullback-Leibler term to (9a) penalizing discrepancy
nier, 2000; Santambrogio, 2015). While a review of optimal       between ρT and q. This penalty term has an elegant simpli-
transport theory is far outside the scope of this paper, here    fication when p(x) is modeled as a distribution of a finite
we provide an informal summary of key ideas relevant to          number of masses, as is done in generative modeling. Set-
continuous normalizing flows. The quadratic-cost optimal         ting ρ0 = pθ, a brief derivation yields
transport map between two densities p(x) and q(x) is a map       
z : R^d ↦ R^d minimizing the transport cost
                                                                                <<FORMULA>>                 (10)

              <<FORMULA>>                  (8)
                                                                 With this simplification (9a) becomes

subject to the constraint that <<\int_A q(z)\,dz = \int_{z^{-1}(A)} p(x)\,dx>>,
in other words that the measure of any set A is preserved                      
under the map z. In a seminal work, Benamou & Brenier                            <<FORMULA>>                (11)
(2000) showed that rather than solving for minimizers of (8)
directly, an indirect (but computationally efficient) method
is available by writing z(x, T ) as the solution map of a
flow under a vector field f (as in (ODE)) for time T , by        For further details on this derivation consult the supplemen-
minimizing                                                       tary materials.
                                                                 The connection between the Benamou-Brenier formulation
                      <<FORMULA>>                          (9a)   of the optimal transport problem on a discrete set of points
                                                                 and continuous normalizing flows is apparent: the optimal
                                                                 transport problem (11) is a regularized form of the continu-
                      <<FORMULA>>                          (9b)  ous normalizing flow optimization problem (1). We there-
                      <<ρ0 (x) = p>>,                      (9c)  fore expect that adding a kinetic energy regularization term
                      <<ρT (z) = q>>.                      (9d)  to FFJORD will encourage solution trajectories to prefer
                                                                 straight lines with constant speed.
 The objective function (9a) is a measure of the kinetic
energy of the flow. The constraint (9b) ensures probability
mass is conserved. The latter two constraints guarantee the      4. Unbiased Frobenius norm regularization of
learned distribution agrees with the source p and target q.          the Jacobian
Note that the kinetic energy (9a) is an upper bound on the
                                                                 Referring to equation (7), one can see that even if f is regu-
transport cost, with equality only at optimality.
                                                                 larized to be small, via a kinetic energy penalty term, if the
The optimal flow f minimizing (9) has several particularly       Jacobian is large then the force experienced by a particle
appealing properties. First, particles induced by the opti-      may also still be large. As a result, the error of the numerical
mal flow f travel in straight lines. Second, particles travel    integrator can be large, which may lead an adaptive solver
with constant speed. Moreover, under suitable conditions         to make many function evaluations. This relationship is
on the source and target distributions, the optimal solution     apparent in Figure 3, where we empirically demonstrate the
map is unique (Villani, 2008). Therefore the solution map        correlation between the number of function evaluations of
z(x, t) is entirely characterized by the initial and final posi- f taken by the adaptive solver, and the size of the Jacobian
tions: z(x, t) = (1 − t/T) z(x, 0) + (t/T) z(x, T). Consequently, norm of f . The correlation is remarkably strong: dynamics
given an optimal f it is extraordinarily easy to solve (ODE)     governed by a poorly conditioned Jacobian matrix require
numerically with minimal computational effort.                   the adaptive solver to take many small time steps.
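The claim that straight-line, constant-speed trajectories are easy to integrate can be illustrated with a toy comparison. The sketch below is not a learned flow: it contrasts a hand-picked constant-velocity field with a rotational field, both integrated with a single explicit Euler step. The helper `euler` and both vector fields are assumptions made for illustration.

```python
import numpy as np


def euler(f, z0, T, n_steps):
    """Explicit Euler integration of dz/dt = f(z, t) from t = 0 to t = T."""
    z = np.asarray(z0, dtype=float)
    dt = T / n_steps
    for k in range(n_steps):
        z = z + dt * f(z, k * dt)
    return z


T = 1.0
z0 = np.array([1.0, 0.0])

# Constant-velocity dynamics: a straight line travelled at constant speed.
v = np.array([-2.0, 1.0])
straight = lambda z, t: v
exact_straight = z0 + T * v

# Rotational dynamics from the same start point: a curved path.
omega = np.pi
rotate = lambda z, t: omega * np.array([-z[1], z[0]])
exact_rotate = np.array([np.cos(omega * T), np.sin(omega * T)])

# Error of a single Euler step in each case.
err_straight = np.linalg.norm(euler(straight, z0, T, 1) - exact_straight)
err_rotate = np.linalg.norm(euler(rotate, z0, T, 1) - exact_rotate)
```

A single Euler step recovers the straight-line endpoint exactly, in agreement with the interpolation formula z(x, t) = (1 − t/T) z(x, 0) + (t/T) z(x, T), whereas the rotational field incurs a large error and would force an adaptive solver to take many more steps.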


Algorithm 1 RNODE: regularized neural ODE training of
FFJORD
         <<ALGORITHM>>

                                                                                 <<FIGURE>>

                                                                 Figure 3. Number of function evaluations vs Jacobian Frobenius
                                                                norm of flows on CIFAR10 during training with vanilla FFJORD,
                                                                 using an adaptive ODE solver.
                                                                 Avron & Toledo, 2011). For real matrix B, an unbiased
                     <<FORMULA>>                                 estimate of the trace is given by

                                                                                    <<FORMULA>>                 (14)

                                                                 where <<FORMULA>> is drawn from a unit normal distribution. 
                                                                 Thus the squared Frobenius norm can be easily estimated by 
                                                                 setting B = AA^T.
Moreover, in particle-based methods, the kinetic energy          Turning to the Jacobian <<FORMULA>> of a vector valued func-
term forces dynamics to travel in straight lines only on         tion f : R^d ↦ R^d, recall that the vector-Jacobian product
data seen during training, and so the regularity of the map      <<FORMULA>> may be quickly computed through reverse-mode
is only guaranteed on trajectories taken by training data.       automatic differentiation. Therefore an unbiased Monte-
The issue here is one of generalization: the map may be          Carlo estimate of the Frobenius norm of the Jacobian is
irregular on off-distribution or perturbed images, and cannot    readily available
be remedied by the kinetic energy term during training alone.
In the context of generalization, Jacobian regularization is                    <<FORMULA>>                         (15)
analogous to gradient regularization, which has been shown
to improve generalization (Drucker & LeCun, 1992; Novak                         <<FORMULA>>                         (16)
et al., 2018).

For these reasons, we also propose regularizing the Jacobian     Conveniently, in the FFJORD framework the quantity
through its Frobenius norm. The Frobenius norm ‖ · ‖_F of a      <<FORMULA>> must be computed during the estimate of the prob-
real matrix A can be thought of as the ℓ2 norm of the matrix     ability distribution under the flow, in the Monte-Carlo esti-
A vectorized                                                     mate of the divergence term (5). Thus Jacobian Frobenius
                      <<FORMULA>>                         (12)   norm regularization is available with essentially no extra
                                                                 computational cost.
Equivalently it may be computed as
                                                                 5. Algorithm description
                               
                     <<\|A\|_F = \sqrt{\operatorname{tr}(A A^{\mathsf{T}})}>>                   (13)   All together, we propose modifying the objective function
                                                                 of the FFJORD continuous normalizing flow (Grathwohl
and is the Euclidean norm of the singular values of a matrix.    et al., 2019) with the two regularization penalties of Sec-
In trace form, the Frobenius norm lends itself to estimation     tions 3 & 4. The proposed method is called RNODE, short
using a Monte-Carlo trace estimator (Hutchinson, 1990;           for regularized neural ODE. Pseudo-code of the method is

                                                         <<TABLE>>

Table 1. Log-likelihood (in bits/dim) and training time (in hours) on validation images with uniform dequantization. Results on clean
images are found in the supplemental materials. For comparison we report both the results of the original FFJORD paper (Grathwohl
et al., 2019) and our own independent run of FFJORD (“vanilla”) on CIFAR10 and MNIST. Vanilla FFJORD did not train on ImageNet64
(denoted by “x”). Also reported are results for other flow-based generative modeling papers. Our method (FFJORD with RNODE) has
comparable log-likelihood to FFJORD but is significantly faster.
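The Monte-Carlo estimate of the squared Frobenius norm from Section 4 can be sketched in a few lines. In this illustration an explicit matrix A stands in for the Jacobian and the closure `vjp` stands in for the vector-Jacobian product that reverse-mode automatic differentiation would supply; both, along with the function name `hutchinson_frobenius_sq`, are assumptions rather than the authors' implementation.

```python
import numpy as np


def hutchinson_frobenius_sq(vjp, dim, n_samples, rng):
    """Estimate ||A||_F^2 = tr(A A^T) as the mean of ||eps^T A||^2, eps ~ N(0, I)."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(dim)
        total += np.sum(vjp(eps) ** 2)
    return total / n_samples


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))      # stand-in for the Jacobian of f
vjp = lambda eps: eps @ A            # stand-in for reverse-mode autodiff (eps^T A)

estimate = hutchinson_frobenius_sq(vjp, dim=5, n_samples=20000, rng=rng)
exact = np.sum(A ** 2)               # squared Frobenius norm computed directly
relative_error = abs(estimate - exact) / exact
```

Each sample needs only one vector-Jacobian product, which is why the estimate is cheap when that product is already computed for the divergence term. In training a single sample per step is typical of such estimators; the large sample count here is used only to show convergence to the exact value.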
   

                                                              <<FIGURE>>

Figure 4. Quality of generated samples on 5bit CelebA-HQ64 with RNODE. Here temperature annealing (Kingma & Dhariwal,
2018) with T = 0.7 was used to generate visually appealing images. For full sized CelebA-HQ256 samples, consult the supplementary
materials.

presented in Algorithm 1. The optimization problem to be                   Here E, l, and n are respectively the kinetic energy, the
solved is                                                                  log determinant of the Jacobian, and the integral of the
                                                                           Frobenius norm of the Jacobian.
                                                                           Both the divergence term and the Jacobian Frobenius norm
                                                                           are approximated with Monte-Carlo trace estimates. In our
                     <<FORMULA>>                                           implementation, the Jacobian Frobenius estimate reuses
                                                                           the computation ε^T ∇ f from the divergence estimate for
                                                                           efficiency. We remark that the kinetic energy term only
                     <<FORMULA>>                                           requires the computation of a dot product. Thus just as
                                                                           in FFJORD, our implementation scales linearly with the
                     <<FORMULA>>             (17)                          number of time steps taken by the ODE solver.

                                                                           Gradients of the objective function with respect to the net-
where z(x, t) is determined by numerically solving (ODE).                  work parameters are computed using the adjoint sensitivity
Note that we take the mean over number of samples and                      method (Pontryagin et al., 1962; Chen et al., 2018).
input dimension. This is to ensure that the choice of regu-
larization strength λK and λJ is independent of dimension
size and sample size.                                                      6. Experimental design
To compute the three integrals and the log-probability under               Here we demonstrate the benefits of regularizing neural
q of z(x, T ) at final time T , we augment the dynamics of                 ODEs on generative models, an application where neu-
the ODE with three extra terms, so that the entire system                  ral ODEs have shown strong empirical performance. We
solved by the numerical integrator is                                      use four datasets: CIFAR10 (Krizhevsky & Hinton, 2009),
                                                                           MNIST (LeCun & Cortes, 1998), downsampled ImageNet
                                                                           (64x64) (van den Oord et al., 2016), and 5bit CelebA-HQ
                                                                           (256x256) (Karras et al., 2017). We use an identical neural
        <<FORMULA>>                                   (RNODE)              architecture to that of Grathwohl et al. (2019). The dynamics
                                                                           (Kingma & Dhariwal, 2018) trained with 40 GPUs for a week;
                                                                           in contrast we train with four GPUs in just under a week.
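
The augmented system (RNODE) described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a toy dynamics function f(z) = tanh(Wz + b), whose vector-Jacobian product is available in closed form, in place of the neural network and automatic differentiation. The names `make_toy_dynamics`, `augmented_rhs` and `rk4_integrate` are hypothetical. The sketch shows how a single product of the noise vector with the Jacobian yields both the divergence estimate and the Jacobian Frobenius norm estimate, while the kinetic energy term needs only a dot product. The noise vector is held fixed over the whole integration for simplicity, and the sign convention of the log-density term is left to the caller.

```python
import numpy as np


def make_toy_dynamics(dim, seed=0):
    """Hypothetical stand-in for the learned dynamics: f(z) = tanh(W z + b)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.5, size=(dim, dim))
    b = rng.normal(scale=0.1, size=dim)

    def f(z):
        return np.tanh(W @ z + b)

    def vjp(z, eps):
        # eps^T J_f(z), where J_f(z) = diag(1 - tanh^2(W z + b)) W for this toy f.
        s = 1.0 - np.tanh(W @ z + b) ** 2
        return (eps * s) @ W

    return f, vjp


def augmented_rhs(z, eps, f, vjp):
    """Time derivatives of the state and of the three extra scalar terms."""
    dz = f(z)
    eps_J = vjp(z, eps)        # one vector-Jacobian product, reused twice below
    div_est = eps_J @ eps      # Hutchinson-style estimate of tr(J_f) = div f
    kinetic = dz @ dz          # ||f||^2, a plain dot product
    jac_frob = eps_J @ eps_J   # estimate of the squared Frobenius norm of J_f
    return dz, div_est, kinetic, jac_frob


def rk4_integrate(z0, eps, f, vjp, t1=1.0, step=0.25):
    """Fixed-grid four-stage Runge-Kutta integration of the augmented system."""
    d = z0.size
    state = np.concatenate([z0, np.zeros(3)])  # extra terms start at zero

    def rhs(s):
        dz, div_est, kinetic, jac_frob = augmented_rhs(s[:d], eps, f, vjp)
        return np.concatenate([dz, [div_est, kinetic, jac_frob]])

    for _ in range(int(round(t1 / step))):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * step * k1)
        k3 = rhs(state + 0.5 * step * k2)
        k4 = rhs(state + step * k3)
        state = state + step / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return state[:d], state[d:]


if __name__ == "__main__":
    f, vjp = make_toy_dynamics(dim=4)
    rng = np.random.default_rng(1)
    z_T, extras = rk4_integrate(rng.standard_normal(4), rng.standard_normal(4), f, vjp)
    print("z(T):", z_T)
    print("integrated divergence, kinetic energy, Jacobian norm:", extras)
```

With zero dynamics the sketch returns the input unchanged and all three integrals equal to zero, which mirrors the identity-map initialization used for training.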
 
                                                      <<FIGURE>>

Figure 5. Ablation study of the effect of the two regularizers, comparing two measures of flow regularity during training with a fixed
step-size ODE solver. Figure 5a: mean Jacobian Frobenius norm as a function of training epoch. Figure 5b: mean kinetic energy of the
flow as a function of training epoch. Figure 5c: number of function evaluations.

are defined by a neural network <<f(z, t; \theta(t)) : \mathbb{R}^d \times \mathbb{R}_+ \mapsto \mathbb{R}^d>>
where <<\theta(t)>> is piecewise constant in time. On MNIST we use 10 pieces; CIFAR10 uses 14;
downsampled ImageNet uses 18; and CelebA-HQ uses 26 pieces. Each piece is a 4-layer deep
convolutional network composed of 3x3 kernels and softplus activation functions. Intermediary
layers have 64 hidden dimensions, and time t is concatenated to the spatial input z. The
integration time of each piece is [0, 1]. Weight matrices are chosen to imitate the multi-scale
architecture of Real NVP (Dinh et al., 2017), in that images are ‘squeezed’ via a permutation
to halve image height and width but quadruple the number of channels. Divergence of f is
estimated using the Gaussian Monte-Carlo trace estimator with one sample of fixed noise per
solver time-step.

On MNIST and CIFAR10 we train with a batch size of 200 and train for 100 epochs on a single
GPU³, using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e−3. On the two
larger datasets, we train with four GPUs, using a per-GPU batch size of respectively 3 and 50
for CelebA-HQ and ImageNet. Data is preprocessed by perturbing with uniform noise followed by
the logit transform.

The reference implementation of FFJORD solves the dynamics using a Runge-Kutta 4(5) adaptive
solver (Dormand & Prince, 1980) with error tolerances 1e−5 and initial step size 1e−2. We have
found that using less accurate solvers on the reference implementation of FFJORD results in
numerically unstable training dynamics. In contrast, a simple fixed-grid four stage Runge-Kutta
solver suffices for RNODE during training on MNIST and CIFAR10, using a step size of 0.25. The
step size was determined based on a simple heuristic of starting with 0.5 and decreasing the
step size by a factor of two until the discrete dynamics were stable and achieved good
performance. The Runge-Kutta 4(5) adaptive solver was used on the two larger datasets. We have
also observed that RNODE improves the training time of the adaptive solvers as well, requiring
many fewer function evaluations; however, in Python we have found that the fixed grid solver is
typically quicker at a specified number of function evaluations. At test time RNODE uses the
same adaptive solver as FFJORD.

We always initialize RNODE so that <<f(z, t) = 0>>; thus training begins with an initial
identity map. This is done by zeroing the parameters of the last layer in each piece (block),
following Goyal et al. (2017). The identity map is an appropriate choice because it has zero
transport cost and zero Frobenius norm. Moreover, the identity map is trivially solvable for any
numerical solver, thus training begins without any effort required on the solver’s behalf.

On all datasets we set both the kinetic energy regularization coefficient λK and the Jacobian
norm coefficient λJ to 0.01.

7. Results

A comparison of RNODE against FFJORD and other flow-based generative models is presented in
Table 1. We report both our own run of “vanilla” FFJORD and the results as originally reported
in (Grathwohl et al., 2019). We highlight that RNODE runs roughly 2.8x faster than FFJORD on
both datasets, while achieving or surpassing the performance of FFJORD. This can further be
seen in Figure 2 where we plot bits per dimension (<<-\frac{1}{d} \log_2 p(x)>>, a normalized
measure of log-likelihood) on the validation set as a function of training epoch, for both
datasets. Visual inspection of the
sample quality reveals no qualitative difference between

                                          <<FIGURE>>

          Figure 6. Quality of generated samples with and without regularization on MNIST, left, and CIFAR10, right.

regularized and unregularized approaches; refer to Figure 6. Generated images for downsampled
ImageNet and CelebA-HQ are deferred to the supplementary materials; we provide smaller
generated images for networks trained on CelebA-HQ 64x64 in Figure 4.

Surprisingly, our run of “vanilla” FFJORD achieved slightly better performance than the results
reported in (Grathwohl et al., 2019). We suspect the discrepancy in performance and run times
between our implementation of FFJORD and that of the original paper is due to batch size:
Grathwohl et al. use a batch size of 900 and train on six GPUs, whereas on MNIST and CIFAR10 we
use a batch size of 200 and train on a single GPU.

We were not able to train vanilla FFJORD on ImageNet64, due to numerical underflow in the
adaptive solver’s time step. This issue cannot be remedied by increasing the solver’s error
tolerance, for this would bias the log-likelihood estimates on validation.

7.1. Ablation study on MNIST

In Figure 5, we compare the effect of each regularizer by itself on the training dynamics with
the fixed grid ODE solver on the MNIST dataset. Without any regularization at all, training
dynamics are numerically unstable and fail after just under 50 epochs. This is precisely when
the Jacobian norm grows large; refer to Figure 5a. Figure 5a demonstrates that each regularizer
by itself is able to control the Jacobian norm. The Jacobian regularizer is better suited to
this task, although it is interesting that the kinetic energy regularizer also improves the
Jacobian norm. Unsurprisingly, Figure 5b demonstrates that the addition of the kinetic energy
regularizer encourages flows to travel a minimal distance. In addition, we see that the
Jacobian norm alone also has a beneficial effect on the distance particles travel. Overall, the
results support our theoretical reasoning empirically.

8. Previous generative flows inspired by optimal transport

Zhang et al. (2018) define a neural ODE flow where the dynamics are given as the gradient of a
scalar potential function. This interpretation has deep connections to optimal transport: the
optimal transport map is the gradient of a convex potential function. Yang & Karniadakis (2019)
continue along these lines, and define an optimal transport map again as a scalar potential
gradient. Yang & Karniadakis (2019) enforce that the learned map is in fact an optimal
transport map by penalizing their objective function with a term measuring violations of the
continuity equation. Ruthotto et al. (2019) place generative flows within a broader context of
mean field games, and as an example consider a neural ODE gradient potential flow solving the
optimal transport problem in up to 100 dimensions. We also note the recent work of Twomey et
al. (2019), who proposed regularizing neural ODEs with an Euler-step discretization of the
kinetic energy term to enforce ‘straightness’, although connections to optimal transport were
not discussed.

When a flow is the gradient of a scalar potential, the change of variables formula (4)
simplifies so that the divergence term is replaced by the Laplacian of the scalar potential.
Although mathematically parsimonious and theoretically well-motivated, we chose not to
implement our flow as the gradient of a scalar potential function due to computational
constraints: such an implementation would require ‘triple backprop’ (twice to compute or
approximate the Laplacian, and once more for the parameter gradient). Ruthotto et al. (2019)
circumvented this problem by utilizing special structural properties of residual networks to
efficiently compute the Laplacian.

9. Discussion

In practice, RNODE is simple to implement, and only requires augmenting the dynamics (ODE) with
two extra scalar equations (one for the kinetic energy term, and another for the Jacobian
penalty). In the setting of FFJORD, because we may recycle intermediary terms used in the
divergence estimate, the computational cost of evaluating these two extra equations is minimal.
RNODE introduces two extra hyperparameters related to the strength of the regularizers; we have
found these required almost no tuning.

Although the problem of classification was not considered in this work, we believe RNODE may
offer similar improvements both in training time and the regularity of the classifier learned.
In the classification setting we expect the computational overhead of calculating the two extra
terms to be marginal relative to gains made in training time.

10. Conclusion

We have presented RNODE, a regularized method for neural ODEs. This regularization approach is
theoretically well-motivated, and encourages neural ODEs to learn well-behaved dynamics. As a
consequence, numerical integration of the learned dynamics is straightforward and relatively
easy, which means fewer discretization steps are needed to solve the dynamics. In many
circumstances, this allows for the replacement of adaptive solvers with fixed grid solvers,
which can be more efficient during training. This leads to a substantial speedup in training
time, while still maintaining the same empirical performance, opening the use of neural ODEs to
large-scale applications.

Acknowledgements

C. F. and A. O. were supported by a grant from the Innovative Ideas Program of the Healthy
Brains and Healthy Lives initiative (HBHL) through McGill University.

L. N. was supported by AFOSR MURI FA9550-18-1-0502, AFOSR Grant No. FA9550-18-1-0167, and ONR
Grant No. N00014-18-1-2527.

A. O. was supported by the Air Force Office of Scientific Research under award number
FA9550-18-1-0167.

Resources used in preparing this research were provided, in part, by the Province of Ontario,
the Government of Canada through CIFAR, and companies sponsoring the Vector Institute
(www.vectorinstitute.ai/#partners).

References

Avron, H. and Toledo, S. Randomized algorithms for estimating the trace of an implicit
  symmetric positive semi-definite matrix. J. ACM, 58(2):8:1–8:34, 2011. doi:
  10.1145/1944345.1944349. URL https://doi.org/10.1145/1944345.1944349.

Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J. Invertible residual
  networks. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th
  International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,
  California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 573–582. PMLR,
  2019. URL http://proceedings.mlr.press/v97/behrmann19a.html.

Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the
  Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural Ordinary Differential
  Equations. In Advances in Neural Information Processing Systems 31: Annual Conference on
  Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal,
  Canada, pp. 6572–6583, 2018. URL
  http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.

Chen, T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J. Residual flows for invertible
  generative modeling. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F.,
  Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32:
  Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December
  2019, Vancouver, BC, Canada, pp. 9913–9923, 2019. URL
  http://papers.nips.cc/paper/9183-residual-flows-for-invertible-generative-modeling.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In 5th
  International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26,
  2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.

Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of
  computational and applied mathematics, 6(1):19–26, 1980.
Drucker, H. and LeCun, Y. Improving generalization performance using double backpropagation.
  IEEE Trans. Neural Networks, 3(6):991–997, 1992. doi: 10.1109/72.165600. URL
  https://doi.org/10.1109/72.165600.

Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Wallach, H. M., Larochelle,
  H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural
  Information Processing Systems 32: Annual Conference on Neural Information Processing Systems
  2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 3134–3144, 2019. URL
  http://papers.nips.cc/paper/8577-augmented-neural-odes.

E, W. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and
  Statistics, 5(1):1–11, March 2017. ISSN 2194-671X. doi: 10.1007/s40304-017-0103-z. URL
  https://doi.org/10.1007/s40304-017-0103-z.

Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A.,
  Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR,
  abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD:
  free-form continuous dynamics for scalable reversible generative models. In 7th International
  Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
  URL https://openreview.net/forum?id=rJxgknCcK7.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems,
  34(1):014004, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016
  IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA,
  June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL
  https://doi.org/10.1109/CVPR.2016.90.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based
  generative models with variational dequantization and architecture design. In Chaudhuri, K.
  and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine
  Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of
  Machine Learning Research, pp. 2722–2730. PMLR, 2019. URL
  http://proceedings.mlr.press/v97/ho19a.html.

Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for Laplacian
  smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450,
  1990.

Kanaa, D., Voleti, V., Kahou, S., and Pal, C. Simple video generation using neural ODEs.
  Workshop on Learning with Rich Experience, Advances in Neural Information Processing Systems
  32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14
  December 2019, Vancouver, BC, Canada, 2019.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved
  quality, stability, and variation. CoRR, abs/1710.10196, 2017. URL
  http://arxiv.org/abs/1710.10196.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International
  Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
  Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In
  Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R.
  (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural
  Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp.
  10236–10245, 2018. URL
  http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: sampling configurations for multi-body
  systems with symmetric energies. arXiv preprint arXiv:1910.00753, 2019.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical
  report, University of Toronto, 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits. 1998. URL
  http://yann.lecun.com/exdb/mnist/.

Li, X., Wong, T. L., Chen, R. T. Q., and Duvenaud, D. Scalable gradients for stochastic
  differential equations. CoRR, abs/2001.01328, 2020. URL http://arxiv.org/abs/2001.01328.

Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and
  generalization in neural networks: an empirical study. In 6th International Conference on
  Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference
  Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.

Pontryagin, L. S., Mishchenko, E., Boltyanskii, V., and Gamkrelidze, R. The mathematical theory
  of optimal processes. 1962.

Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent ordinary differential equations for
  irregularly-sampled time series. In Advances in Neural Information Processing Systems, pp.
  5321–5331, 2019.

Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations.
  Journal of Mathematical Imaging and Vision, pp. 1–13, 2018.

Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and Fung, S. W. A machine learning framework
  for solving high-dimensional mean field game and mean field control problems. CoRR,
  abs/1912.01825, 2019. URL http://arxiv.org/abs/1912.01825.

Santambrogio, F. Benamou-Brenier and other continuous numerical methods, pp. 219–248. Springer
  International Publishing, Cham, 2015. ISBN 978-3-319-20828-2. doi:
  10.1007/978-3-319-20828-2_6. URL https://doi.org/10.1007/978-3-319-20828-2_6.

Twomey, N., Kozlowski, M., and Santos-Rodríguez, R. Neural ODEs with stochastic vector field
  mixtures. CoRR, abs/1905.09905, 2019. URL http://arxiv.org/abs/1905.09905.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. CoRR,
  abs/1601.06759, 2016. URL http://arxiv.org/abs/1601.06759.

Villani, C. Topics in Optimal Transportation. Graduate studies in mathematics. American
  Mathematical Society, 2003. ISBN 9780821833124.

Villani, C. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften.
  Springer Berlin Heidelberg, 2008. ISBN 9783540710509. URL
  https://books.google.ca/books?id=hV8o5R7_5tkC.

Yang, L. and Karniadakis, G. E. Potential flow generator with L2 Optimal Transport regularity
  for generative models. CoRR, abs/1908.11462, 2019. URL http://arxiv.org/abs/1908.11462.

Yildiz, C., Heinonen, M., and Lähdesmäki, H. ODE2VAE: deep generative second order ODEs with
  Bayesian neural networks. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc,
  F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32:
  Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December
  2019, Vancouver, BC, Canada, pp. 13412–13421, 2019. URL
  http://papers.nips.cc/paper/9497-ode2vae-deep-generative-second-order-odes-with-bayesian-neural-networks.

Zhang, L., E, W., and Wang, L. Monge-Ampère flow for generative modeling. CoRR,
  abs/1809.10188, 2018. URL http://arxiv.org/abs/1809.10188.

A. Details of Section 3.1: Benamou-Brenier                         Hence, multiplying the objective function in (20) by λ and
    formulation in Lagrangian coordinates                          ignoring the f -independent term Ex∼p log p(x) we obtain
                                                                   an equivalent objective function
The Benamou-Brenier formulation of the optimal transporta-                   
tion (OT) problem in Eulerian coordinates is                        
                                                                      <<FORMULA>>                      (21)

                  <<FORMULA>>                             (18a)

                                                                   Finally, if we assume that <<\{x_i\}_{i=1}^N>> are i.i.d. samples from p,
                  <<FORMULA>>                             (18b)    we obtain the empirical objective function

                  <<ρ0 (x) = p>>,                        (18c)         

                  <<ρT (z) = q>>.                      (18d)                 <<FORMULA>>                         (22)

 The connection between continuous normalizing flows
(CNF) and OT becomes transparent once we rewrite (18) in
Lagrangian coordinates. Indeed, for regular enough velocity
                                                                   B. Additional results
fields f one has that the solution of the continuity equation      Here we present additional generated samples on the two
(18b), (18c) is given by ρt = z(·, t)♯p where z is the flow        larger datasets considered, CelebA-HQ and ImageNet64. In
                                                                   addition bits/dim on clean images are reported in Table 2.
         <<FORMULA>>

The relation ρt = z(·, t)♯p means that for an arbitrary test
function φ we have that

               <<\int \varphi(x) \, \rho_t(x) \, dx = \int \varphi(z(x, t)) \, p(x) \, dx>>
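
As a numerical illustration of this relation, the following sketch compares both sides for one simple case. Everything specific here is an assumption chosen for illustration and does not come from the paper: p is a one-dimensional standard Gaussian, the flow is the translation z(x, t) = x + tc, so that ρt is a Gaussian with mean tc and unit variance, and the function names are hypothetical. The left-hand side is evaluated by a Riemann sum over ρt, the right-hand side by Monte Carlo over samples from p.

```python
import numpy as np


def flow(x, t, c=1.5):
    """Hypothetical flow map z(x, t) = x + t * c (constant velocity field f = c)."""
    return x + t * c


def pushforward_expectation(phi, t, n_samples=200_000, seed=0):
    """Right-hand side: Monte Carlo estimate of E_{x ~ p}[phi(z(x, t))], p = N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    return phi(flow(x, t)).mean()


def density_expectation(phi, t, c=1.5):
    """Left-hand side: Riemann sum of phi(x) * rho_t(x), with rho_t = N(t * c, 1)."""
    grid = np.linspace(-10.0, 12.0, 20_001)
    dx = grid[1] - grid[0]
    rho_t = np.exp(-0.5 * (grid - t * c) ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(phi(grid) * rho_t) * dx


if __name__ == "__main__":
    phi = lambda y: y ** 2  # an arbitrary smooth test function
    print("integral against rho_t:   ", density_expectation(phi, t=1.0))
    print("expectation over x ~ p:   ", pushforward_expectation(phi, t=1.0))
```

The two printed values should agree up to Monte Carlo error, which is the practical content of the push-forward relation: integrals against ρt can be replaced by averages over transported samples.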

Therefore (18) can be rewritten as

   <<\min_{f} \int_0^T \int \| f(z(x, t), t) \|^2 \, p(x) \, dx \, dt>>               (19a)

   <<\text{subject to} \quad \dot{z}(x, t) = f(z(x, t), t)>>,       (19b)

                      <<z(x, 0) = x>>,                     (19c)

                      <<z(\cdot, T)_\sharp p = q>>.                  (19d)
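
A small sketch may help make the Lagrangian objective concrete. It is an illustration under stated assumptions, not the paper's code: particles are sampled from a two-dimensional standard Gaussian standing in for p, the trajectories (19b), (19c) are advanced with a plain Euler update using the velocity sampled at the midpoint time of each step, and the names `lagrangian_kinetic_energy`, `constant_speed` and `accelerating` are hypothetical. Both velocity fields translate every particle by the same amount, so they share the same terminal distribution, but the constant-speed field has the lower cost (19a).

```python
import numpy as np


def lagrangian_kinetic_energy(f, xs, T=1.0, n_steps=100):
    """Estimate (19a) as an average over particles xs drawn from p."""
    z = np.array(xs, dtype=float)        # (19c): z(x, 0) = x
    dt = T / n_steps
    energy = np.zeros(z.shape[0])
    for k in range(n_steps):
        t_mid = (k + 0.5) * dt           # sample the velocity mid-step
        v = f(z, t_mid)                  # velocity of every particle
        energy += np.sum(v ** 2, axis=1) * dt
        z = z + dt * v                   # Euler update for (19b)
    return z, energy.mean()


def constant_speed(z, t):
    """Translate by 2 in every coordinate over [0, 1] at constant speed."""
    return np.full_like(z, 2.0)


def accelerating(z, t):
    """Same total displacement over [0, 1], but with speed growing linearly in time."""
    return np.full_like(z, 4.0 * t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.standard_normal((1000, 2))
    for field in (constant_speed, accelerating):
        z_T, cost = lagrangian_kinetic_energy(field, xs)
        print(field.__name__, "mean displacement:", (z_T - xs).mean(axis=0), "cost:", cost)
```

This matches the role of the kinetic energy term: among flows that reach the same terminal distribution, it favours particles that travel in straight lines at constant speed.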

Note that ρt is eliminated in this formulation. The terminal
condition (18d) is trivial to implement in Eulerian coordi-
nates (grid-based methods) but not so simple in Lagrangian
ones (19d) (grid-free methods). To enforce (19d) we intro-
duce a penalty term in the objective function that measures
the deviation of z(·, T)♯p from q. Thus, the penalized objective
function is
               <<FORMULA>>          (20)
where λ > 0 is the penalization strength. Next, we observe
that this objective function can be written as an expectation
with respect to x ∼ p. Indeed, the Kullback-Leibler di-
vergence is invariant under coordinate transformations, and
therefore

         <<FORMULA>>
              
                  <<FIGURE>>

Figure 7. Quality of FFJORD RNODE generated images on ImageNet-64.

               <<FIGURE>>

Figure 8. Quality of FFJORD RNODE generated images on CelebA-HQ. We use temperature annealing, as described in (Kingma &
Dhariwal, 2018), to generate visually appealing images, with T = 0.5, . . . , 1.

Table 2. Additional results and model statistics of FFJORD RNODE. Here we report validation bits/dim on both validation images, and on
validation images with uniform variational dequantization (i.e. perturbed by uniform noise). We also report the number of trainable model
parameters.
                          <<TABLE>>

<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>

                     A guide to convolution arithmetic for deep
                                      learning

                               The authors of this guide would like to thank David Warde-Farley,
                             Guillaume Alain and Caglar Gulcehre for their valuable feedback. We
                             are likewise grateful to all those who helped improve this tutorial with
                             helpful comments, constructive criticisms and code contributions. Keep
                             them coming!
                               Special thanks to Ethan Schoonover, creator of the Solarized color
                             scheme,¹ whose colors were used for the figures.

                                                    Feedback
                               Your feedback is welcomed! We did our best to be as precise,
                             informative and to the point as possible, but should there be anything you
                             feel might be an error or could be rephrased to be more precise or
                             comprehensible, please don’t refrain from contacting us. Likewise, drop us a
                             line if you think there is something that might fit this technical report
                             and you would like us to discuss – we will make our best effort to update
                             this document.

                                            Source code and animations
                               The code used to generate this guide along with its figures is available
                             on GitHub.² There the reader can also find an animated version of the
                             figures.


                      1 Introduction
                        1.1 Discrete convolutions
                        1.2 Pooling

                      2 Convolution arithmetic
                        2.1 No zero padding, unit strides
                        2.2 Zero padding, unit strides
                            2.2.1 Half (same) padding
                            2.2.2 Full padding
                        2.3 No zero padding, non-unit strides
                        2.4 Zero padding, non-unit strides

                      3 Pooling arithmetic

                      4 Transposed convolution arithmetic
                        4.1 Convolution as a matrix operation
                        4.2 Transposed convolution
                        4.3 No zero padding, unit strides, transposed
                        4.4 Zero padding, unit strides, transposed
                            4.4.1 Half (same) padding, transposed
                            4.4.2 Full padding, transposed
                        4.5 No zero padding, non-unit strides, transposed
                        4.6 Zero padding, non-unit strides, transposed

                      5 Miscellaneous convolutions
                        5.1 Dilated convolutions


                                                                        Chapter 1


                      Introduction


                      Deep convolutional neural networks (CNNs) have been at the heart of
                      spectacular advances in deep learning. Although CNNs have been used as early as
                      the nineties to solve character recognition tasks (Le Cun et al., 1997), their
                      current widespread application is due to much more recent work, when a deep CNN
                      was used to beat state-of-the-art in the ImageNet image classification challenge
                      (Krizhevsky et al., 2012).
                        Convolutional neural networks therefore constitute a very useful tool for
                      machine learning practitioners. However, learning to use CNNs for the first time
                      is generally an intimidating experience. A convolutional layer’s output shape
                      is affected by the shape of its input as well as the choice of kernel shape, zero
                      padding and strides, and the relationship between these properties is not
                      trivial to infer. This contrasts with fully-connected layers, whose output size is
                      independent of the input size. Additionally, CNNs also usually feature a pooling
                      stage, adding yet another level of complexity with respect to fully-connected
                      networks. Finally, so-called transposed convolutional layers (also known as
                      fractionally strided convolutional layers) have been employed in more and more
                      work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015;
                      Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship
                      with convolutional layers has been explained with various degrees of clarity.
                        This guide’s objective is twofold:

                        1. Explain the relationship between convolutional layers and transposed
                           convolutional layers.
                        2. Provide an intuitive understanding of the relationship between input shape,
                           kernel shape, zero padding, strides and output shape in convolutional,
                           pooling and transposed convolutional layers.

                        In order to remain broadly applicable, the results shown in this guide are
                      independent of implementation details and apply to all commonly used machine
                      learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012),
                      Torch (Collobert et al., 2011), TensorFlow (Abadi et al., 2015) and Caffe
                      (Jia et al., 2014).

                        This chapter briefly reviews the main building blocks of CNNs, namely
                      discrete convolutions and pooling. For an in-depth treatment of the subject, see
                      Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).


                      1.1 Discrete convolutions

                      The bread and butter of neural networks is affine transformations: a vector
                      is received as input and is multiplied with a matrix to produce an output (to
                      which a bias vector is usually added before passing the result through a
                      nonlinearity). This is applicable to any type of input, be it an image, a sound
                      clip or an unordered collection of features: whatever their dimensionality, their
                      representation can always be flattened into a vector before the transformation.
                        Images, sound clips and many other similar kinds of data have an intrinsic
                      structure. More formally, they share these important properties:

                         They are stored as multi-dimensional arrays.
                         They feature one or more axes for which ordering matters (e.g., width and
                          height axes for an image, time axis for a sound clip).
                         One axis, called the channel axis, is used to access different views of the
                          data (e.g., the red, green and blue channels of a color image, or the left
                          and right channels of a stereo audio track).

                        These properties are not exploited when an affine transformation is applied;
                      in fact, all the axes are treated in the same way and the topological information
                      is not taken into account. Still, taking advantage of the implicit structure of
                      the data may prove very handy in solving some tasks, like computer vision and
                      speech recognition, and in these cases it would be best to preserve it. This is
                      where discrete convolutions come into play.
                        A discrete convolution is a linear transformation that preserves this notion
                      of ordering. It is sparse (only a few input units contribute to a given output
                      unit) and reuses parameters (the same weights are applied to multiple locations
                      in the input).
                        Figure 1.1 provides an example of a discrete convolution. The light blue
                      grid is called the input feature map. To keep the drawing simple, a single input
                      feature map is represented, but it is not uncommon to have multiple feature
                      maps stacked one onto another. 1 A kernel (shaded area) slides across the input
                      feature map. At each location, the product between each element of the kernel
                      and the input element it overlaps is computed and the results are summed up to
                      obtain the output in the current location. The procedure can be repeated using
                      different kernels to form as many output feature maps as desired (Figure 1.3).
                      The final outputs of this procedure are called output feature maps. 2 If there
                      are multiple input feature maps, the kernel will have to be 3-dimensional (or,
                      equivalently, each one of the feature maps will be convolved with a distinct
                      kernel) and the resulting feature maps will be summed up elementwise to produce
                      the output feature map.

                                            <<FIGURE>>

                           Figure 1.1: Computing the output values of a discrete convolution.

                                            <<FIGURE>>

                           Figure 1.2: Computing the output values of a discrete convolution for N = 2,
                           i_1 = i_2 = 5, k_1 = k_2 = 3, s_1 = s_2 = 2, and p_1 = p_2 = 1.
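
                      The sliding procedure described above can be sketched in a few lines of plain
                      Python. This is a minimal illustration, not a framework implementation: the
                      function name conv2d and the input and kernel values are hypothetical, and the
                      sketch assumes a single input feature map, no zero padding and unit strides.

```python
def conv2d(feature_map, kernel):
    """Slide `kernel` over `feature_map` (both lists of lists).

    Assumes no zero padding and unit strides.
    """
    i_h, i_w = len(feature_map), len(feature_map[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for top in range(i_h - k_h + 1):
        row = []
        for left in range(i_w - k_w + 1):
            # Multiply each kernel element with the input element it
            # overlaps, then sum the products to get one output value.
            acc = 0
            for u in range(k_h):
                for v in range(k_w):
                    acc += kernel[u][v] * feature_map[top + u][left + v]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical values, chosen only for illustration.
feature_map = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]
print(conv2d(feature_map, kernel))  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

                      With several input feature maps, the same function would be applied once per
                      map with its own kernel and the results summed elementwise.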
                        The convolution depicted in Figure 1.1 is an instance of a 2-D convolution,
                      but it can be generalized to N-D convolutions. For instance, in a 3-D
                      convolution, the kernel would be a cuboid and would slide across the height,
                      width and depth of the input feature map.
                        The collection of kernels defining a discrete convolution has a shape
                      corresponding to some permutation of (n, m, k_1, ..., k_N), where

                          n = number of output feature maps,
                          m = number of input feature maps,
                          k_j = kernel size along axis j.

                        The following properties affect the output size o_j of a convolutional layer
                      along axis j:

                          i_j: input size along axis j,
                          k_j: kernel size along axis j,
                          s_j: stride (distance between two consecutive positions of the kernel)
                               along axis j,
                          p_j: zero padding (number of zeros concatenated at the beginning and at
                               the end of an axis) along axis j.

                      For instance, Figure 1.2 shows a 3x3 kernel applied to a 5x5 input padded
                      with a 1x1 border of zeros using 2x2 strides.
                        Note that strides constitute a form of subsampling. As an alternative to
                      being interpreted as a measure of how much the kernel is translated, strides can
                      also be viewed as how much of the output is retained. For instance, moving
                      the kernel by hops of two is equivalent to moving the kernel by hops of one but
                      retaining only odd output elements (Figure 1.4).
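
                      The subsampling view of strides can be checked numerically. The sketch below
                      is an assumption-laden illustration (hypothetical function name and values,
                      square inputs and kernels only): it convolves the same zero padded input once
                      with stride 2 and once with stride 1, then keeps every other output element of
                      the unit-stride result.

```python
def conv2d(x, w, stride=1, pad=0):
    """Convolve square input x with square kernel w (lists of lists)."""
    i, k = len(x), len(w)
    size = i + 2 * pad
    # Zero padding: `pad` zeros at the beginning and at the end of each axis.
    xp = [[0] * size for _ in range(size)]
    for r in range(i):
        for c in range(i):
            xp[r + pad][c + pad] = x[r][c]
    positions = range(0, size - k + 1, stride)  # top-left kernel positions
    return [
        [sum(w[u][v] * xp[r + u][c + v] for u in range(k) for v in range(k))
         for c in positions]
        for r in positions
    ]


# Hypothetical 5x5 input and 3x3 kernel, for illustration only.
x = [[(3 * r + 5 * c) % 7 for c in range(5)] for r in range(5)]
w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

strided = conv2d(x, w, stride=2, pad=1)
dense = conv2d(x, w, stride=1, pad=1)
# Keep the 1st, 3rd, 5th, ... outputs (indices 0, 2, 4 with 0-based indexing).
subsampled = [row[::2] for row in dense[::2]]

print(strided == subsampled)  # True
```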
                        1 An example of this is what was referred to earlier as channels for images and sound clips.
                        2 While there is a distinction between convolution and cross-correlation from a signal
                      processing perspective, the two become interchangeable when the kernel is learned. For the
                      sake of simplicity and to stay consistent with most of the machine learning literature, the
                      term convolution will be used in this guide.

                                            <<FIGURE>>

                      Figure 1.3: A convolution mapping from two input feature maps to three output
                      feature maps using a 3 x 2 x 3 x 3 collection of kernels w. In the left pathway,
                      input feature map 1 is convolved with kernel w_{1,1} and input feature map 2 is
                      convolved with kernel w_{1,2}, and the results are summed together elementwise
                      to form the first output feature map. The same is repeated for the middle and
                      right pathways to form the second and third feature maps, and all three output
                      feature maps are grouped together to form the output.

                                            <<FIGURE>>

                      Figure 1.4: An alternative way of viewing strides. Instead of translating the
                      3x3 kernel by increments of s = 2 (left), the kernel is translated by increments
                      of 1 and only one in s = 2 output elements is retained (right).


                      1.2 Pooling

                      In addition to discrete convolutions themselves, pooling operations make up
                      another important building block in CNNs. Pooling operations reduce the size
                      of feature maps by using some function to summarize subregions, such as taking
                      the average or the maximum value.
                        Pooling works by sliding a window across the input and feeding the content
                      of the window to a pooling function. In some sense, pooling works very much
                      like a discrete convolution, but replaces the linear combination described by the
                      kernel with some other function. Figure 1.5 provides an example for average
                      pooling, and Figure 1.6 does the same for max pooling.
                        The following properties affect the output size o_j of a pooling layer along
                      axis j:

                          i_j: input size along axis j,
                          k_j: pooling window size along axis j,
                          s_j: stride (distance between two consecutive positions of the pooling
                               window) along axis j.


                                                  <<FIGURE>>


                      Figure 1.5: Computing the output values of a 3x3 average pooling operation on a 5x5 input using 1x1 strides.

                                                  <<FIGURE>>


                      Figure 1.6: Computing the output values of a 3x3 max pooling operation on a 5x5 input using 1x1 strides.
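
                      Since pooling only swaps the kernel’s linear combination for another function,
                      a single sliding-window helper can cover both average and max pooling. The
                      sketch below is illustrative only: pool2d and mean are hypothetical names, the
                      input values are made up, and square inputs and windows are assumed.

```python
def pool2d(x, k, s, fn):
    """Slide a k x k window over square input x with stride s.

    The content of each window is flattened and passed to `fn`.
    """
    i = len(x)
    positions = range(0, i - k + 1, s)
    return [
        [fn([x[r + u][c + v] for u in range(k) for v in range(k)])
         for c in positions]
        for r in positions
    ]


def mean(values):
    return sum(values) / len(values)


# Hypothetical 5x5 input, for illustration only.
x = [
    [3, 3, 2, 1, 0],
    [0, 0, 1, 3, 1],
    [3, 1, 2, 2, 3],
    [2, 0, 0, 2, 2],
    [2, 0, 0, 0, 1],
]
max_pooled = pool2d(x, k=3, s=1, fn=max)
avg_pooled = pool2d(x, k=3, s=1, fn=mean)

print(max_pooled)  # [[3, 3, 3], [3, 3, 3], [3, 2, 3]]
```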


                                                 Chapter 2

                      Convolution arithmetic


                      The analysis of the relationship between convolutional layer properties is eased
                      by the fact that they don’t interact across axes, i.e., the choice of kernel size,
                      stride and zero padding along axis j only affects the output size of axis j.
                      Because of that, this chapter will focus on the following simplified setting:

                        2-D discrete convolutions (N = 2),
                        square inputs (i_1 = i_2 = i),
                        square kernel size (k_1 = k_2 = k),
                        same strides along both axes (s_1 = s_2 = s),
                        same zero padding along both axes (p_1 = p_2 = p).

                        This facilitates the analysis and the visualization, but keep in mind that the
                      results outlined here also generalize to the N-D and non-square cases.


                      2.1 No zero padding, unit strides

                      The simplest case to analyze is when the kernel just slides across every position
                      of the input (i.e., s = 1 and p = 0). Figure 2.1 provides an example for i = 4
                      and k = 3.
                        One way of defining the output size in this case is by the number of possible
                      placements of the kernel on the input. Let’s consider the width axis: the kernel
                      starts on the leftmost part of the input feature map and slides by steps of one
                      until it touches the right side of the input. The size of the output will be equal
                      to the number of steps made, plus one, accounting for the initial position of the
                      kernel (Figure 2.8a). The same logic applies for the height axis.
                        More formally, the following relationship can be inferred:

                          Relationship 1. For any i and k, and for s = 1 and p = 0,

                                                     o = (i - k) + 1.
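
                      The counting argument can be replayed in code. The sketch below (with the
                      hypothetical helper count_placements, and assuming k <= i) slides the kernel
                      one step at a time along a single axis and compares the count with the closed
                      form o = (i - k) + 1.

```python
def count_placements(i, k):
    """Count kernel positions along one axis for s = 1 and p = 0."""
    position, count = 0, 1      # the initial position is one placement
    while position + k < i:     # kernel has not yet touched the right side
        position += 1           # slide by one step
        count += 1
    return count


for i, k in [(4, 3), (5, 3), (7, 7), (10, 1)]:
    assert count_placements(i, k) == (i - k) + 1

print(count_placements(4, 3))  # 2, the setting of Figure 2.1
```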


                      2.2 Zero padding, unit strides

                      To factor in zero padding (i.e., only restricting to s = 1), let’s consider its
                      effect on the effective input size: padding with p zeros changes the effective
                      input size from i to i + 2p. In the general case, Relationship 1 can then be used
                      to infer the following relationship:

                          Relationship 2. For any i, k and p, and for s = 1,

                                                o = (i - k) + 2p + 1.

                      Figure 2.2 provides an example for i = 5, k = 4 and p = 2.
                        In practice, two specific instances of zero padding are used quite extensively
                      because of their respective properties. Let’s discuss them in more detail.

                      2.2.1 Half (same) padding
                      Having the output size be the same as the input size (i.e., o = i) can be a
                      desirable property:

                          Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ N),
                          s = 1 and p = ⌊k/2⌋ = n,

                                               o = i + 2 \lfloor k/2 \rfloor - (k - 1)
                                                 = i + 2n - 2n
                                                 = i.

                      This is sometimes referred to as half (or same) padding. Figure 2.3 provides an
                      example for i = 5, k = 3 and (therefore) p = 1.

                      2.2.2 Full padding
                      While convolving a kernel generally decreases the output size with respect to
                      the input size, sometimes the opposite is required. This can be achieved with
                      proper zero padding:

                          Relationship 4. For any i and k, and for p = k - 1 and s = 1,

                                                o = i + 2(k - 1) - (k - 1)
                                                  = i + (k - 1).

                      This is sometimes referred to as full padding, because in this setting every
                      possible partial or complete superimposition of the kernel on the input feature
                      map is taken into account. Figure 2.4 provides an example for i = 5, k = 3 and
                      (therefore) p = 2.


                                                <<FIGURE>>

                      Figure 2.1: (No padding, unit strides) Convolving a 3x3 kernel over a 4x4
                      input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0).


                                                <<FIGURE>>

                      Figure 2.2: (Arbitrary padding, unit strides) Convolving a 4x4 kernel over a
                      5x5 input padded with a 2x2 border of zeros using unit strides (i.e., i = 5,
                      k = 4, s = 1 and p = 2).


                                                <<FIGURE>>

                      Figure 2.3: (Half padding, unit strides) Convolving a 3x3 kernel over a 5x5
                      input using half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1).


                                                <<FIGURE>>

                      Figure 2.4: (Full padding, unit strides) Convolving a 3x3 kernel over a 5x5
                      input using full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2).
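
                      The three unit-stride padding cases can be compared side by side. The helper
                      below is a hypothetical sketch that applies the step-counting rule of
                      Relationship 1 to an effective input of size i + 2p, then evaluates it for the
                      settings quoted for Figures 2.2, 2.3 and 2.4.

```python
def output_size_unit_stride(i, k, p):
    """Output size along one axis for s = 1, using an effective input i + 2p."""
    return (i + 2 * p - k) + 1


arbitrary = output_size_unit_stride(i=5, k=4, p=2)   # Figure 2.2 setting
half = output_size_unit_stride(i=5, k=3, p=3 // 2)   # half padding, p = floor(k / 2)
full = output_size_unit_stride(i=5, k=3, p=3 - 1)    # full padding, p = k - 1

print(arbitrary, half, full)  # 6 5 7
```

                      Half padding returns the input size unchanged (5), while full padding grows it
                      by k - 1 (from 5 to 7).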


                      2.3 No zero padding, non-unit strides

                      All relationships derived so far only apply for unit-strided convolutions.
                      Incorporating non-unitary strides requires another inference leap. To facilitate
                      the analysis, let’s momentarily ignore zero padding (i.e., s > 1 and p = 0).
                      Figure 2.5 provides an example for i = 5, k = 3 and s = 2.
                        Once again, the output size can be defined in terms of the number of possible
                      placements of the kernel on the input. Let’s consider the width axis: the kernel
                      starts as usual on the leftmost part of the input, but this time it slides by steps
                      of size s until it touches the right side of the input. The size of the output is
                      again equal to the number of steps made, plus one, accounting for the initial
                      position of the kernel (Figure 2.8b). The same logic applies for the height axis.
                        From this, the following relationship can be inferred:

                          Relationship 5. For any i, k and s, and for p = 0,

                                               o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1.

                      The floor function accounts for the fact that sometimes the last possible step
                      does not coincide with the kernel reaching the end of the input, i.e., some input
                      units are left out (see Figure 2.7 for an example of such a case).


                      2.4 Zero padding, non-unit strides

                      The most general case (convolving over a zero padded input using non-unit
                      strides) can be derived by applying Relationship 5 on an effective input of size
                      i + 2p, in analogy to what was done for Relationship 2:

                          Relationship 6. For any i, k, p and s,

                                             o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1.

                      As before, the floor function means that in some cases a convolution will produce
                      the same output size for multiple input sizes. More specifically, if i + 2p - k is
                      a multiple of s, then any input size j = i + a, a ∈ {0, ..., s - 1} will produce
                      the same output size. Note that this ambiguity applies only for s > 1.
                        Figure 2.6 shows an example with i = 5, k = 3, s = 2 and p = 1, while
                      Figure 2.7 provides an example for i = 6, k = 3, s = 2 and p = 1. Interestingly,
                      despite having different input sizes these convolutions share the same output
                      size. While this doesn’t affect the analysis for convolutions, this will complicate
                      the analysis in the case of transposed convolutions.

                                            <<FIGURE>>

                                            <<FIGURE>>
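
                      The ambiguity introduced by the floor function is easy to reproduce. The
                      sketch below uses hypothetical helper names and assumes the general rule
                      o = floor((i + 2p - k) / s) + 1; it lists every input size that maps to a given
                      output size.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis for input i, kernel k, stride s, padding p."""
    return (i + 2 * p - k) // s + 1


def input_sizes_for(o, k, s, p, max_i=64):
    """All input sizes below max_i that produce output size o."""
    return [
        i for i in range(1, max_i)
        if i + 2 * p >= k and conv_output_size(i, k, s, p) == o
    ]


print(conv_output_size(5, 3, s=2, p=1))    # 3, the setting of Figure 2.6
print(conv_output_size(6, 3, s=2, p=1))    # 3, the setting of Figure 2.7
print(input_sizes_for(3, k=3, s=2, p=1))   # [5, 6]
print(input_sizes_for(3, k=3, s=1, p=1))   # [3], no ambiguity for s = 1
```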


                                            <<FIGURE>>

                      Figure 2.5: (No zero padding, arbitrary strides) Convolving a 3x3 kernel over
                      a 5x5 input using 2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0).

                                            <<FIGURE>>

                      Figure 2.6: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
                      5x5 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 5,
                      k = 3, s = 2 and p = 1).

                                            <<FIGURE>>

                      Figure 2.7: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
                      6x6 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 6,
                      k = 3, s = 2 and p = 1). In this case, the bottom row and right column of the
                      zero padded input are not covered by the kernel.

                      (a) The kernel has to slide two steps to the right to touch the right side of
                      the input (and equivalently downwards). Adding one to account for the initial
                      kernel position, the output size is 3x3.

                      (b) The kernel has to slide one step of size two to the right to touch the
                      right side of the input (and equivalently downwards). Adding one to account
                      for the initial kernel position, the output size is 2x2.

                                            <<FIGURE>>

                                     Figure 2.8: Counting kernel positions.


                                                 Chapter 3

                      Pooling arithmetic

                      In a neural network, pooling layers provide invariance to small translations of
                      the input. The most common kind of pooling is max pooling, which consists
                      in splitting the input in (usually non-overlapping) patches and outputting the
                      maximum value of each patch. Other kinds of pooling exist, e.g., mean or
                      average pooling, which all share the same idea of aggregating the input locally
                      by applying a non-linearity to the content of some patches (Boureau et al.,
                      2010a,b, 2011; Saxe et al., 2011).
                        Some readers may have noticed that the treatment of convolution arithmetic
                      only relies on the assumption that some function is repeatedly applied onto
                      subsets of the input. This means that the relationships derived in the previous
                      chapter can be reused in the case of pooling arithmetic. Since pooling does not
                      involve zero padding, the relationship describing the general case is as follows:

                          Relationship 7. For any i, k and s,

                                              o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1.

                      This relationship holds for any type of pooling.


                                                Chapter 4

                      Transposed convolution arithmetic

                        
                      The need for transposed convolutions generally arises from the desire to use a
                      transformation going in the opposite direction of a normal convolution, i.e., from
                      something that has the shape of the output of some convolution to something
                      that has the shape of its input while maintaining a connectivity pattern that
                      is compatible with said convolution. For instance, one might use such a
                      transformation as the decoding layer of a convolutional autoencoder or to project
                      feature maps to a higher-dimensional space.
                        Once again, the convolutional case is considerably more complex than the
                      fully-connected case, which only requires using a weight matrix whose shape has
                      been transposed. However, since every convolution boils down to an efficient
                      implementation of a matrix operation, the insights gained from the
                      fully-connected case are useful in solving the convolutional case.
                        As for convolution arithmetic, the discussion of transposed convolution
                      arithmetic is simplified by the fact that transposed convolution properties
                      don’t interact across axes.
                        The chapter will focus on the following setting:

                         2-D transposed convolutions (N = 2),
                         square inputs (i_1 = i_2 = i),
                         square kernel size (k_1 = k_2 = k),
                         same strides along both axes (s_1 = s_2 = s),
                         same zero padding along both axes (p_1 = p_2 = p).

                      Once again, the results outlined generalize to the N-D and non-square cases.


                      4.1 Convolution as a matrix operation

                      Take for example the convolution represented in Figure 2.1. If the input and
                      output were to be unrolled into vectors from left to right, top to bottom, the
                      convolution could be represented as a sparse matrix C where the non-zero
                      elements are the elements w_{i,j} of the kernel (with i and j being the row and
                      column of the kernel respectively):
                    
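C = \begin{pmatrix}
w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 \\
0 & 0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2}
\end{pmatrix}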

                        This linear operation takes the input matrix flattened as a 16-dimensional
                      vector and produces a 4-dimensional vector that is later reshaped as the 2x2
                      output matrix.
                        Using this representation, the backward pass is easily obtained by
                      transposing C; in other words, the error is backpropagated by multiplying the
                      loss with C^T. This operation takes a 4-dimensional vector as input and produces
                      a 16-dimensional vector as output, and its connectivity pattern is compatible
                      with C by construction.
                        Notably, the kernel w defines both the matrices C and C^T used for the
                      forward and backward passes.
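
                      To make the matrix view concrete, the sketch below builds C for a 3x3 kernel
                      on a 4x4 input and applies C and its transpose to vectors. It is an
                      illustration only: conv_matrix, matvec and transpose are hypothetical helpers,
                      the kernel values are made up, and unit strides with no padding are assumed.

```python
def conv_matrix(w, i):
    """Build the sparse matrix C for a unit-stride, unpadded convolution.

    Input and output are unrolled left to right, top to bottom.
    """
    k = len(w)
    o = i - k + 1
    C = [[0] * (i * i) for _ in range(o * o)]
    for a in range(o):                  # output row
        for b in range(o):              # output column
            for u in range(k):          # kernel row
                for v in range(k):      # kernel column
                    C[a * o + b][(a + u) * i + (b + v)] = w[u][v]
    return C


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # hypothetical kernel values
C = conv_matrix(w, i=4)

x = list(range(16))                      # flattened 4x4 input
y = matvec(C, x)                         # forward pass: 16 -> 4
g = matvec(transpose(C), [1, 0, 0, 0])   # multiplying with C^T: 4 -> 16

print(len(C), len(C[0]))  # 4 16
print(y)                  # [303, 348, 483, 528]
print(g)                  # first row of C
```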


                      4.2 Transposed convolution

                      Let’s now consider what would be required to go the other way around, i.e.,
                      map from a 4-dimensional space to a 16-dimensional space, while keeping the
                      connectivity pattern of the convolution depicted in Figure 2.1. This operation
                      is known as a transposed convolution.
                        Transposed convolutions (also called fractionally strided convolutions or
                      deconvolutions 1) work by swapping the forward and backward passes of a
                      convolution. One way to put it is to note that the kernel defines a convolution,
                      but whether it’s a direct convolution or a transposed convolution is determined
                      by how the forward and backward passes are computed.
                        For instance, although the kernel w defines a convolution whose forward and
                      backward passes are computed by multiplying with C and C^T respectively, it
                      also defines a transposed convolution whose forward and backward passes are
                      computed by multiplying with C^T and (C^T)^T = C respectively. 2
                        Finally note that it is always possible to emulate a transposed convolution
                      with a direct convolution. The disadvantage is that it usually involves adding
                      many columns and rows of zeros to the input, resulting in a much less efficient
                      implementation.
                        Building on what has been introduced so far, this chapter will proceed
                      somewhat backwards with respect to the convolution arithmetic chapter, deriving
                      the properties of each transposed convolution by referring to the direct
                      convolution with which it shares the kernel, and defining the equivalent direct
                      convolution.

                        1 The term “deconvolution” is sometimes used in the literature, but we advocate against it
                      on the grounds that a deconvolution is mathematically defined as the inverse of a convolution,
                      which is different from a transposed convolution.
                        2 The transposed convolution operation can be thought of as the gradient of some
                      convolution with respect to its input, which is usually how transposed convolutions are
                      implemented in practice.


                      4.3 No zero padding, unit strides, transposed

                      The simplest way to think about a transposed convolution on a given input is
                      to imagine such an input as being the result of a direct convolution applied on
                      some initial feature map. The transposed convolution can be then considered as
                      the operation that allows to recover the shape 3 of this initial feature map.
                        Let’s consider the convolution of a 3x3 kernel on a 4x4 input with unitary
                      stride  and  no padding (i.e.,i= 4,k= 3,s= 1  and  p= 0). As depicted in
                      Figure 2.1, this produces a 2x2 output. The transpose of this convolution will
                      then have an output of shape 4x4 when applied on a 2x2 input.
                        Another way to obtain the result of a transposed convolution is to apply an
                      equivalent – but much less eﬃcient – direct convolution. The example described
                      so far could be tackled by convolving a 3x3 kernel over a 2x2 input padded
                      with a 2x2 border of zeros using unit strides (i.e.,i0 = 2,k0 =k,s0 = 1 and 
                      p0 = 2), as shown in Figure 4.1. Notably, the kernel’s  and  stride’s sizes remain
                      the same, but the input of the transposed convolution is now zero padded. 4
                        One way to understand  the logic behind zero padding is to consider the
                      connectivity pattern of the transposed convolution  and  use it to guide the design
                      of the equivalent convolution. for   example, the top left pixel of the input of the
                      direct convolution only contribute to the top left pixel of the output, the top
                      right pixel is only connected to the top right output pixel,  and  so on.
                        To maintain the same connectivity pattern in the equivalent convolution it is
                      necessary to zero pad the input in such a way that the ﬁrst (top-left) application
                      of the kernel only touches the top-left pixel, i.e., the padding has to be equal to
                      the size of the kernel minus one.
                        Proceeding in the same fashion it is possible to determine similar observa-
                      tions for   the other elements of the image, giving rise to the following relationship:
                        3 Note that the transposed convolution does not guarantee to recover the input itself, as it
                      is not defined as the inverse of the convolution, but rather just returns a feature map that has
                      the same width and height.
                        4 Note that although equivalent to applying the transposed matrix, this visualization adds
                      a lot of zero multiplications in the form of zero padding. This is done here for illustration
                      purposes, but it is inefficient, and software implementations will normally not perform the
                      useless zero multiplications.

                          Relationship 8. A convolution described by s = 1, p = 0 and k
                          has an associated transposed convolution described by k' = k, s' = s
                          and p' = k - 1 and its output size is

                                            <<FORMULA>>

                        Interestingly, this corresponds to a fully padded convolution with unit strides.
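The equivalence behind Relationship 8 can be checked numerically in one dimension. The sketch below is illustrative only: the helper names `conv_matrix_1d` and `transposed_via_full_padding` are hypothetical, and it assumes the convolution is implemented as a cross-correlation, in which case the equivalent convolution uses the flipped kernel. It builds the matrix C of a non-padded, unit-stride convolution and compares multiplying by its transpose with convolving an input padded by k - 1 zeros on each side.

```python
import numpy as np

def conv_matrix_1d(w, i):
    """Matrix C of a non-padded, unit-stride 1-D convolution (cross-correlation)."""
    k = len(w)
    o = i - k + 1                      # output size for s = 1, p = 0
    C = np.zeros((o, i))
    for r in range(o):
        C[r, r:r + k] = w              # each row holds the kernel at one position
    return C

def transposed_via_full_padding(y, w):
    """Equivalent convolution: pad with k - 1 zeros per side, use the flipped kernel."""
    k = len(w)
    y_padded = np.pad(y, k - 1)        # zero padding p' = k - 1
    return np.correlate(y_padded, w[::-1], mode="valid")

# 1-D analogue of Figure 4.1: i = 4, k = 3, so the transposed input has size i' = 2.
rng = np.random.default_rng(0)
i, k = 4, 3
w = rng.normal(size=k)
y = rng.normal(size=i - k + 1)

via_matrix = conv_matrix_1d(w, i).T @ y          # multiply by the transposed matrix
via_padding = transposed_via_full_padding(y, w)  # fully padded unit-stride convolution
```

Both routes return a vector of size i = 4 with matching entries. The padded route performs the zero multiplications that footnote 4 describes as wasteful, so it serves as an illustration rather than an efficient implementation.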


                      4.4 Zero padding, unit strides, transposed

                      Knowing that the transpose of a non-padded convolution is equivalent to
                      convolving a zero padded input, it would be reasonable to suppose that the
                      transpose of a zero padded convolution is equivalent to convolving an input
                      padded with fewer zeros.
                        It is indeed the case, as shown in Figure 4.2 for i = 5, k = 4 and p = 2.
                        Formally, the following relationship applies for zero padded convolutions:

                          Relationship 9. A convolution described by s = 1, k and p has an
                          associated transposed convolution described by k' = k, s' = s and
                          p' = k - p - 1 and its output size is

                                           <<FORMULA>>
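As a quick arithmetic check of Relationship 9, the output size of the transposed convolution can be obtained by applying the usual direct-convolution output formula to the equivalent convolution with padding p' = k - p - 1. The function names below are hypothetical, and the sketch assumes the standard relation o = floor((i + 2p - k) / s) + 1 for a direct convolution.

```python
def direct_output_size(i, k, p, s=1):
    """Output size of a direct convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def transposed_output_size_unit_stride(i_prime, k, p):
    """Output size of the transpose of a unit-stride convolution with padding p.

    Computed through the equivalent convolution, which keeps k' = k and
    s' = 1 and uses padding p' = k - p - 1.
    """
    p_prime = k - p - 1
    return direct_output_size(i_prime, k, p_prime, s=1)

# Figure 4.1: direct convolution with i = 4, k = 3, p = 0 gives i' = 2.
fig_4_1 = transposed_output_size_unit_stride(i_prime=2, k=3, p=0)
# Figure 4.2: direct convolution with i = 5, k = 4, p = 2 gives i' = 6.
fig_4_2 = transposed_output_size_unit_stride(i_prime=6, k=4, p=2)
```

In both cases the transposed convolution returns the input size of the direct convolution (4 and 5 respectively), which is what the figures show.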

                      4.4.1 Half (same) padding, transposed
                      By applying the same inductive reasoning as before, it is reasonable to expect
                      that the equivalent convolution of the transpose of a half padded convolution
                      is itself a half padded convolution, given that the output size of a half padded
                      convolution is the same as its input size. Thus the following relation applies:

                          Relationship 10. A convolution described by k = 2n + 1, n ∈ N,
                          s = 1 and p = ⌊k/2⌋ = n has an associated transposed convolution
                          described by k' = k, s' = s and p' = p and its output size is

                                           <<FORMULA>>


                                           <<FIGURE>>

                        Figure 4.3 provides an example for i = 5, k = 3 and (therefore) p = 1.

                      4.4.2 Full padding, transposed
                      Knowing that the equivalent convolution of the transpose of a non-padded con-
                      volution involves full padding, it is unsurprising that the equivalent of the trans-
                      pose of a fully padded convolution is a non-padded convolution:

                                          <<FIGURE>>

                      Figure 4.1: The transpose of convolving a 3x3 kernel over a 4x4 input using
                      unit strides (i.e., i = 4, k = 3, s = 1 and p = 0). It is equivalent to convolving
                      a 3x3 kernel over a 2x2 input padded with a 2x2 border of zeros using unit
                      strides (i.e., i' = 2, k' = k, s' = 1 and p' = 2).

                                          <<FIGURE>>

                      Figure 4.2: The transpose of convolving a 4x4 kernel over a 5x5 input padded
                      with a 2x2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and
                      p = 2). It is equivalent to convolving a 4x4 kernel over a 6x6 input padded
                      with a 1x1 border of zeros using unit strides (i.e., i' = 6, k' = k, s' = 1 and
                      p' = 1).

                                           <<FIGURE>>

                      Figure 4.3: The transpose of convolving a 3x3 kernel over a 5x5 input using
                      half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1). It is
                      equivalent to convolving a 3x3 kernel over a 5x5 input using half padding
                      and unit strides (i.e., i' = 5, k' = k, s' = 1 and p' = 1).


                          Relationship 11. A convolution described by s = 1, k and p = k - 1
                          has an associated transposed convolution described by k' = k, s' = s
                          and p' = 0 and its output size is

                                          <<FIGURE>>

                        Figure 4.4 provides an example for i = 5, k = 3 and (therefore) p = 2.
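The half and full padding cases can be worked out as special cases of the zero padded relationship. The derivation below is a sketch that assumes the unit-stride output formula o = i + 2p - k + 1 applied to the equivalent convolution with p' = k - p - 1; the symbols follow the relationships above.

```latex
\begin{align*}
o' &= i' + 2p' - k + 1 = i' + (k - 1) - 2p, \qquad p' = k - p - 1 \\
\text{half padding } (k = 2n + 1,\ p = n):\quad o' &= i' + 2n - 2n = i' \\
\text{full padding } (p = k - 1):\quad o' &= i' + (k - 1) - 2(k - 1) = i' - (k - 1)
\end{align*}
```

With i' = 5 and k = 3, the half padded case gives o' = 5, as in Figure 4.3. With i' = 7 and k = 3, the full padded case gives o' = 5, as in Figure 4.4.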


                      4.5 No zero padding, non-unit strides, transposed

                      Using the same kind of inductive logic as for zero padded convolutions, one
                      might expect that the transpose of a convolution with s > 1 involves an
                      equivalent convolution with s < 1. As will be explained, this is a valid intuition,
                      which is why transposed convolutions are sometimes called fractionally strided
                      convolutions.
                        Figure 4.5 provides an example for i = 5, k = 3 and s = 2 which helps
                      understand what fractional strides involve: zeros are inserted between input
                      units, which makes the kernel move around at a slower pace than with unit
                      strides (see footnote 5).
                        For the moment, it will be assumed that the convolution is non-padded
                      (p = 0) and that its input size i is such that i - k is a multiple of s. In that
                      case, the following relationship holds:

                          Relationship 12. A convolution described by p = 0, k and s and
                          whose input size is such that i - k is a multiple of s, has an associated
                          transposed convolution described by ĩ', k' = k, s' = 1 and p' = k - 1,
                          where ĩ' is the size of the stretched input obtained by adding s - 1
                          zeros between each input unit, and its output size is

                                            <<FORMULA>>
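The stretched input of Relationship 12 can be illustrated with a small one-dimensional experiment. This is a sketch under stated assumptions: `stretch` and `strided_conv_matrix_1d` are hypothetical helper names, the convolution is taken to be a cross-correlation, and the equivalent convolution therefore uses the flipped kernel. It uses the 1-D analogue of Figure 4.5 (i = 5, k = 3, s = 2).

```python
import numpy as np

def stretch(y, s):
    """Insert s - 1 zeros between consecutive input units."""
    stretched = np.zeros((len(y) - 1) * s + 1)
    stretched[::s] = y
    return stretched

def strided_conv_matrix_1d(w, i, s):
    """Matrix of a non-padded 1-D convolution with stride s (i - k a multiple of s)."""
    k = len(w)
    o = (i - k) // s + 1
    C = np.zeros((o, i))
    for r in range(o):
        C[r, r * s:r * s + k] = w      # the kernel moves s positions per output unit
    return C

i, k, s = 5, 3, 2
rng = np.random.default_rng(0)
w = rng.normal(size=k)
C = strided_conv_matrix_1d(w, i, s)
y = rng.normal(size=C.shape[0])        # i' = 2 input units of the transposed convolution

y_tilde = stretch(y, s)                # stretched input of size 3
via_matrix = C.T @ y
# Unit-stride convolution over the stretched input padded with p' = k - 1 zeros.
via_stretch = np.correlate(np.pad(y_tilde, k - 1), w[::-1], mode="valid")
```

The two results agree and have size 5. The unit-stride kernel advances by one position over an input in which every other unit is zero, which is the sense in which the stride is "fractional".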

                      4.6 Zero padding, non-unit strides, transposed

                      When the convolution's input size i is such that i + 2p - k is a multiple of s,
                      the analysis can be extended to the zero padded case by combining Relationship 9
                      and Relationship 12:
                        5 Doing so is inefficient and real-world implementations avoid useless multiplications by
                      zero, but conceptually it is how the transpose of a strided convolution can be thought of.
 
                                          <<FIGURE>> 

                      Figure 4.4: The transpose of convolving a 3x3 kernel over a 5x5 input using
                      full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2). It is equivalent
                      to convolving a 3x3 kernel over a 7x7 input using unit strides (i.e., i' = 7,
                      k' = k, s' = 1 and p' = 0).

                                          <<FIGURE>>

                      Figure 4.5: The transpose of convolving a 3x3 kernel over a 5x5 input using
                      2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0). It is equivalent to convolving
                      a 3x3 kernel over a 2x2 input (with 1 zero inserted between inputs) padded
                      with a 2x2 border of zeros using unit strides (i.e., i' = 2, ĩ' = 3, k' = k, s' = 1
                      and p' = 2).

                                        <<FIGURE>>

                      Figure 4.6: The transpose of convolving a 3x3 kernel over a 5x5 input padded
                      with a 1x1 border of zeros using 2x2 strides (i.e., i = 5, k = 3, s = 2 and
                      p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
                      1 zero inserted between inputs) padded with a 1x1 border of zeros using unit
                      strides (i.e., i' = 3, ĩ' = 5, k' = k, s' = 1 and p' = 1).


                          Relationship 13. A convolution described by k, s and p and whose
                          input size i is such that i + 2p - k is a multiple of s has an associated
                          transposed convolution described by ĩ', k' = k, s' = 1 and p' =
                          k - p - 1, where ĩ' is the size of the stretched input obtained by
                          adding s - 1 zeros between each input unit, and its output size is

                                          <<FORMULA>>


                                          <<FIGURE>>

                        Figure 4.6 provides an example for i = 5, k = 3, s = 2 and p = 1.
                        The constraint on the size of the input i can be relaxed by introducing
                      another parameter a ∈ {0, ..., s - 1} that makes it possible to distinguish
                      between the s different cases that all lead to the same i':

                          Relationship 14. A convolution described by k, s and p has an
                          associated transposed convolution described by a, ĩ', k' = k, s' = 1
                          and p' = k - p - 1, where ĩ' is the size of the stretched input obtained
                          by adding s - 1 zeros between each input unit, and a = (i + 2p - k)
                          mod s represents the number of zeros added to the bottom and right
                          edges of the input, and its output size is

                                         <<FORMULA>>


                                         <<FIGURE>>

                        Figure 4.7 provides an example for i = 6, k = 3, s = 2 and p = 1.
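The parameters named in Relationship 14 can be computed directly from the direct convolution's settings. The sketch below uses a hypothetical function name and derives the output size by applying the unit-stride output formula to the stretched input, padded with p' zeros per side plus a extra zeros on the bottom and right edges. It is meant as an arithmetic check against Figures 4.6 and 4.7, not as a reference implementation.

```python
def transposed_conv_parameters(i, k, s, p):
    """Return (a, i_prime, i_tilde, p_prime, o_prime) for a direct convolution (i, k, s, p)."""
    a = (i + 2 * p - k) % s                      # zeros added to the bottom and right edges
    i_prime = (i + 2 * p - k) // s + 1           # output size of the direct convolution
    i_tilde = i_prime + (i_prime - 1) * (s - 1)  # stretched size: s - 1 zeros between units
    p_prime = k - p - 1                          # padding of the equivalent convolution
    # Unit-stride convolution over the stretched input with padding p' and a extra zeros.
    o_prime = i_tilde + 2 * p_prime + a - k + 1
    return a, i_prime, i_tilde, p_prime, o_prime

fig_4_6 = transposed_conv_parameters(i=5, k=3, s=2, p=1)   # (0, 3, 5, 1, 5)
fig_4_7 = transposed_conv_parameters(i=6, k=3, s=2, p=1)   # (1, 3, 5, 1, 6)
```

Both inputs lead to the same i' = 3 and the same stretched size of 5. Only a differs, and it is what lets the transposed convolution return an output of size 5 in one case and 6 in the other.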

                                        <<FIGURE>>

                      Figure 4.7: The transpose of convolving a 3x3 kernel over a 6x6 input padded
                      with a 1x1 border of zeros using 2x2 strides (i.e., i = 6, k = 3, s = 2 and
                      p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
                      1 zero inserted between inputs) padded with a 1x1 border of zeros (with an
                      additional border of size 1 added to the bottom and right edges) using unit
                      strides (i.e., i' = 3, ĩ' = 5, a = 1, k' = k, s' = 1 and p' = 1).


                                                 Chapter 5


                      Miscellaneous convolutions

                      5.1 Dilated convolutions

                      Readers familiar with the deep learning literature may have noticed the term
                      “dilated convolutions” (or “atrous convolutions”, from the French expression
                      convolutions à trous) appear in recent papers. Here we attempt to provide an
                      intuitive understanding of dilated convolutions. For a more in-depth description
                      and to understand in what contexts they are applied, see Chen et al. (2014); Yu
                      and Koltun (2015).
                        Dilated convolutions “inflate” the kernel by inserting spaces between the
                      kernel elements. The dilation “rate” is controlled by an additional hyperparameter
                      d. Implementations may vary, but there are usually d - 1 spaces inserted between
                      kernel elements such that d = 1 corresponds to a regular convolution.
                        Dilated convolutions are used to cheaply increase the receptive field of output
                      units without increasing the kernel size, which is especially effective when
                      multiple dilated convolutions are stacked one after another. For a concrete example,
                      see Oord et al. (2016), in which the proposed WaveNet model implements an
                      autoregressive generative model for raw audio which uses dilated convolutions
                      to condition new audio frames on a large context of past audio frames.
                        To understand the relationship tying the dilation rate d and the output size
                      o, it is useful to think of the impact of d on the effective kernel size. A kernel
                      of size k dilated by a factor d has an effective size

                                          <<FORMULA>>

                      This can be combined with Relationship 6 to form the following relationship for
                      dilated convolutions:

                          Relationship 15. For any i, k, p and s, and for a dilation rate d,

                                       <<FORMULA>>


                                        <<FIGURE>>
                      Figure 5.1: (Dilated convolution) Convolving a 3x3 kernel over a 7x7 input
                      with a dilation factor of 2 (i.e., i = 7, k = 3, d = 2, s = 1 and p = 0).


                      Figure 5.1 provides an example for i = 7, k = 3 and d = 2.
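The effective kernel size and the resulting output size can be sketched as follows. The function names are hypothetical. The sketch assumes d - 1 zeros are inserted between kernel elements, as described above, and that the output size follows the usual floor((i + 2p - k) / s) + 1 relation with k replaced by the effective kernel size.

```python
import numpy as np

def effective_kernel_size(k, d):
    """Size of a kernel of size k once d - 1 spaces separate its elements."""
    return k + (k - 1) * (d - 1)

def dilated_output_size(i, k, p, s, d):
    """Output size of a dilated convolution, using the effective kernel size."""
    k_eff = effective_kernel_size(k, d)
    return (i + 2 * p - k_eff) // s + 1

def dilate_kernel_1d(w, d):
    """Explicitly inflate a 1-D kernel by inserting d - 1 zeros between its elements."""
    dilated = np.zeros(effective_kernel_size(len(w), d))
    dilated[::d] = w
    return dilated

# Figure 5.1: i = 7, k = 3, d = 2, s = 1, p = 0.
k_eff = effective_kernel_size(k=3, d=2)
o = dilated_output_size(i=7, k=3, p=0, s=1, d=2)
inflated = dilate_kernel_1d(np.array([1.0, 2.0, 3.0]), d=2)
```

For the setting of Figure 5.1 the kernel covers 5 input units per axis, so the output has size 3, and the inflated kernel reads [1, 0, 2, 0, 3]. With d = 1 the functions reduce to the regular convolution case.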


                                                  Bibliography


                      Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
                       G. S., Davis, A., Dean, J., Devin, M., et al. (2015). TensorFlow: Large-scale
                       machine learning on heterogeneous systems. Software available from
                       tensorflow.org.
                      Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron,
                       A., Bouchard, N., Warde-Farley, D., and Bengio, Y. (2012). Theano: new
                       features and speed improvements. arXiv preprint arXiv:1211.5590.
                      Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins,
                       G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: A cpu and
                       gpu math compiler in python. In Proc. 9th Python in Science Conf, pages
                       1–7.
                      Boureau, Y., Bach, F., LeCun, Y., and Ponce, J. (2010a). Learning mid-level
                       features for recognition. In Proc. International Conference on Computer
                       Vision and Pattern Recognition (CVPR'10). IEEE.
                      Boureau, Y., Ponce, J., and LeCun, Y. (2010b). A theoretical analysis of feature
                       pooling in vision algorithms. In Proc. International Conference on Machine
                       learning (ICML'10).
                      Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the
                       locals: multi-way local pooling for image recognition. In Proc. International
                       Conference on Computer Vision (ICCV'11). IEEE.
                      Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014).
                       Semantic image segmentation with deep convolutional nets and fully
                       connected crfs. arXiv preprint arXiv:1412.7062.
                      Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A matlab-like
                       environment for machine learning. In BigLearn, NIPS Workshop, number
                       EPFL-CONF-192376.
                      Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. Book in
                       preparation for MIT Press.


                      Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. (2016). Generating images
                       with recurrent adversarial networks. arXiv preprint arXiv:1602.05110.
                      Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.,
                       Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast
                       feature embedding. In Proceedings of the ACM International Conference on
                       Multimedia, pages 675–678. ACM.
                      Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification
                       with deep convolutional neural networks. In Advances in neural information
                       processing systems, pages 1097–1105.
                      Le Cun, Y., Bottou, L., and Bengio, Y. (1997). Reading checks with multilayer
                       graph transformer networks. In Acoustics, Speech, and Signal Processing,
                       1997. ICASSP-97., 1997 IEEE International Conference on, volume 1, pages
                       151–154. IEEE.
                      Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for
                       semantic segmentation. In Proceedings of the IEEE Conference on Computer
                       Vision and Pattern Recognition, pages 3431–3440.
                      Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A.,
                       Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A
                       generative model for raw audio. arXiv preprint arXiv:1609.03499.
                      Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation
                       learning with deep convolutional generative adversarial networks. arXiv
                       preprint arXiv:1511.06434.
                      Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011).
                       On random weights and unsupervised feature learning. In L. Getoor and
                       T. Scheffer, editors, Proceedings of the 28th International Conference on
                       Machine Learning (ICML-11), ICML '11, pages 1089–1096, New York, NY, USA.
                       ACM.
                      Visin, F., Kastner, K., Courville, A. C., Bengio, Y., Matteucci, M., and Cho,
                       K. (2015). Reseg: A recurrent neural network for object segmentation.
                      Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated
                       convolutions. arXiv preprint arXiv:1511.07122.
                      Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding
                       convolutional networks. In Computer vision–ECCV 2014, pages 818–833. Springer.
                      Zeiler, M. D., Taylor, G. W., and Fergus, R. (2011). Adaptive deconvolutional
                       networks for mid and high level feature learning. In Computer Vision (ICCV),
                       2011 IEEE International Conference on, pages 2018–2025. IEEE.

<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>

                  A Survey of Model Compression and Acceleration for Deep Neural Networks

                 Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE

        IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION)

        Yu Cheng is a Researcher from Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA.
        Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing
        100084, China.
        Pan Zhou is with the School of Electronic Information and Communications, Huazhong University
        of Science and Technology, Wuhan 430074, China.

         Abstract: Deep convolutional neural networks (CNNs) have recently achieved great success in
        many visual recognition tasks. However, existing deep neural network models are computationally
        expensive and memory intensive, hindering their deployment in devices with low memory resources
        or in applications with strict latency requirements. Therefore, a natural thought is to [...]
        without significantly decreasing the model performance. During the past few years, tremendous
        progress has been made in this area. In this paper, we survey the recent advanced techniques
        developed for compacting and accelerating CNN models. These techniques are roughly categorized
        into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact
        convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing
        will be described at the beginning, after that the other techniques will be introduced. For
        each scheme, we provide insightful analysis regarding the performance, related applications,
        advantages, and drawbacks etc. Then we will go through a few very recent additional successful
        methods, for example, dynamic capacity networks and stochastic depths networks. After that, we
        survey the evaluation matrix, the main datasets used for evaluating the model performance and
        recent benchmarking efforts. Finally, we conclude this paper, discuss remaining challenges and
        possible directions on this topic.
         Index Terms: Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration,

                                        I. INTRODUCTION

         In recent years, deep neural networks have received lots of attention, been applied to
        different applications and achieved dramatic accuracy improvements in many tasks. These works
        rely on deep networks with millions or even billions of parameters, and the availability of
        GPUs with very high computation capability plays a key role in their success. For example, the
        work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge
        using a network containing 60 million parameters with five convolutional layers and three
        fully-connected layers. Usually, it takes two to three days to train the whole model on the
        ImageNet dataset with a NVIDIA K40 machine. As another example, the top face verification
        results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing
        hundreds of millions of parameters, using a mix of convolutional, locally-connected, and
        fully-connected layers [2], [3]. It is also very time-consuming to train such a model to get
        reasonable performance. In architectures that rely only on fully-connected layers, the number
        of parameters can grow to billions [4].
         As larger neural networks with more layers and nodes [...] becomes critical, especially for
        some real-time applications such as online learning and incremental learning. In addition,
        recent years witnessed significant progress in virtual reality, augmented reality, and smart
        wearable devices, creating unprecedented opportunities for researchers to tackle fundamental
        challenges in deploying deep learning systems to portable devices with limited resources (e.g.
        memory, CPU, energy, bandwidth). Efficient deep learning methods can have significant impacts
        on distributed systems, embedded devices, and FPGA for Artificial Intelligence. For example,
        the ResNet-50 [5] with 50 convolutional layers needs over 95MB memory for storage and over 3.8
        billion floating number multiplications when processing an image. After discarding some
        redundant weights, the network still works as usual but saves more than 75% of parameters and
        50% computational time. For devices like cell phones and FPGAs with only several megabyte
        resources, how to compact the models used on them is also important.
         Achieving these goals calls for joint solutions from many disciplines, including but not
        limited to machine learning, optimization, computer architecture, data compression, indexing,
        and hardware design. In this paper, we review recent works on compressing and accelerating
        deep neural networks, which attracted a lot of attention from the deep learning community and
        already achieved lots of progress in the past years.
         We classify these approaches into four categories: parameter pruning and sharing, low-rank
        factorization, transferred/compact convolutional filters, and knowledge distillation. The
        parameter pruning and sharing based methods explore the redundancy in the model parameters and
        try to remove the redundant and uncritical ones. Low-rank factorization based techniques use
        matrix/tensor decomposition to estimate the informative parameters of the deep CNNs. The
        approaches based on transferred/compact convolutional filters design special structural
        convolutional filters to reduce the parameter space and save storage/computation. The
        knowledge distillation methods learn a distilled model and train a more compact neural network
        to reproduce the output of a larger network.
         In Table I, we briefly summarize these four types of methods. Generally, the parameter
        pruning & sharing, low-rank factorization and knowledge distillation approaches can


                                                TABLE I

                                            <<TABLE>>

        be used in DNN models with fully connected layers and convolutional layers, achieving
        comparable performances. On the other hand, methods using transferred/compact filters are
        designed for models with convolutional layers only. Low-rank factorization and
        transferred/compact filters based approaches provide an end-to-end pipeline and can be easily
        implemented in CPU/GPU environment, which is straightforward, while parameter pruning &
        sharing use different methods such as vector quantization, binary coding and sparse
        constraints to perform the task. Generally it will take several steps to achieve the goal.
         Regarding the training protocols, models based on parameter pruning/sharing and low-rank
        factorization can be extracted from pre-trained ones or trained from scratch, while the
        transferred/compact filter and knowledge distillation models only support training from
        scratch. These methods are independently designed and complement each other. For example,
        transferred layers and parameter pruning & sharing can be used together, and model
        quantization & binarization can be used together with low-rank approximations to achieve
        further speedup. We will describe the details of each theme, their properties, strengths and
        drawbacks in the following sections.

              II. PARAMETER PRUNING AND SHARING

         Early works showed that network pruning is effective in reducing the network complexity and
        addressing the over-fitting problem [6]. Pruning was originally introduced to reduce the
        structure in neural networks and hence improve generalization; it has since been widely
        studied as a way to compress DNN models by removing parameters which are not crucial to the
        model performance. These techniques can be further classified into three sub-categories:
        quantization and binarization, parameter sharing, and structural matrix.

        A. Quantization and Binarization
         Network quantization compresses the original network by reducing the number of bits required
        to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar
        quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of
        the parameters can result in significant speed-up with minimal loss of accuracy. The work in
        [9] used 16-bit fixed-point representation in stochastic rounding based CNN training, which
        significantly reduced memory usage and floating point operations with little loss in
        classification accuracy.
         The method proposed in [10] quantized the link weights using weight sharing and then applied
        Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As
        shown in Figure 1, it started by learning the connectivity via normal network training,
        followed by pruning the small-weight connections. Finally, the network was retrained to learn
        the final weights for the remaining sparse connections. This work achieved the
        state-of-the-art performance among all parameter quantization based methods. It was shown in
        [11] that Hessian weight could be used to measure the importance of network parameters, and
        that work proposed to minimize the average Hessian-weighted quantization error when clustering
        network parameters.

                                <<FIGURE>>
         Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and
        encoding. The input is the original model and the output is the compression model.

         The extreme case is the 1-bit representation of each weight, that is, binary weight neural
        networks. There are many works that directly train CNNs with binary weights, for instance,
        BinaryConnect [12], BinaryNet [13] and XNORNetworks [14]. The main idea is to directly learn
        binary weights or activation during the model training. The systematic study in [15] showed
        that networks trained with back propagation could be resilient to specific weight distortions,
        including binary weights.
         Drawbacks: the accuracy of the binary nets is significantly lowered when dealing with large
        CNNs such as GoogleNet. Another drawback of such binary nets is that existing binarization
        schemes are based on simple matrix approximations and ignore the effect of binarization on the
        accuracy loss.
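The k-means scalar quantization of parameter values mentioned in this section can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the procedure of any of the cited works: the function name `kmeans_quantize`, the linear centroid initialisation and the fixed iteration count are all choices made for this example.

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=4, n_iters=20):
    """Cluster scalar weights with 1-D k-means; return (codebook, index array)."""
    flat = weights.ravel()
    # Assumption of this sketch: centroids start evenly spaced over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        assignments = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(n_clusters):
            members = flat[assignments == c]
            if members.size > 0:
                centroids[c] = members.mean()
    assignments = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids, assignments.reshape(weights.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))                  # stand-in for a layer's weight matrix
codebook, indices = kmeans_quantize(W, n_clusters=4)
W_quantized = codebook[indices]              # every weight replaced by its centroid
bits_per_weight = int(np.ceil(np.log2(len(codebook))))
```

After quantization only the small codebook and one index per weight need to be stored. With four clusters each index takes 2 bits instead of a 32-bit float, at the cost of the approximation error between `W` and `W_quantized`.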


         To address this issue, the work in [16] proposed a proximal Newton algorithm with
        diagonal Hessian approximation that directly minimizes the loss with respect to the
        binary weights. The work in [17] reduced the time spent on floating-point multiplication
        in the training stage by stochastically binarizing weights and converting multiplications
        in the hidden state computation to sign changes.

        B. Pruning and Sharing

         Network pruning and sharing has been used both to reduce network complexity and to
        address the over-fitting issue. An early approach to pruning was the Biased Weight Decay
        [18]. The Optimal Brain Damage [19] and the Optimal Brain Surgeon [20] methods reduced
        the number of connections based on the Hessian of the loss function, and their work
        suggested that such pruning gave higher accuracy than magnitude-based pruning such as
        the weight decay method. Those methods trained the networks from scratch.
         A recent trend in this direction is to prune redundant, non-informative weights in a
        pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among
        neurons, and proposed a data-free pruning method to remove redundant neurons. Han et al.
        [22] proposed to reduce the total number of parameters and operations in the entire
        network. Chen et al. [23] proposed a HashedNets model that used a low-cost hash function
        to group weights into hash buckets for parameter sharing. The deep compression method in
        [10] removed the redundant connections and quantized the weights, and then used Huffman
        coding to encode the quantized weights. In [24], a simple regularization method based on
        soft weight-sharing was proposed, which included both quantization and pruning in one
        simple (re-)training procedure. The above pruning schemes typically prune individual
        connections in CNNs.
         There is also growing interest in training compact CNNs with sparsity constraints.
        Those sparsity constraints are typically introduced in the optimization problem as l0 or
        l1-norm regularizers. The work in [25] imposed a group sparsity constraint on the
        convolutional filters to achieve structured brain damage, i.e., pruning entries of the
        convolution kernels in a group-wise fashion. In [26], a group-sparse regularizer on
        neurons was introduced during the training stage to learn compact CNNs with reduced
        filters. Wen et al. [27] added a structured sparsity regularizer on each layer to reduce
        trivial filters, channels or even layers. In the filter-level pruning, all the above
        works used l2-norm regularizers. The work in [28] used the l1-norm to select and prune
        unimportant filters.
         Drawbacks: there are some potential issues with pruning and sharing. First, pruning
        with l1 or l2 regularization requires more iterations to converge than usual. In
        addition, all pruning criteria require manual setup of sensitivity for layers, which
        demands fine-tuning of the parameters and could be cumbersome for some applications.

        C. Designing Structural Matrix

         In architectures that contain fully-connected layers, it is critical to explore the
        redundancy of parameters in the fully-connected layers, which are often the bottleneck in
        terms of memory consumption. These network layers use the nonlinear transforms
        f(x, M) = σ(Mx), where σ(·) is an element-wise nonlinear operator, x is the input vector,
        and M is the m × n matrix of parameters [29]. When M is a large general dense matrix,
        storing it costs mn parameters and computing matrix-vector products takes O(mn) time.
        Thus, an intuitive way to prune parameters is to impose M as a parameterized structural
        matrix. An m × n matrix that can be described using much fewer parameters than mn is
        called a structured matrix. Typically, the structure should not only reduce the memory
        cost, but also dramatically accelerate the inference and training stage via fast
        matrix-vector multiplication and gradient computations.
         Following this direction, the work in [30], [31] proposed a simple and efficient
        approach based on circulant projections, while maintaining competitive error rates.
        Given a vector r = <<FORMULA>>, a circulant matrix R ∈ R^{d×d} is defined as:

                          <<FORMULA>>            (1)

        thus the memory cost becomes O(d) instead of O(d^2). This circulant structure also
        enables the use of the Fast Fourier Transform (FFT) to speed up the computation. Given a
        d-dimensional vector r, the above 1-layer circulant neural network in Eq. 1 has time
        complexity of O(d log d).
         In [32], a novel Adaptive Fastfood transform was introduced to reparameterize the
        matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform
        matrix R ∈ R^{n×d} was defined as:

                          <<FORMULA>>            (2)

        where S, G and B are random diagonal matrices, Π ∈ <<FORMULA>> is a random permutation
        matrix, and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected
        layer with d inputs and n outputs using the Adaptive Fastfood transform reduces the
        storage and the computational costs from O(nd) to O(n) and from O(nd) to O(n log d),
        respectively.
         The work in [29] showed the effectiveness of the new notion of parsimony in the theory
        of structured matrices. Their proposed method can be extended to various other
        structured matrix classes, including block and multi-level Toeplitz-like [33] matrices
        related to multi-dimensional convolution [34]. Following this idea, [35] proposed a
        general structured efficient linear layer for CNNs.
         Drawbacks: one problem of this kind of approach is that the structural constraint can
        hurt the performance, since the constraint might bring bias to the model. On the other
        hand, finding a proper structural matrix is difficult. There is no theoretical way to
        derive it.

        III. LOW-RANK FACTORIZATION AND SPARSITY

         Convolution operations contribute the bulk of most computations in deep CNNs, thus
        reducing the convolution layer


                                                                      TABLE II
                           COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES ON ILSVRC-2012.

                                                                    <<TABLE>>


                                       <<FIGURE>>

        Fig. 2. A typical framework of the low-rank regularization method. The left    
        is the original convolutional layer and the right is the low-rank constraint    
        convolutional layer with rank-K.                             
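
        As a rough illustration of the low-rank constraint in Fig. 2, the sketch below applies
        the same idea to a dense 2D weight matrix instead of a 4D kernel: a truncated SVD
        replaces one m x n matrix with two thin factors. The matrix sizes, the rank, the
        synthetic weights and the helper name low_rank_factorize are all assumptions made for
        this example; it is not the procedure of any cited work.

```python
import numpy as np


def low_rank_factorize(W, rank):
    """Approximate W (m x n) by A @ B with A (m x rank) and B (rank x n).

    Hypothetical helper for illustration: uses a truncated SVD, which gives
    the best rank-`rank` approximation in the Frobenius norm.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # fold the singular values into the left factor
    B = Vt[:rank, :]
    return A, B


rng = np.random.default_rng(0)
m, n, k = 256, 512, 32

# Synthetic weights (an assumption): rank-k structure plus small dense noise.
W = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
W += 0.01 * rng.standard_normal((m, n))

A, B = low_rank_factorize(W, k)

# The factorized layer computes A @ (B @ x) instead of W @ x.
x = rng.standard_normal(n)
rel_err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)

params_before = m * n        # 131072
params_after = k * (m + n)   # 24576
print(f"relative output error: {rel_err:.2e}")
print(f"parameters: {params_before} -> {params_after}")
```

        The factorized form stores k(m + n) values instead of mn, so it only pays off when the
        chosen rank k is well below mn / (m + n); how much accuracy survives depends on how
        close the trained weights really are to low rank.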
                                                      
        would improve the compression rate as well as the overall speedup. The convolution
        kernels can be viewed as a 4D tensor. Ideas based on tensor decomposition are derived
        from the intuition that there is a significant amount of redundancy in the 4D tensor,
        and decomposition is a particularly promising way to remove that redundancy. The
        fully-connected layer can be viewed as a 2D matrix, and low-rankness can also help
        there.
         Low-rank filters have long been used to accelerate convolution, for example, high
        dimensional DCT (discrete cosine transform) and wavelet systems constructed via tensor
        products from the 1D DCT transform and 1D wavelets, respectively. Learning separable 1D
        filters was introduced by Rigamonti et al. [36], following the dictionary learning idea.
        Regarding some simple DNN models, a few low-rank approximation and clustering schemes
        for the convolutional kernels were proposed in [37]. They achieved 2× speedup for a
        single convolutional layer with 1% drop in classification accuracy. The work in [38]
        proposed using different tensor decomposition schemes, reporting a 4.5× speedup with 1%
        drop in accuracy in text recognition.
         The low-rank approximation was done layer by layer. The parameters of one layer were
        fixed after it was done, and the layers above were fine-tuned based on a reconstruction
        error criterion. These are typical low-rank methods for compressing 2D convolutional
        layers, as described in Figure 2. Following this direction, Canonical Polyadic (CP)
        decomposition was proposed for the kernel tensors in [39]. Their work used nonlinear
        least squares to compute the CP decomposition. In [40], a new algorithm for computing
        the low-rank tensor decomposition for training low-rank constrained CNNs from scratch
        was proposed. It used Batch Normalization (BN) to transform the activation of the
        internal hidden units. In general, both the CP and the BN decomposition schemes in [40]
        (BN Low-rank) can be used to train CNNs from scratch. However, there are a few
        differences between them. For example, finding the best low-rank approximation in CP
        decomposition is an ill-posed problem, and the best rank-K (K is the rank number)
        approximation may not exist sometimes, while for the BN scheme the decomposition always
        exists. We perform a simple comparison of both methods, shown in Table II. The actual
        speedup and the compression rates are used to measure their performances.
         As we mentioned before, the fully connected layers can be viewed as a 2D matrix and
        thus the above mentioned methods can also be applied there. There are several classical
        works on exploiting low-rankness in fully connected layers. For instance, Misha et al.
        [41] reduced the number of dynamic parameters in deep models using the low-rank method.
        [42] explored a low-rank matrix factorization of the final weight layer in a DNN for
        acoustic modeling. In [3], Lu et al. adopted truncated SVD (singular value
        decomposition) to decompose the fully connected layer for designing compact multi-task
        deep learning architectures.
         Drawbacks: low-rank approaches are straightforward for model compression and
        acceleration. The idea complements recent advances in deep learning, such as dropout,
        rectified units and maxout. However, the implementation is not that easy since it
        involves a decomposition operation, which is computationally expensive. Another issue is
        that current methods perform low-rank approximation layer by layer, and thus cannot
        perform global parameter compression, which is important as different layers hold
        different information. Finally, factorization requires extensive model retraining to
        achieve convergence when compared to the original model.

        IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS

         CNNs are parameter efficient because they exploit the translation invariant property of
        the representations to the input image, which is the key to the success of training very
        deep models without severe over-fitting. Although a strong theory is currently missing,
        a large amount of empirical evidence supports the notion that both the translation
        invariant property and the convolutional weight sharing are important for good
        predictive performance. The idea of using transferred convolutional filters to compress
        CNN models is motivated by recent works in [43], which introduced the equivariant group
        theory. Let x be an input, Φ(·) be a network or layer and T(·) be the transform matrix.
        The concept of equivalence is defined as:

                          <<FORMULA>>            (3)

        indicating that transforming the input x by the transform T(·) and then passing it
        through the network or layer Φ(·) should give the same result as first mapping x through
        the network and then transforming the representation. Note that in Eq. (3), the
        transforms T(·) and T'(·) are not necessarily the same as they operate on different
        objects. According to this theory, it is reasonable to apply transforms to layers or
        filters Φ(·) to compress the whole network models. From empirical observation, deep CNNs
        also benefit from using a large set of convolutional filters by applying certain
        transform T(·) to a


        small set of base filters since it acts as a regularizer for the model.
         Following this direction, many recent works build a convolutional layer from a set of
        base filters [43]–[46]. What they have in common is that the transform T(·) lies in the
        family of functions that only operate in the spatial domain of the convolutional
        filters. For example, the work in [45] found that the lower convolution layers of CNNs
        learned redundant filters to extract both positive and negative phase information of an
        input signal, and defined T(·) to be the simple negation function:

                       <<FORMULA>>             (4)

        where W_x is the basis convolutional filter and W_x^- is the filter consisting of the
        shifts whose activation is opposite to that of W_x and selected after the max-pooling
        operation. By doing this, the work in [45] can easily achieve 2× compression rate on all
        the convolutional layers. It is also shown that the negation transform acts as a strong
        regularizer to improve the classification accuracy. The intuition is that the learning
        algorithm with pair-wise positive-negative constraint can lead to useful convolutional
        filters instead of redundant ones.
         In [46], it was observed that magnitudes of the responses from convolutional kernels
        had a wide diversity of pattern representations in the network, and it was not proper to
        discard weaker signals with a single threshold. Thus a multi-bias non-linearity
        activation function was proposed to generate more patterns in the feature space at low
        computational cost. The transform T(·) was defined as:

                       <<FORMULA>>             (5)

        where δ were the multi-bias factors. The work in [47] considered a combination of
        rotation by a multiple of 90° and horizontal/vertical flipping with:

                       <<FORMULA>>             (6)

        where W_{T^θ} was the transformation matrix which rotated the original filters with
        angle θ ∈ {90, 180, 270}. In [43], the transform was generalized to any angle learned
        from data, and θ was directly obtained from data. Both works [47] and [43] can achieve
        good classification performance.
         The work in [44] defined T(·) as the set of translation functions applied to 2D
        filters:

                       <<FORMULA>>             (7)

        where T(·, x, y) denoted the translation of the first operand by (x, y) along its
        spatial dimensions, with proper zero padding at borders to maintain the shape. The
        proposed framework can be used to 1) improve the classification accuracy as a
        regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly
        varying their architectures to compress networks.
         Table III briefly compares the performance of different methods with transferred
        convolutional filters, using VGGNet (16 layers) as the baseline model. The results are
        reported on CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is observed that they
        can achieve reduction in parameters with little or no drop in classification accuracy.

                                          TABLE III
              A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND CIFAR-100.

                                          <<TABLE>>

         Drawbacks: there are a few issues to be addressed for approaches that apply transform
        constraints to convolutional filters. First, these methods can achieve competitive
        performance for wide/flat architectures (like VGGNet) but not thin/deep ones (like
        GoogleNet, Residual Net). Secondly, the transfer assumptions sometimes are too strong to
        guide the learning, making the results unstable in some cases.
         Using a compact filter for convolution can directly reduce the computation cost. The
        key idea is to replace the loose and over-parametric filters with compact blocks to
        improve the speed, which significantly accelerates CNNs on several benchmarks.
        Decomposing 3 × 3 convolution into two 1 × 1 convolutions was used in [48], which
        achieved significant acceleration on object recognition. SqueezeNet [49] was proposed to
        replace 3 × 3 convolution with 1 × 1 convolution, which created a compact neural network
        with about 50× fewer parameters and comparable accuracy when compared to AlexNet.

        V. KNOWLEDGE DISTILLATION

         To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was
        first proposed by Caruana et al. [50]. They trained a compressed/ensemble model of
        strong classifiers with pseudo-data labeled, and reproduced the output of the original
        larger network. But the work is limited to shallow models. The idea has been recently
        adopted in [51] as knowledge distillation (KD) to compress deep and wide networks into
        shallower ones, where the compressed model mimicked the function learned by the complex
        model. The main idea of KD based approaches is to shift knowledge from a large teacher
        model into a small one by learning the class distributions output via softmax.
         The work in [52] introduced a KD compression framework, which eased the training of
        deep networks by following a student-teacher paradigm, in which the student was
        penalized according to a softened version of the teacher's output. The framework
        compressed an ensemble of teacher networks into a student network of similar depth. The
        student was trained to predict the output and the classification labels. Despite its
        simplicity, KD demonstrates promising results in various image classification tasks. The
        work in [53] aimed to address the network compression problem by taking advantage of
        depth neural networks. It proposed an approach to train thin but deep networks, called
        FitNets, to compress wide and shallower (but still deep) networks. The method extended
        the idea to allow for thinner and deeper student models. In order to learn from the
        intermediate representations of the teacher network, FitNet made the student mimic the
        full feature maps


        of the teacher. However, such assumptions are too strict since the capacities of teacher
        and student may differ greatly.
         All the above approaches are validated on MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW
        benchmark datasets, and experimental results show that these methods match or outperform
        the teacher's performance, while requiring notably fewer parameters and multiplications.
         There are several extensions along this direction of knowledge distillation. The work
        in [54] trained a parametric student model to approximate a Monte Carlo teacher. The
        proposed framework used online training, and used deep neural networks for the student
        model. Different from previous works which represented the knowledge using the softened
        label probabilities, [55] represented the knowledge by using the neurons in the higher
        hidden layer, which preserved as much information as the label probabilities, but are
        more compact. The work in [56] accelerated the experimentation process by
        instantaneously transferring the knowledge from a previous network to each new deeper or
        wider network. The techniques are based on the concept of function-preserving
        transformations between neural network specifications. Zagoruyko et al. [57] proposed
        Attention Transfer (AT) to relax the assumption of FitNet. They transferred the
        attention maps that are summaries of the full activations.
         Drawbacks: KD-based approaches can make deeper models thinner and help significantly
        reduce the computational cost. However, there are a few disadvantages. One of those is
        that KD can only be applied to classification tasks with softmax loss function, which
        hinders its usage. Another drawback is that the model assumptions sometimes are too
        strict to make the performance competitive with other types of approaches.

        VI. OTHER TYPES OF APPROACHES

         We first summarize the works utilizing attention-based methods. Note that the
        attention-based mechanism [58] can reduce computations significantly by learning to
        selectively focus or “attend” to a few, task-relevant input regions. The work in [59]
        introduced the dynamic capacity network (DCN) that combined two types of modules: the
        small sub-networks with low capacity, and the large ones with high capacity. The
        low-capacity sub-networks were active on the whole input to first find the task-relevant
        areas, and then the attention mechanism was used to direct the high-capacity
        sub-networks to focus on the task-relevant regions. By doing this, the size of the CNN
        model has been significantly reduced.
         Following this direction, the work in [60] introduced the conditional computation idea,
        which only computes the gradient for some important neurons. It proposed a
        sparsely-gated mixture-of-experts layer (MoE). The MoE module consisted of a number of
        experts, each a simple feed-forward neural network, and a trainable gating network that
        selected a sparse combination of the experts to process each input. In [61], dynamic
        deep neural networks (D2NN) were introduced, which were a type of feed-forward deep
        neural network that selected and executed a subset of D2NN neurons based on the input.
         There have been other attempts to reduce the number of parameters of neural networks by
        replacing the fully connected layer with global average pooling [44], [62]. Network
        architectures such as GoogleNet or Network in Network can achieve state-of-the-art
        results on several benchmarks by adopting this idea. However, these architectures have
        not fully optimized the utilization of the computing resources inside the network. This
        problem was noted by Szegedy et al. [62] and motivated them to increase the depth and
        width of the network while keeping the computational budget constant.
         The work in [63] targeted the Residual Network based model with a spatially varying
        computation time, called stochastic depth, which enabled the seemingly contradictory
        setup to train short networks and use deep networks at test time. It started with very
        deep networks, while during training, for each mini-batch, randomly dropped a subset of
        layers and bypassed them with the identity function. Following this direction, the work
        in [64] proposed pyramidal residual networks with stochastic depth. In [65], Wu et al.
        proposed an approach that learns to dynamically choose which layers of a deep network to
        execute during inference so as to best reduce total computation. Veit et al. exploited
        convolutional networks with adaptive inference graphs to adaptively define their network
        topology conditioned on the input image [66].
         Other approaches to reduce the convolutional overheads include using FFT based
        convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al.
        [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the
        pooling operations by a more general stochastic version. Saeedan et al. presented a
        novel pooling layer for convolutional neural networks termed detail-preserving pooling
        (DPP), based on the idea of inverse bilateral filters [70]. Those works only aim to
        speed up the computation but not reduce the memory storage.

        VII. BENCHMARKS, EVALUATION AND DATABASES

         In the past five years the deep learning community has made great efforts in benchmark
        models. One of the most well-known models used in compression and acceleration for CNNs
        is Alexnet [1], which has been occasionally used for assessing the performance of
        compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and
        many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300
        and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional
        layers and two fully connected layers. Recently, more and more state-of-the-art
        architectures are used as baseline models in many works, including network in networks
        (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the
        baseline models commonly used in several typical compression methods.
         The standard criteria to measure the quality of model compression and acceleration are
        the compression and the speedup rates. Assume that a is the number of the parameters in
        the original model M and a* is that of the compressed model M*, then the compression
        rate α(M, M*) of M* over M is

                          \alpha(M, M^*) = \frac{a}{a^*}.            (8)


                                          TABLE IV
              SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF
                                     NETWORK COMPRESSION.

                                          <<TABLE>>
              (Recoverable table entries: Network in network [73]: low-rank factorization [40];
              Residual networks [75]: compact filters [49], stochastic depth [63]; further
              entries: low-rank factorization [40], parameter sharing [24], parameter pruning
              [20], [22].)

        Another widely used measurement is the index space saving defined in several papers
        [30], [35] as

                          <<FORMULA>>            (9)

        where a and a* are the number of the dimension of the index space in the original model
        and that of the compressed model, respectively.
         Similarly, given the running time s of M and s* of M*, the speedup rate <<FORMULA>> is
        defined as:

                          <<FORMULA>>            (10)

        Most work used the average training time per epoch to measure the running time, while in
        [30], [35], the average testing time was used. Generally, the compression rate and
        speedup rate are highly correlated, as smaller models often result in faster computation
        for both the training and the testing stages.
         Good compression methods are expected to achieve almost the same performance as the
        original model with far fewer parameters and less computational time. However, for
        different applications with different CNN designs, the relation between parameter size
        and computational time may be different. For example, it is observed that for deep CNNs
        with fully connected layers, most of the parameters are in the fully connected layers;
        while for image classification tasks, floating-point operations are mainly in the first
        few convolutional layers since each filter is convolved with the whole image, which is
        usually very large at the beginning. Thus compression and acceleration of the network
        should focus on different types of layers for different applications.

        VIII. DISCUSSION AND CHALLENGES

         In this paper, we summarized recent efforts on compressing and accelerating deep neural
        networks (DNNs). Here we discuss more details about how to choose different compression
        approaches, and possible challenges/solutions in this area.

        A. General Suggestions

         There is no golden rule to measure which approach is the best. How to choose the proper
        method really depends on the applications and requirements. Here is some general
        guidance we can provide:
           • If the applications need compacted models from pre-trained models, you can choose
             either pruning & sharing or low rank factorization based methods. If you need
             end-to-end solutions for your problem, the low rank and transferred convolutional
             filters approaches could be considered.
           • For applications in some specific domains, methods with human prior (like the
             transferred convolutional filters, structural matrix) sometimes have benefits. For
             example, when doing medical image classification, transferred convolutional
             filters could work well as medical images (like organ) do have the rotation
             transformation property.
           • Usually the approaches of pruning & sharing could give a reasonable compression
             rate while not hurting the accuracy. Thus for applications which require stable
             model accuracy, it is better to utilize pruning & sharing.
           • If your problem involves small/medium size datasets, you can try the knowledge
             distillation approaches. The compressed student model can take the benefit of
             transferring knowledge from the teacher model, making it robust on datasets which
             are not large.
           • As we mentioned before, techniques of the four groups are orthogonal. It is
             reasonable to combine two or three of them to maximize the performance. For some
             specific applications, like object detection, which requires both convolutional
             and fully connected layers, you can compress the convolutional layers with a low
             rank based method and the fully connected layers with a pruning technique.

        B. Technique Challenges

         Techniques for deep model compression and acceleration are still in the early stage and
        the following challenges still need to be addressed.
           • Most of the current state-of-the-art approaches are built on well-designed CNN
             models, which have limited freedom to change the configuration (e.g., network
             structure, hyper-parameters). To handle more complicated tasks, it should provide
             more plausible ways to configure the compressed models.
           • Pruning is an effective way to compress and accelerate CNNs. The current pruning
             techniques are mostly designed to eliminate connections between neurons. On the
             other hand, pruning channels can directly reduce the feature map width and shrink
             the model into a thinner one. It is efficient but also challenging because
             removing channels might dramatically change the input of the following layer.
           • As we mentioned before, methods of structural matrix and transferred convolutional
             filters impose prior human knowledge on the model, which could significantly
             affect the performance and stability. It is critical to investigate how to control
             the impact of that prior knowledge.
           • The methods of knowledge distillation provide many benefits such as directly
             accelerating the model without special hardware or implementations. It is still
             worth developing KD-based approaches and exploring how to improve their
             performance.
           • Hardware constraints in various small platforms (e.g., mobile, robotic,
             self-driving car) are still a major problem


             to hinder the extension of deep CNNs. How to make full use of the limited
             computational resources and how to design special compression methods for such
             platforms are still challenges that need to be addressed.
           • Despite the great achievements of these compression approaches, the black box
             mechanism is still the key barrier to the adoption. Exploring the knowledge
             interpretability is still an important problem.

        C. Possible Solutions

         To solve the hyper-parameters configuration problem, we can rely on the recent
        learning-to-learn strategies [76], [77]. This framework provides a mechanism allowing
        the algorithm to automatically learn how to exploit structure in the problem of
        interest. Very recently, leveraging reinforcement learning to efficiently sample the
        design space and improve the model compression has also been tried [78].
         Channel pruning provides the efficiency benefit on both CPU and GPU because no special
        implementation is required. But it is also challenging to handle the input configuration.

        see more work for applications with larger deep nets (e.g., video and image frames [88],
        [89]).

        IX. ACKNOWLEDGMENTS

         The authors would like to thank the reviewers and broader community for their feedback
        on this survey. In particular, we would like to thank Hong Zhao from the Department of
        Automation of Tsinghua University for her help on modifying the paper. This research is
        supported by National Science Foundation of China with Grant number 61401169.

        REFERENCES

         [1] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep
             convolutional neural networks,” in NIPS, 2012.
         [2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to
             human-level performance in face verification,” in CVPR, 2014.
         [3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, “Fully-adaptive
             feature sharing in multi-task networks with applications in person attribute
             classification,” CoRR, vol. abs/1611.05377, 2016.
         [4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao,
                                                       M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale One possible solution is to use the training-based channel     distributed deep networks,” inNIPS, 2012.
        pruning methods [79], which focus on imposing sparse con-  [5]K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
        straints on weights during training. However, training from     recognition,”CoRR, vol. abs/1512.03385, 2015.
                                                    [6]Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing scratch for such method is costly for very deep CNNs. In     deep convolutional networks using vector quantization,”CoRR, vol.
        [80], the authors provided an iterative two-step algorithm to     abs/1412.6115, 2014.
        effectively prune channels in each layer.                 [7]Y. W. Q. H. Jiaxiang Wu, Cong Leng and J. Cheng, “Quantized
                                                       convolutional neural networks for mobile devices,” inIEEE Conference Exploring new types of knowledge in the teacher models     on Computer Vision and Pattern Recognition (CVPR), 2016.
        and transferring it to the student models is useful for the  [8]V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of
        knowledge distillation (KD) approaches. Instead of directly re-     neural networks on cpus,” inDeep Learning and Unsupervised Feature
                                                       Learning Workshop, NIPS 2011, 2011. ducing and transferring parameters, passing selectivity knowl-  [9]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep
        edge of neurons could be helpful. One can derive a way to     learning with limited numerical precision,” inProceedings of the
        select essential neurons related to the task [81], [82]. The     32Nd International Conference on International Conference on Machine
                                                       Learning - Volume 37, ser. ICML’15, 2015, pp. 1737–1746. intuition is that if a neuron is activated in certain regions [10]S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
        or samples, that implies these regions or samples share some     deep neural networks with pruning, trained quantization and huffman
        common properties that may relate to the task.              coding,”International Conference on Learning Representations (ICLR),
                                                       2016. For methods with the convolutional ﬁlters and the structural [11]Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network
        matrix, we can conclude that the transformation lies in the     quantization,”CoRR, vol. abs/1612.01543, 2016.
        family of functions that only operations on the spatial dimen- [12]M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training deep
                                                       neural networks with binary weights during propagations,” inAdvances sions. Hence to address the imposed prior issue, one solution is     in Neural Information Processing Systems 28: Annual Conference on
        to provide a generalization of the aforementioned approaches     Neural Information Processing Systems 2015, December 7-12, 2015,
        in two aspects: 1) instead of limiting the transformation to     Montreal, Quebec, Canada, 2015, pp. 3123–3131.
                                                   [13]M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural net- belong to a set of predeﬁned transformations, let it be the     works with weights and activations constrained to +1 or -1,”CoRR, vol.
        whole family of spatial transformations applied on 2D ﬁlters     abs/1602.02830, 2016.
        or matrix, and 2) learn the transformation jointly with all the [14]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net:
                                                       Imagenet classiﬁcation using binary convolutional neural networks,” in model parameters.                                  ECCV, 2016.
         Regarding the use of CNNs in small platforms, proposing [15]P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha,
        some general/uniﬁed approaches is one direction. Wanget al.     “Deep neural networks are robust to weight binarization and other non-
        [83] presented a feature map dimensionality reduction method     linear distortions,”CoRR, vol. abs/1606.01981, 2016.
                                                   [16]L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep by excavating and removing redundancy in feature maps gen-     networks,”CoRR, vol. abs/1611.01600, 2016.
        erated from different ﬁlters, which could also preserve intrinsic [17]Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks
        information of the original network. The idea can be applied     with few multiplications,”CoRR, vol. abs/1510.03009, 2015.
                                                   [18]S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network to make CNNs more applicable for different platforms. The     construction with back-propagation,” inAdvances in Neural Information
        work in [84] proposed a one-shot whole network compression     Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177–185.
        scheme consisting of three components: rank selection, low- [19]Y. L. Cun, J. S. Denker, and S. A. Solla, “Advances in neural information
                                                       processing systems 2,” D. S. Touretzky, Ed., 1990, ch. Optimal Brain rank tensor decomposition, and ﬁne-tuning to make deep     Damage, pp. 598–605.
        CNNs work in mobile devices.                      [20]B. Hassibi, D. G. Stork, and S. C. R. Com, “Second order derivatives
         Despite the classiﬁcation task, people are also adapting the     for network pruning: Optimal brain surgeon,” inAdvances in Neural
                                                       Information Processing Systems 5. Morgan Kaufmann, 1993, pp. 164– compacted models in other tasks [85]–[87]. We would like to     171.          IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 9


[21] S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural networks,” in Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. 31.1–31.12.
[22] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems, ser. NIPS’15, 2015.
[23] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick.” JMLR Workshop and Conference Proceedings, 2015.
[24] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” CoRR, vol. abs/1702.04008, 2017.
[25] V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain damage,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 2554–2564.
[26] H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact cnns,” in European Conference on Computer Vision, Amsterdam, the Netherlands, October 2016, pp. 662–677.
[27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., 2016, pp. 2074–2082.
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” CoRR, vol. abs/1608.08710, 2016.
[29] V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for small-footprint deep learning,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 3088–3096.
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, “An exploration of parameter redundancy in deep networks with circulant projections,” in International Conference on Computer Vision (ICCV), 2015.
[31] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, “Fast neural networks with circulant projections,” CoRR, vol. abs/1502.03436, 2015.
[32] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, “Deep fried convnets,” in International Conference on Computer Vision (ICCV), 2015.
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991, pp. 215–236.
[34] M. V. Rakhuba and I. V. Oseledets, “Fast multidimensional convolution in low-rank tensor formats via cross approximation,” SIAM J. Scientific Computing, vol. 37, no. 2, 2015.
[35] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, “Acdc: A structured efficient linear layer,” in International Conference on Learning Representations (ICLR), 2016.
[36] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, “Learning separable filters,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, 2013, pp. 2754–2761.
[37] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 1269–1277.
[38] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[39] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned cp-decomposition,” CoRR, vol. abs/1412.6553, 2014.
[40] C. Tai, T. Xiao, X. Wang, and W. E, “Convolutional neural networks with low-rank regularization,” vol. abs/1511.06067, 2015.
[41] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds., 2013, pp. 2148–2156. [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1053.pdf
[42] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2013.
[43] T. S. Cohen and M. Welling, “Group equivariant convolutional networks,” arXiv preprint arXiv:1602.07576, 2016.
[44] S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural networks,” in Advances In Neural Information Processing Systems, 2016, pp. 1082–1090.
[45] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” arXiv preprint arXiv:1603.05201, 2016.
[46] H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in deep neural networks,” arXiv preprint arXiv:1604.00676, 2016.
[47] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic symmetry in convolutional neural networks,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16, 2016.
[48] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the impact of residual connections on learning.” CoRR, vol. abs/1602.07261, 2016.
[49] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” CoRR, vol. abs/1612.01051, 2016.
[50] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’06, 2006, pp. 535–541.
[51] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 2654–2662.
[52] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” CoRR, vol. abs/1503.02531, 2015.
[53] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” CoRR, vol. abs/1412.6550, 2014.
[54] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, “Bayesian dark knowledge,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 3420–3428.
[55] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression by distilling knowledge from neurons,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., 2016, pp. 3560–3566.
[56] T. Chen, I. J. Goodfellow, and J. Shlens, “Net2net: Accelerating learning via knowledge transfer,” CoRR, vol. abs/1511.05641, 2015.
[57] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” CoRR, vol. abs/1612.03928, 2016.
[58] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
[59] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, “Dynamic capacity networks,” in Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 2016, pp. 2549–2558.
[60] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” 2017.
[61] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, “Deep dynamic neural networks for multimodal gesture segmentation and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, 2016.
[62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition (CVPR), 2015.
[63] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep Networks with Stochastic Depth, 2016.
[64] Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal residual networks with separated stochastic depth,” CoRR, vol. abs/1612.01230, 2016.
[65] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, “Blockdrop: Dynamic inference paths in residual networks,” in CVPR, 2018.
[66] A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,” 2018.
[67] M. Mathieu, M. Henaff, and Y. Lecun, Fast training of convolutional networks through FFTs, 2014.
[68] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 4013–4021.
[69] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Feris, “S3pool: Pooling with stochastic spatial sampling,” CoRR, vol. abs/1611.05138, 2016.
[70] F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving pooling in deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[71] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, 1998, pp. 2278–2324.
[72] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for simplicity: The all convolutional net,” CoRR, vol. abs/1412.6806, 2014.
[73] M. Lin, Q. Chen, and S. Yan, “Network in network,” in ICLR, 2014.
[74] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[75] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[76] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in Neural Information Processing Systems (NIPS), 2016.
[77] D. Ha, A. Dai, and Q. Le, “Hypernetworks,” in International Conference on Learning Representations 2016, 2016.
[78] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in The European Conference on Computer Vision (ECCV), September 2018.
[79] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,” pp. 2270–2278, 2016.
[80] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[81] Z. Huang and N. Wang, “Data-driven sparse structure selection for deep neural networks,” ECCV, 2018.
[82] Y. Chen, N. Wang, and Z. Zhang, “Darkrank: Accelerating deep metric learning via cross sample similarities transfer,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 2852–2859.
[83] Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond filters: Compact feature map for portable deep model,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 3703–3711.
[84] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” CoRR, vol. abs/1511.06530, 2015.
[85] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 742–751.
[86] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[87] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 3296–3297.
[88] Y. Cheng, Q. Fan, S. Pankanti, and A. Choudhary, “Temporal sequence modeling for video event detection,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[89] L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong, M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X. Yu, “Ibm research and columbia university trecvid-2012 multimedia event detection (med), multimedia event recounting (mer), and semantic indexing (sin) systems,” 2012.

Yu Cheng (yu.cheng@microsoft.com) currently is a Researcher at Microsoft. Before that, he was a Research Staff Member at IBM T.J. Watson Research Center. Yu got his Ph.D. from Northwestern University in 2015 and bachelor from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in the deep generative model, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. Currently he is pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. Currently his research interests are about deep learning, particularly in few-shot learning and deep generative models. He also works on a lot of applications in computer vision and robotics vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with School of Electronic Information and Communications, Wuhan, China. He received his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and a M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interest includes big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University. He serves as the Associate Dean, School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.

<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>

            Analysis and Design of Echo State Networks

            Mustafa C. Ozturk
            can@cnel.ufl.edu

            Dongming Xu
            dmxu@cnel.ufl.edu

            Jose C. Principe
            principe@cnel.ufl.edu

            Computational NeuroEngineering Laboratory, Department of Electrical and
            Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.


            The design of echo state network (ESN) parameters relies on the selec-
            tion of the maximum eigenvalue of the linearized system around zero
            (spectral radius). However, this procedure does not quantify in a sys-
            tematic manner the performance of the ESN in terms of approximation
            error. This article presents a functional space approximation framework
            to better understand the operation of ESNs and proposes an information-
            theoretic metric, the average entropy of echo states, to assess the richness
            of the ESN dynamics. Furthermore, it provides an interpretation of the
            ESN dynamics rooted in system theory as families of coupled linearized
            systems whose poles move according to the input signal dynamics. With
            this interpretation, a design methodology for functional approximation
            is put forward where ESNs are designed with uniform pole distributions
            covering the frequency spectrum to abide by the richness metric, irre-
            spective of the spectral radius. A single bias parameter at the ESN input,
            adapted with the modeling error, conﬁgures the ESN spectral radius to
            the input-output joint space. Function approximation examples compare
            the proposed design methodology versus the conventional design.


            1 Introduction

            Dynamic computational models require the ability to store and access the
            time history of their inputs and outputs. The most common dynamic neural
            architecture is the time-delay neural network (TDNN) that couples delay
            lines with a nonlinear static architecture where all the parameters (weights)
            are adapted with the backpropagation algorithm. The conventional delay
            line utilizes ideal delay operators, but delay lines with local ﬁrst-order re-
            cursive ﬁlters have been proposed by Werbos (1992) and extensively stud-
            ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera,
            1993). Chains of ﬁrst-order integrators are interesting because they effec-
            tively decrease the number of delays necessary to create time embeddings
           (Principe, 2001). Recurrent neural networks (RNNs) implement a differ-
           ent type of embedding that is largely unexplored. RNNs are perhaps the
           most biologically plausible of the artiﬁcial neural network (ANN) models
           (Anderson, Silverstein, Ritz, & Jones, 1977; Hopﬁeld, 1984; Elman, 1990),
           but are not well understood theoretically (Siegelmann & Sontag, 1991;
           Siegelmann, 1993; Kremer, 1995). One of the main practical problems with
           RNNs is the difficulty of adapting the system weights. Various algorithms,
           such as backpropagation through time (Werbos, 1990) and real-time recur-
           rent learning (Williams & Zipser, 1989), have been proposed to train RNNs;
           however, these algorithms suffer from computational complexity, resulting
           in slow training, complex performance surfaces, the possibility of instabil-
           ity, and the decay of gradients through the topology and time (Haykin,
           1998). The problem of decaying gradients has been addressed with spe-
           cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alter-
           native second-order training methods based on extended Kalman ﬁltering
           (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov,
           Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp
           et al., 1998) provide more reliable performance and have enabled practical
           applications in identiﬁcation and control of dynamical systems (Kechri-
           otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado,
           Kambhampati, & Warwick, 1995).
             Recently, two new recurrent network topologies have been proposed: the
           echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and
           the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram,
           2002). ESNs possess a highly interconnected and recurrent topology of
           nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001)
           and contain information about the history of input and output patterns.
           The outputs of these internal PEs (echo states) are fed to a memoryless but
           adaptive readout network (generally linear) that produces the network out-
           put. The interesting property of ESN is that only the memoryless readout is
           trained, whereas the recurrent topology has ﬁxed connection weights. This
           reduces the complexity of RNN training to simple linear regression while
           preserving a recurrent topology, but obviously places important constraints
           in the overall architecture that have not yet been fully studied. Similar ideas
           have been explored independently by Maass and formalized in the LSM
           architecture. LSMs, although formulated quite generally, are mostly im-
           plemented as neural microcircuits of spiking neurons (Maass et al., 2002),
           whereas ESNs are dynamical ANN models. Both attempt to model biolog-
           ical information processing using similar principles. We focus on the ESN
           formulation in this letter.

             The echo state condition is defined in terms of the spectral radius (the
           largest among the absolute values of the eigenvalues of a matrix, denoted
           by ‖·‖) of the reservoir’s weight matrix (‖W‖ < 1). This condition states
           that the dynamics of the ESN is uniquely controlled by the input, and the
           effect of the initial states vanishes. The current design of ESN parameters           
           relies on the selection of spectral radius. However, there are many possible
           weight matrices with the same spectral radius, and unfortunately they do
           not all perform at the same level of mean square error (MSE) for functional
           approximation. A similar problem exists in the design of the LSM. LSMs
           have been shown to possess universal approximation given the separation
           property (SP) for the liquid (reservoir in ESNs) and the approximation
           property (AP) for the readout (Maass et al., 2002). SP is quantiﬁed by a
           kernel-quality measure proposed in Maass, Legenstein, and Bertschinger
           (2005) that is based on the rank of a matrix formed by the system states
           corresponding to different input signals. The kernel quality is a measure
           for the complexity and diversity of nonlinear operations carried out by the
           liquid on its input stream in order to boost the classiﬁcation power of a
           subsequent linear decision hyperplane (Maass et al., 2005). A variation of
           SP has been proposed in Bertschinger and Natschläger (2004), and it has
           been argued that complex calculations can be best carried out by networks
           on the boundary between ordered and chaotic dynamics.
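
           The point that many weight matrices share one spectral radius can be
           made concrete with a short sketch. It assumes NumPy; the helper names
           (`spectral_radius`, `rescale_to_spectral_radius`), the gaussian weights,
           the matrix size, and the seed are illustrative choices, not taken from
           this letter.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of W."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

def rescale_to_spectral_radius(W, target):
    """Scale W uniformly so that its spectral radius equals `target`."""
    rho = spectral_radius(W)
    if rho == 0.0:
        raise ValueError("W has zero spectral radius and cannot be rescaled")
    return W * (target / rho)

# Two unrelated random matrices (hypothetical 50-unit reservoirs) ...
rng = np.random.default_rng(0)  # seed chosen arbitrarily
W_a = rescale_to_spectral_radius(rng.standard_normal((50, 50)), 0.9)
W_b = rescale_to_spectral_radius(rng.standard_normal((50, 50)), 0.9)

# ... end up with the same spectral radius but entirely different entries.
print(spectral_radius(W_a), spectral_radius(W_b))
```

           Both matrices satisfy the same condition on the spectral radius, yet
           nothing in that condition constrains how their eigenvalues are spread
           inside the disk.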

           In this letter, we are interested in studying the ESN for functional approx-
           imation (filters that map input functions u(·) of time on output functions y(·)
           of time). We see two major shortcomings with the current ESN approach
           that uses echo state condition as a design principle. First, the impact of ﬁxed
           reservoir parameters for function approximation means that the informa-
           tion about the desired response is conveyed only to the output projection.
           This is not optimal, and strategies to select different reservoirs for different
           applications have not been devised. Second, imposing a constraint only on
           the spectral radius is a weak condition to properly set the parameters of
           the reservoir, as experiments show (different randomizations with the same
           spectral radius perform differently for the same problem; see Figure 2).
             This letter aims to address these two problems by proposing a frame-
           work, a metric, and a design principle for ESNs. The framework is a signal
           processing interpretation of basis and projections in functional spaces to
           describe and understand the ESN architecture. According to this interpre-
           tation, the ESN states implement a set of basis functionals (representation
           space) constructed dynamically by the input, while the readout simply
           projects the desired response onto this representation space. The metric
           to describe the richness of the ESN dynamics is an information-theoretic
           quantity, the average state entropy (ASE). Entropy measures the amount of
           information contained in a given random variable (Shannon, 1948). Here,
           the random variable is the instantaneous echo state from which the en-
           tropy for the overall state (vector) is estimated. The probability density
           function (pdf) in a differential geometric framework should be thought of
           as a volume form; that is, in our case, the pdf of the state vector describes
           the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946)
           established information as a coordinate free metric in the state manifold.
           Therefore, entropy becomes a global descriptor of information that quanti-
           ﬁes the volume of the manifold deﬁned by the random variable. Due to the
           time dependency of the states, the state entropy averaged over time (ASE)
           is an appropriate estimate of the volume of the state manifold.
             The design principle speciﬁes that one should consider independently
           the correlation among the basis and the spectral radius. In the absence of any
           information about the desired response, the ESN states should be designed
           with the highest ASE, independent of the spectral radius. We interpret the
           ESN dynamics as a combination of time-varying linear systems obtained
           from the linearization of the ESN nonlinear PE in a small, local neighbor-
           hood of the current state. The design principle means that the poles of the
           linearized ESN reservoir should have uniform pole distributions to gener-
           ate echo states with the most diverse pole locations (which correspond to
           the uniformity of time constants). Effectively, this will create the least cor-
           related bases for a given spectral radius, which corresponds to the largest
           volume spanned by the basis set. When the designer has no other informa-
           tion about the desired response to set the basis, this principle distributes
           the system’s degrees of freedom uniformly in space. It approximates for
           ESNs the well-known property of orthogonal basis. The unresolved issue
           that ASE does not quantify is how to set the spectral radius, which depends
           again on the desired mapping. The concept of memory depth as explained
           in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the
           issues associated with the spectral radius. The correlation time of the de-
           sired response (as estimated by the ﬁrst zero of the autocorrelation function)
           gives an indication of the type of spectral radius required (long correlation
           time requires high spectral radius). Alternatively, a simple adaptive bias is
           added at the ESN input to control the spectral radius integrating the infor-
           mation from the input-output joint space in the ESN bases. For sigmoidal
           PEs, the bias adjusts the operating points of the reservoir PEs, which has
           the net effect of adjusting the volume of the state manifold as required to
           approximate the desired response with a small error. This letter shows that
           ESNs designed with this strategy obtain systematically better results in a
           set of experiments when compared with the conventional ESN design.


           2 Analysis of Echo State Networks

              2.1 Echo States as Bases and Projections. Let us consider the ar-
           chitecture and recursive update equation of a typical ESN more closely.
           Consider the recurrent discrete-time neural network given in Figure 1
           with M input units, N internal PEs, and L output units. The value of
           the input unit at time n is <<u(n) = [u_1(n), u_2(n), \ldots, u_M(n)]^T>>, of internal
           units are <<x(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T>>, and of output units are
           <<y(n) = [y_1(n), y_2(n), \ldots, y_L(n)]^T>>. The connection weights are given in an N×M
           weight matrix <<W_{in} = (w^{in}_{ij})>> for connections between the input and the inter-
           nal PEs, in an N×N matrix <<W = (w_{ij})>> for connections between the internal
           PEs, in an L×N matrix <<W_{out} = (w^{out}_{ij})>> for connections from PEs to the
           output units, and in an N×L matrix <<W_{back} = (w^{back}_{ij})>> for the connections
           that project back from the output to the internal PEs (Jaeger, 2001).

           [Figure 1 labels: Input Layer, Dynamical Reservoir, Read-out]

                             <<FIGURE>>

           Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed-
           weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent net-
           work is a reservoir of highly interconnected dynamical components, states of
           which are called echo states. The memoryless linear readout is trained to pro-
           duce the output.

           The activation of the internal PEs (echo state) is updated according to

                             <<x(n+1) = f(W_{in} u(n+1) + W x(n) + W_{back} y(n))>>,             (2.1)

           where f = (f_1, f_2, ..., f_N) are the internal PEs’ activation functions. Here, all
           f_i’s are hyperbolic tangent functions, <<(e^x - e^{-x}) / (e^x + e^{-x})>>. The output from the readout
           network is computed according to

               <<y(n+1)=f_out (W_out x(n+1))>>,                           (2.2)

           where <<f_{out} = (f_{out,1}, f_{out,2}, \ldots, f_{out,L})>> are the output units’ nonlinear functions (Jaeger, 2001, 2002a).
           Generally, the readout is linear so f_out is identity.
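
           The two update equations can be sketched in a few lines. This is a
           minimal illustration, assuming NumPy, tanh PEs, an identity readout
           nonlinearity, and a state update of the form x(n+1) = f(W_in u(n+1) +
           W x(n) + W_back y(n)). The sizes, the weight distributions, the seed,
           and the function names `esn_step` and `esn_readout` are hypothetical.

```python
import numpy as np

def esn_step(x, u_next, y, W_in, W, W_back):
    """One reservoir update: x(n+1) = tanh(W_in u(n+1) + W x(n) + W_back y(n))."""
    return np.tanh(W_in @ u_next + W @ x + W_back @ y)

def esn_readout(x_next, W_out):
    """Linear readout y(n+1) = W_out x(n+1), with f_out taken as the identity."""
    return W_out @ x_next

M, N, L = 1, 20, 1                       # input units, internal PEs, output units
rng = np.random.default_rng(0)           # arbitrary seed
W_in = rng.choice([-1.0, 1.0], size=(N, M))
W = rng.uniform(-0.5, 0.5, size=(N, N))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))   # illustrative spectral radius of 0.8
W_back = np.zeros((N, L))                # no output feedback in this sketch
W_out = rng.standard_normal((L, N))      # untrained placeholder readout

x = np.zeros(N)
y = np.zeros(L)
for n in range(50):
    u_next = np.array([np.sin(2 * np.pi * (n + 1) / 20)])
    x = esn_step(x, u_next, y, W_in, W, W_back)
    y = esn_readout(x, W_out)
```

           Only `W_out` would be adapted during training; `W_in`, `W`, and
           `W_back` stay at their initial values.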
             ESNs resemble the RNN architecture proposed in Puskorius and
           Feldkamp (1996) and also used by Sanchez (2004) in brain-machine
           interfaces. The critical difference is the dimensionality of the hidden re-
           current PE layer and the adaptation of the recurrent weights. We submit
           that the ideas of approximation theory in functional spaces (bases and pro-
           jections), so useful in adaptive signal processing (Principe, 2001), should
           be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued
           function of a real-valued vector

              <<u(t) = [u_1(t), u_2(t), \ldots, u_M(t)]^T>>.

           In functional approximation, the goal is to estimate the behavior of h(u(t))
           as a combination of simpler functions ϕ_i(t), called the basis functionals,
           such that its approximant, ĥ(u(t)), is given by

                   <<\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(t)>>.

           Here, a_i’s are the projections of h(u(t)) onto each basis function. One of
           the central questions in practical functional approximation is how to choose
           the set of bases to approximate a given desired signal. In signal processing,
           the choice normally goes for a complete set of orthogonal basis, independent
           of the input. When the basis set is complete and can be made as large
           as required, ﬁxed bases work wonders (e.g., Fourier decompositions). In
           neural computing, the basic idea is to derive the set of bases from the
           input signal through a multilayered architecture. For instance, consider a
           single hidden layer TDNN with N PEs and a linear output. The hidden-
           layer PE outputs can be considered a set of nonorthogonal basis functionals
           dependent on the input,

                    <<\varphi_i(u(t)) = g\left( \sum_j b_{ij} u_j(t) \right)>>.

           b_ij’s are the input layer weights, and g is the PE nonlinearity. The approxi-
           mation produced by the TDNN is then

                    <<\hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(u(t))>>,                                (2.3)

           where a_i’s are the weights of the output layer. Notice that the b_ij’s adapt
           the bases and the a_i’s adapt the projection in the projection space. Here the
           goal is to restrict the number of bases (number of hidden layer PEs) because
           their number is coupled with the number of parameters to adapt, which
           has an impact on generalization and training set size, for example. Usually,
           since all of the parameters of the network are adapted, the best basis in the
           joint (input and desired signals) space as well as the best projection can be
           achieved and represents the optimal solution. The output of the TDNN is
           a linear combination of its internal representations, but to achieve a basis
           set (even if nonorthogonal), linear independence among the ϕ_i(u(t))’s must
           be enforced. Ito, Shah and Poon, and others have shown that this is indeed
           the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
           the scope of this article.
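
           The split between bases and projection can be illustrated with a small
           sketch. It assumes NumPy and a tanh nonlinearity for g. Unlike a
           trained TDNN, it keeps the input-layer weights B fixed at random
           values and solves only for the projection weights a by least squares.
           The target function, the sizes, and the names `tdnn_bases` and
           `tdnn_output` are hypothetical.

```python
import numpy as np

def tdnn_bases(U, B, g=np.tanh):
    """Basis functionals phi_i(u(t)) = g(sum_j b_ij u_j(t)); U is (T, M), B is (N, M)."""
    return g(U @ B.T)                    # shape (T, N), one column per basis

def tdnn_output(U, B, a, g=np.tanh):
    """Approximant h_hat(u(t)) = sum_i a_i phi_i(u(t))."""
    return tdnn_bases(U, B, g) @ a

rng = np.random.default_rng(1)           # arbitrary seed
T, M, N = 200, 3, 8
U = rng.uniform(-1, 1, size=(T, M))      # illustrative input vectors u(t)
B = rng.standard_normal((N, M))          # input-layer weights, fixed in this sketch
h = np.sin(U.sum(axis=1))                # hypothetical target h(u(t))

Phi = tdnn_bases(U, B)
a, *_ = np.linalg.lstsq(Phi, h, rcond=None)   # least-squares projection onto the bases
h_hat = tdnn_output(U, B, a)
print("residual energy:", float(np.sum((h - h_hat) ** 2)))
```

           Adapting `B` as well, as a fully trained TDNN does, would move the
           bases themselves toward the target rather than only the projection.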

             The ESN (and the RNN) architecture can also be studied in this frame-
           work. The states of equation 2.1 correspond to the basis set, which are
           recursively computed from the input, output, and previous states through
           W_in, W, and W_back. Notice, however, that none of these weight matrices is
           adapted, that is, the functional bases in the ESN are uniquely deﬁned by the
           input and the initial selection of weights. In a sense, ESNs are trading the
           adaptive connections in the RNN hidden layer by a brute force approach
           of creating ﬁxed diversiﬁed dynamics in the hidden layer.
             For an ESN with a linear readout network, the output equation (y(n+1) =
           W_out x(n+1)) has the same form as equation 2.3, where the ϕ_i’s and
           a_i’s are replaced by the echo states and the readout weights, respectively.
           The readout weights are adapted in the training data, which means that the
           ESN is able to ﬁnd the optimal projection in the projection space, just like
           the RNN or the TDNN.

             A similar perspective of basis and projections for information processing
           in biological networks has been proposed by Pouget and Sejnowski (1997).
           They explored the possibility that the response of neurons in parietal cortex
           serves as basis functions for the transformations from the sensory input
           to the motor responses. They proposed that “the role of spatial represen-
           tations is to code the sensory inputs and posture signals in a format that
           simpliﬁes subsequent computation, particularly in the generation of motor
           commands”.

             The central issue in ESN design is exactly the nonadaptive nature of
           the basis set. Parameter sets in the reservoir that provide linearly inde-
           pendent states and possess a given spectral radius may deﬁne drastically
           different projection spaces because the correlation among the bases is not
           constrained. A simple experiment was designed to demonstrate that the se-
           lection of the ESN parameters by constraining the spectral radius is not the
           most suitable for function approximation. Consider a 100-unit ESN where
           the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let
           the ESN generate the seventh power of the input signal. Different realiza-
           tions of a randomly connected 100-unit ESN were constructed where the
           entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025,
           and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input
           weights are set to +1 or −1 with equal probabilities, and W_back is set to
           zero. Input is applied for 300 time steps, and the echo states are calculated
           using equation 2.1. The next step is to train the linear readout.

                                      <<FIGURE>>

           Figure 2: Performances of ESNs for different realizations of W with the same
           weight distribution. The weight values are set to 0.4, −0.4, and 0 with proba-
           bilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius
           of 0.88. In the 50 realizations, MSEs vary from 5.9 × 10^{−9} to 8.9 × 10^{−5}. Results
           show that for each set of random weights that provide the same spectral ra-
           dius, the correlation or degree of redundancy among the bases will change, and
           different performances are encountered in practice.

           One method to determine the optimal output weight matrix, W_out, in the mean square
           error (MSE) sense (where MSE is defined by <<FORMULA>>) is to use
           the Wiener solution given by Haykin (2001):

                                        <<W_{out} = E[x x^T]^{-1} E[x d]>>.                    (2.4)

           Here, E[·] denotes the expected value operator, and d denotes the desired
           signal. Figure 2 depicts the MSE values for 50 different realizations of
           the ESNs. As observed, even though each ESN has the same sparseness
           and spectral radius, the MSE values obtained vary greatly among differ-
           ent realizations. The minimum MSE value obtained among the 50 realiza-
           tions is 5.9 × 10^{−9}, whereas the maximum MSE is 8.9 × 10^{−5}. This experiment
           demonstrates that a design strategy that is based solely on the spectral
           radius is not sufficient to specify the system architecture for function ap-
           proximation. This shows that for each set of random weights that provide
           the same spectral radius, the correlation or degree of redundancy among the
           bases will change, and different performances are encountered in practice.
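
           A reduced version of this experiment can be sketched as follows. The
           sketch assumes NumPy, no washout period, and plain mean squared error
           without any scaling factor. Rescaling each W to a spectral radius of
           exactly 0.88 is an assumption about how the realizations are
           equalized. Least squares (`np.linalg.lstsq`) stands in for the
           sample-based Wiener solution because the state correlation matrix can
           be poorly conditioned. Function names and seeds are hypothetical, and
           only 5 realizations are run instead of 50.

```python
import numpy as np

def make_reservoir(n_units, rng, rho=0.88):
    """Sparse W with entries 0.4, -0.4, 0 (probabilities 0.025, 0.025, 0.95), rescaled to rho."""
    W = rng.choice([0.4, -0.4, 0.0], size=(n_units, n_units), p=[0.025, 0.025, 0.95])
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

def run_reservoir(W, W_in, u):
    """Collect echo states x(n+1) = tanh(W_in u(n+1) + W x(n)), with W_back = 0."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(u), W.shape[0]))
    for n, u_n in enumerate(u):
        x = np.tanh(W_in * u_n + W @ x)
        states[n] = x
    return states

def train_readout(states, d):
    """Least-squares readout weights (stand-in for the sample-based Wiener solution)."""
    w_out, *_ = np.linalg.lstsq(states, d, rcond=None)
    return w_out

n_units, n_steps = 100, 300
u = np.sin(2 * np.pi * np.arange(1, n_steps + 1) / (10 * np.pi))
d = u ** 7                                # desired response: seventh power of the input

mses = []
for seed in range(5):                     # the text uses 50 realizations
    rng = np.random.default_rng(seed)
    W = make_reservoir(n_units, rng)
    W_in = rng.choice([1.0, -1.0], size=n_units)
    X = run_reservoir(W, W_in, u)
    w_out = train_readout(X, d)
    mses.append(float(np.mean((d - X @ w_out) ** 2)))

print(["%.2e" % m for m in mses])
```

           The spread of the printed values across seeds gives a rough picture of
           how much performance can differ among reservoirs that share one
           spectral radius.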

             2.2 ESN Dynamics as a Combination of Linear Systems. It is well
           known that the dynamics of a nonlinear system can be approximated by
           that of a linear system in a small neighborhood of an equilibrium point
           (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis
           with hyperbolic tangent nonlinearities and approximate the ESN dynam-
           ics by the dynamics of the linearized system in the neighborhood of the
           current system state. Hence, when the system operating point varies over
           time, the linear system approximating the ESN dynamics changes. We are
           particularly interested in the movement of the poles of the linearized ESN.
           Consider the update equation for the ESN without output feedback given
           by

               <<x(n+1)=f(Win u(n+1)+Wx(n))>>.

           Linearizing the system around the current state x(n), one obtains the
           Jacobian matrix, <<J(n+1)>>, defined by

                <<J(n+1) = [ f'(net_i(n)) w_{ij} ]_{i,j=1}^{N} = F(n) W, \quad F(n) = diag(f'(net_1(n)), \ldots, f'(net_N(n)))>>.    (2.5)

           Here, net_i(n) is the ith entry of the vector <<(W_in u(n+1)+Wx(n))>>, and w_ij
           denotes the (i,j)th entry of W. The poles of the linearized system at time
           n+1 are given by the eigenvalues of the Jacobian matrix J(n+1).¹ As the
           amplitude of each PE changes, the local slope changes, and so the poles of
           the linearized system are time varying, although the parameters of ESN are
           fixed.

             1 The transfer function of a linear system <<x(n+1)=Ax(n)+Bu(n)>> is
           <<X(z)/U(z) = (zI-A)^{-1} B = Adjoint(zI-A) B / det(zI-A)>>. The poles of the
           transfer function can be obtained by solving <<det(zI−A)=0>>.
           The solution corresponds to the eigenvalues of A.

           In order to visualize the movement of the poles, consider an ESN with
           100 states. The entries of the internal weight matrix are chosen to be 0,
           0.4 and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a
           spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with
           equal probabilities. A sinusoidal signal with a period of 100 is fed to the
           system, and the echo states are computed according to equation 2.1. Then
           the Jacobian matrix and the eigenvalues are calculated using equation 2.5.
           Figure 3 shows the pole tracks of the linearized ESN for different input
           values. A single ESN with ﬁxed parameters implements a combination of
           many linear systems with varying pole locations, hence many different
           time constants that modulate the richness of the reservoir of dynamics as a
           function of input amplitude. Higher-amplitude portions of the signal tend
           to saturate the nonlinear function and cause the poles to shrink toward
           the origin of the z-plane (decreases the spectral radius), which results in a
           system with a large stability margin. When the input is close to zero, the
           poles of the linearized ESN are close to the maximal spectral radius chosen,
           decreasing the stability margin. When compared to their linear counterpart,
           an ESN with the same number of states results in a detailed coverage of
           the z-plane dynamics, which illustrates the power of nonlinear systems.
           Similar results can be obtained using signals of different shapes at the ESN
           input.
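
           The pole movement described above can be reproduced numerically. The
           sketch below assumes NumPy and tanh PEs, so the local slope is
           1 − tanh²(net_i(n)), and it takes the Jacobian as J(n+1) = F(n) W with
           F(n) the diagonal matrix of local slopes. The function name
           `linearized_poles`, the seed, and the number of simulated steps are
           illustrative.

```python
import numpy as np

def linearized_poles(W, W_in, x, u_next):
    """Eigenvalues of J(n+1) = F(n) W, with F(n) = diag(1 - tanh(net_i(n))**2)."""
    net = W_in * u_next + W @ x
    slopes = 1.0 - np.tanh(net) ** 2      # derivative of tanh at each PE's operating point
    J = slopes[:, None] * W               # row i of W scaled by the slope of PE i
    return np.linalg.eigvals(J)

rng = np.random.default_rng(0)            # arbitrary seed
N = 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))   # scale to a spectral radius of 0.95
W_in = rng.choice([1.0, -1.0], size=N)

u = np.sin(2 * np.pi * np.arange(400) / 100)        # sinusoid with a period of 100
x = np.zeros(N)
radii = []
for n in range(len(u) - 1):
    poles = linearized_poles(W, W_in, x, u[n + 1])
    radii.append(np.max(np.abs(poles)))   # largest pole magnitude at this operating point
    x = np.tanh(W_in * u[n + 1] + W @ x)
radii = np.array(radii)

print("largest pole magnitude ranges from", radii.min(), "to", radii.max())
```

           Plotting `poles` in the complex plane at selected time steps would
           give pole tracks of the kind discussed in the text.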
             A key corollary of the above analysis is that the spectral radius of an
           ESN can be adjusted using a constant bias signal at the ESN input without
           changing the recurrent connection matrix, W. The application of a nonzero
           constant bias will move the operating point to regions of the sigmoid func-
           tion closer to saturation and always decrease the spectral radius due to the
           shape of the nonlinearity.² The relevance of bias in terms of overall system
           performance has also been discussed in Jaeger (2002b) and Bertschinger
           and Natschläger (2004), but here we approach it from a system theory per-
           spective and explain its effect on reservoir dynamics.

           3 Average State Entropy as a Measure of the Richness of ESN Reservoir

           Previous research was aware of the inﬂuence of diversity of the recurrent
           layer outputs on the overall performance of ESNs and LSMs. Several met-
           rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,


             2 Assume W has nondegenerate eigenvalues and corresponding linearly independent
           eigenvectors. Then consider the eigendecomposition of W, where <<FORMULA>>, P is the
           eigenvector matrix and D is the diagonal matrix of eigenvalues <<FORMULA>> of W. Since F(n) and D
           are diagonal, <<FORMULA>> is the eigendecomposition
           of <<J(n+1)>>. Here, each entry of <<FORMULA>>, is an eigenvalue of J. Therefore,
           <<FORMULA>> since <<FORMULA>>.

                              <<FIGURE>>

           Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input
           goes through a cycle. An ESN with ﬁxed parameters implements a combination
           of linear systems with varying pole locations. (A) One cycle of sinusoidal signal
           with a period of 100. (B–E) The positions of poles of the linearized systems
           when the input values are at B, C, D, and E in Figure 3A. (F) The cumulative
           pole locations show the movement of the poles as the input changes. Due to
           the varying pole locations, different time constants modulate the richness of
           the reservoir of dynamics as a function of input amplitude. Higher-amplitude
           signals tend to saturate the nonlinear function and cause the poles to shrink
           toward the origin of the z-plane (decreases the spectral radius), which results in
           a system with a large stability margin. When the input is close to zero, the poles
           of the linearized ESN are close to the maximal spectral radius chosen, decreasing
           the stability margin. An ESN with more states results in a detailed coverage of
           the z-plane dynamics, which illustrates the power of nonlinear systems, when
           compared to their linear counterpart.

           Here, our approach of bases and projections leads to a new metric.
           We propose the instantaneous state entropy to quantify the distribution of
           instantaneous amplitudes across the ESN states. Entropy of the instanta-
           neous ESN states is appropriate to quantify performance in function ap-
           proximation because the ESN output is a mere weighted combination of
           the instantaneous value of the ESN states. If the echo state’s instantaneous
           amplitudes are concentrated on only a few values across the ESN state dy-
           namic range, the ability to approximate an arbitrary desired response by
           weighting the states is limited (and wasteful due to redundancy between
           the different states), and performance will suffer. On the other hand, if the
           ESN states provide a diversity of instantaneous amplitudes, it is much eas-
           ier to achieve the desired mapping. Hence, the instantaneous entropy of the
           states appears as a good measure to quantify the richness of dynamics with
           instantaneous mappers. Due to the time structure of signals, the average
           state entropy (ASE), deﬁned as the state entropy averaged over time, will be
           the parameter used to quantify the diversity in the dynamical reservoir of
           the ESN. Moreover, entropy has been proposed as an appropriate measure
           of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE
           measures the volume of the echo state manifold spanned by trajectories.
             Renyi’s quadratic entropy is employed here because it is a global measure
           of information. In addition, an efficient nonparametric estimator of Renyi’s
           entropy, which avoids explicit pdf estimation, has been developed (Principe,
           Xu, & Fisher, 2000). Renyi’s entropy with parameter γ for a random variable
           X with a pdf <<f_X(x)>> is given by Renyi (1970):


                        <<H_\gamma(X) = \frac{1}{1-\gamma} \log \int f_X^\gamma(x) \, dx>>.


           Renyi’s quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon’s en-
           tropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the un-
           known pdf to be estimated, Parzen windowing approximates the underly-
           ing pdf by

                        <<\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i)>>,

           where K_σ is the kernel function with the kernel size σ. Then the Renyi’s
           quadratic entropy can be estimated by (Principe et al., 2000)

                          <<H_2(X) = -\log\left( \frac{1}{N^2} \sum_j \sum_i K_\sigma(x_j - x_i) \right)>>.          (3.1)


             The instantaneous state entropy is estimated using equation 3.1 where
           the samples are the entries of the state vector <<x(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T>>
           of an ESN with N internal PEs. Results will be shown with a gaussian kernel
           with kernel size chosen to be 0.3 of the standard deviation of the entries
           of the state vector. We will show that ASE is a more sensitive parameter to
           quantify the approximation properties of ESNs by experimentally demon-
           strating that ESNs with different spectral radius and even with the same
           spectral radius display different ASEs.
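
           A minimal sketch of the estimator and of ASE follows. It assumes
           NumPy, a gaussian kernel of size σ applied directly to the pairwise
           differences of the samples, and a kernel size of 0.3 times the
           standard deviation of the state entries, as stated above. The function
           names and test data are hypothetical, and a state vector whose
           entries are all equal (zero standard deviation) is not handled.

```python
import numpy as np

def renyi_quadratic_entropy(samples, sigma):
    """Parzen estimate H2 = -log((1/N^2) sum_j sum_i K_sigma(x_j - x_i)), gaussian K."""
    x = np.asarray(samples, dtype=float)
    diffs = x[:, None] - x[None, :]       # all pairwise differences x_j - x_i
    kernel = np.exp(-diffs ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return float(-np.log(kernel.mean()))

def instantaneous_state_entropy(state, kernel_fraction=0.3):
    """Entropy of the entries of x(n); kernel size = kernel_fraction * std of the entries."""
    sigma = kernel_fraction * np.std(state)
    return renyi_quadratic_entropy(state, sigma)

def average_state_entropy(states, kernel_fraction=0.3):
    """ASE: instantaneous state entropy averaged over time; `states` has shape (T, N)."""
    return float(np.mean([instantaneous_state_entropy(s, kernel_fraction) for s in states]))

rng = np.random.default_rng(0)            # arbitrary seed
spread = rng.uniform(-1.0, 1.0, size=100) # amplitudes covering the dynamic range
narrow = 0.1 * spread                     # same shape, concentrated near zero

print(instantaneous_state_entropy(spread), instantaneous_state_entropy(narrow))
print(average_state_entropy(np.stack([spread, narrow])))
```

           Under these assumptions, shrinking all amplitudes by a factor c lowers
           the estimate by log(1/c), so states crowded into a narrow range score
           lower than states that cover the dynamic range.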

             Let us consider the same 100-unit ESN that we used in the previous
           section built with three different spectral radii 0.2, 0.5, 0.8 with an input
           signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks.
           The instantaneous state entropy is also calculated at each time step using
           equation 3.1 and plotted in Figure 4B. First, note that the instantaneous
           state entropy changes over time with the distribution of the echo states as
           we would expect, since state entropy is dependent on the input signal that
           also changes in this case. Second, as the spectral radius increases in the
           simulation, the diversity in the echo states increases. For the spectral radius
           of 0.2, echo state’s instantaneous amplitudes are concentrated on only a
           few values, which is wasteful due to redundancy between different states.
           In practice, to quantify the overall representation ability over time, we will
           use ASE, which takes values −0.735, −0.007, and 0.335 for the spectral
           radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral
           radius, several ASEs are possible. Figure 4C shows ASEs from 50 different
           realizations of ESNs with the same spectral radius of 0.5, which means that
           ASE is a ﬁner descriptor of the dynamics of the reservoir. Although we
           have presented an experiment with sinusoidal signal, similar results are
           obtained for other inputs as long as the input dynamic range is properly
           selected.

             Maximizing ASE means that the diversity of the states over time is the
           largest and should provide a basis set that is as uncorrelated as possible.
           This condition is unfortunately not a guarantee that the ESN so designed
           will perform the best, because the basis set in ESNs is created independent
           of the desired response and the application may require a small spectral
           radius. However, we maintain that when the desired response is not ac-
           cessible for the design of the ESN bases or when the same reservoir is
           to be used for a number of problems, the default strategy should be to
           maximize the ASE of the state vector. The following section addresses
           the design of ESNs with high ASE values and a simple mechanism to
           adjust the reservoir dynamics without changing the recurrent connection
           weights.

           4 Designing Echo State Networks

             4.1 Design of the Echo State Recurrent Connections. According to the
           interpretation of ESNs as coupled linear systems, the design of the internal
           connection matrix, W, will be based on the distribution of the poles of the
           linearized system around zero state. Our proposal is to design the ESN
           such that the linearized system has uniform pole distribution inside the
           unit circle of the z-plane. With this design scenario, the system dynamics
           will include uniform coverage of time constants arising from the uniform
           distribution of the poles, which also decorrelates as much as possible the
           basis functionals. This principle was chosen by analogy to the identiﬁcation
           of linear systems using Kautz filters (Kautz, 1954), which shows that the best
           approximation of a given transfer function by a linear system with ﬁnite
           order is achieved when poles are placed in the neighborhood of the spectral
           resonances. When no information is available about the desired response,
           we should uniformly spread the poles to anticipate good approximation to
           arbitrary mappings.

             We again use a maximum entropy principle to distribute the poles inside
           the unit circle uniformly. The constraints of a circle as boundary conditions
           for discrete linear systems and complex conjugate locations are easy to
           include for the pole distribution (Thogula, 2003). The poles are ﬁrst initial-
           ized at random locations; the quadratic Renyi’s entropy is calculated by
           equation 3.1, and poles are moved such that the entropy of the new
           distribution is increased over iterations (Erdogmus & Principe, 2002). This
           method efficiently finds a uniform coverage of the unit circle with an
           arbitrary number of poles. The system with the uniform pole locations can be
           interpreted using linear system theory. The poles that are close to the unit
           circle correspond to many sharp bandpass ﬁlters specializing in different
           frequency regions, whereas the inner poles realize ﬁlters of larger frequency
           support. Moreover, different orientations (angles) of the poles create ﬁlters
           of different center frequencies.
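
           As a rough illustration of the quantity being maximized, the sketch below
           estimates the quadratic Renyi entropy of a set of pole locations with a
           Gaussian Parzen window. Equation 3.1 is not reproduced in this section, so
           the estimator form, the kernel size `sigma`, and the helper names
           `renyi_quadratic_entropy` and `poles_to_points` are assumptions rather than
           the exact expression used in the paper. The iterative pole-moving step is
           not shown; the example only compares a well-spread pole set with a
           clustered one.

```python
import numpy as np

def renyi_quadratic_entropy(points, sigma=0.1):
    """Parzen-window estimate of Renyi's quadratic entropy for points of shape (n, d)."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    # Pairwise squared distances between all samples.
    diff = pts[:, None, :] - pts[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Convolving two Gaussian kernels of variance sigma^2 gives variance 2*sigma^2.
    var = 2.0 * sigma ** 2
    kernel = np.exp(-sq_dist / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    # The mean kernel value is the information potential; entropy is its negative log.
    return -np.log(kernel.mean())

def poles_to_points(poles):
    """Represent complex poles as 2-D points (real part, imaginary part)."""
    poles = np.asarray(poles, dtype=complex)
    return np.column_stack([poles.real, poles.imag])

rng = np.random.default_rng(0)
# Assumption: area-uniform random samples in the unit disk stand in for well-spread poles.
radius = np.sqrt(rng.uniform(0.0, 1.0, 50))
angle = rng.uniform(0.0, 2.0 * np.pi, 50)
spread_poles = radius * np.exp(1j * angle)
clustered_poles = 0.05 * spread_poles  # same pattern squeezed toward the origin

h_spread = renyi_quadratic_entropy(poles_to_points(spread_poles))
h_clustered = renyi_quadratic_entropy(poles_to_points(clustered_poles))
print(h_spread > h_clustered)
```

           Under these assumptions the spread-out set has the larger entropy estimate,
           which is the direction in which the poles would be moved.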

             Now the problem is to construct an internal weight matrix from the pole
           locations (eigenvalues of W). In principle, we would like to create a sparse

                                    <<FIGURE>>

           Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs
           of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and
           0.8, from top to bottom, respectively. The diversity of echo states increases
           when the spectral radius increases. Within the dynamic range of the echo states,
           systems with smaller spectral radius can generate only uneven representations,
           while for a spectral radius of 0.8, outputs of echo states are distributed
           almost uniformly within their
           dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1.
           Information contained in the echo states is changing over time according to the
           input amplitude. Therefore, the richness of representation is controlled by the
           input amplitude. Moreover, the value of ASE increases with spectral radius.
           (C) ASEs from 50 different realizations of ESNs with the same spectral radius
           of 0.5. The plot shows that ASE is a ﬁner descriptor of the dynamics of the
           reservoir than the spectral radius. 

           matrix, so we started with the sparsest matrix (with an inverse), which is
           the direct canonical structure given by Kailath (1980):

                  <<FORMULA>>                         (4.1)

           The characteristic polynomial of W is

                  <<FORMULA>>,                        (4.2)

           where the p_i's are the eigenvalues and the a_i's are the coefficients of the
           characteristic polynomial of W. Here, we know the pole locations of the linear
           system obtained from the linearization of the ESN, so using equation 4.2, we
           can obtain the characteristic polynomial and construct the W matrix in the
           canonical form using equation 4.1. We will call the ESN constructed based on
           the uniform pole principle ASE-ESN. All other possible solutions with the
           same eigenvalues can be obtained by Q^−1 W Q, where Q is any nonsingular
           matrix.
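
           A minimal sketch of this construction, assuming the usual companion
           (controller canonical) form with the negated polynomial coefficients in the
           first row and ones on the subdiagonal. The exact layout of equation 4.1 is
           not reproduced here, so the row placement, the example pole set, and the
           helper name `companion_from_poles` are assumptions. `numpy.poly` expands the
           product of the factors (s − p_i) into the coefficients a_i.

```python
import numpy as np

def companion_from_poles(poles):
    """Build a companion-form matrix whose eigenvalues are the given poles."""
    poles = np.asarray(poles, dtype=complex)
    # np.poly returns [1, a_1, ..., a_N] for prod_i (s - p_i).
    coeffs = np.poly(poles)
    # Conjugate-symmetric pole sets give real coefficients (up to round-off).
    a = np.real(coeffs[1:])
    n = len(a)
    W = np.zeros((n, n))
    W[0, :] = -a                                 # first row holds -a_1 ... -a_N
    W[np.arange(1, n), np.arange(n - 1)] = 1.0   # ones on the subdiagonal
    return W

# Hypothetical pole set: two conjugate pairs and two real poles inside the unit circle.
poles = np.array([0.5 + 0.4j, 0.5 - 0.4j, -0.3 + 0.6j, -0.3 - 0.6j, 0.8, -0.7])
W = companion_from_poles(poles)
eigenvalues = np.linalg.eigvals(W)
spectral_radius = np.max(np.abs(eigenvalues))
print(spectral_radius)
```

           The matrix has only 2N − 1 nonzero entries at most, which matches the
           sparsity motivation; for large N the polynomial expansion can become
           numerically delicate, so this sketch is best read with small pole sets.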

             To corroborate our hypothesis, we would like to show that the linearized
           ESN designed with the recurrent weight matrix having the eigenvalues
           uniformly distributed inside the unit circle creates higher ASE values for a
           given spectral radius compared to other ESNs with random internal
           connection weight matrices. We consider an ESN with 30 states and use our
           procedure to create the W matrix for ASE-ESN for different spectral radii
           in the range <<[0.1, 0.95]>>. Similarly, we constructed ESNs with sparse
           random W matrices with different sparseness constraints. This corresponds
           to a weight distribution having the values 0, c, and −c with probabilities
           <<p_1>>, <<(1−p_1)/2>>, and <<(1−p_1)/2>>, where p_1 defines the sparseness
           of W and c is a constant that takes a specific value depending on the
           spectral radius. We also created W matrices with values uniformly
           distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral
           radius (Jaeger & Hass, 2004). Then, for different W_in matrices, we run the
           ASE-ESNs with the sinusoidal input given in section 3 and calculate ASE.
           Figure 5 compares the ASE values averaged over 1000 realizations. As
           observed from the figure, the ASE-ESN
           with uniform pole distribution generates higher ASE on average for all
           spectral radii compared to ESNs with sparse and uniform random connec-
           tions. This approach is indeed conceptually similar to Jeffreys’ maximum
           entropy prior (Jeffreys, 1946): it will provide a consistently good response
           for the largest class of problems. Concentrating the poles of the linearized


                                    <<FIGURE>>

           Figure 5: Comparison of ASE values obtained for ASE-ESN having W with
           uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN
           with uniformly distributed weights between −1 and 1. Randomly generated
           weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the
           networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole
           distribution generates a higher ASE on average for all spectral radii compared
           to ESNs with random connections.


           system in certain regions of the space provides good performance only if
           the desired response has energy in this part of the space, as is well known
           from the theory of Kautz ﬁlters (Kautz, 1954).
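
           The two random baselines described above can be sketched as follows. This
           is an illustration under stated assumptions: p_1 is taken to be the
           probability of a zero entry, the constant c is obtained implicitly by
           rescaling a {0, +1, −1} matrix to the target spectral radius, and the
           function names are hypothetical.

```python
import numpy as np

def spectral_radius_of(W):
    """Largest eigenvalue magnitude of W."""
    return np.max(np.abs(np.linalg.eigvals(W)))

def sparse_random_reservoir(n, p_zero, target_radius, rng):
    """Draw entries from {0, +c, -c} and rescale to a target spectral radius."""
    # Start with c = 1; the rescaling below is equivalent to choosing c.
    W = rng.choice([0.0, 1.0, -1.0], size=(n, n),
                   p=[p_zero, (1 - p_zero) / 2, (1 - p_zero) / 2])
    radius = spectral_radius_of(W)
    if radius == 0:
        raise ValueError("all eigenvalues are zero; draw a new matrix")
    return W * (target_radius / radius)

def uniform_random_reservoir(n, target_radius, rng):
    """Draw entries uniformly from [-1, 1] and rescale (the U-ESN baseline)."""
    W = rng.uniform(-1.0, 1.0, size=(n, n))
    return W * (target_radius / spectral_radius_of(W))

rng = np.random.default_rng(1)
W_sparse = sparse_random_reservoir(30, 0.8, 0.9, rng)
W_uniform = uniform_random_reservoir(30, 0.9, rng)
print(spectral_radius_of(W_sparse), spectral_radius_of(W_uniform))
```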

             4.2 Design of the Adaptive Bias. In conventional ESNs, only the output
           weights are trained, optimizing the projections of the desired response onto
           the basis functions (echo states). Since the dynamical reservoir is fixed,
           the basis functions are only input dependent. However, since function
           approximation is a problem in the joint space of the input and desired
           signals, a penalty in performance will be incurred. From the linearization
           analysis that shows the crucial importance of the operating point of the PE
           nonlinearity in defining the echo state dynamics, we propose to use a single
           external adaptive bias to adjust the effective spectral radius of an ESN.
           Notice that according to the linearization analysis, the bias can only
           reduce the spectral radius. The information for adaptation of the bias is
           the MSE in training, which modulates the spectral radius of the system with
           the information derived from the approximation error. With this simple
           mechanism, some information from the input-output joint space is
           incorporated in the definition of the projection space of the ESN. The
           beauty of this method is that the spectral radius can be adjusted by a
           single parameter that is external to the system without changing reservoir
           weights.

             The training of bias can be easily accomplished. Indeed, since the pa-
           rameter space is only one-dimensional, a simple line search method can be
           efﬁciently employed to optimize the bias. Among different line search al-
           gorithms, we will use a search that uses Fibonacci numbers in the selection
           of points to be evaluated (Wilde, 1964). The Fibonacci search method min-
           imizes the maximum number of evaluations needed to reduce the interval
           of uncertainty to within the prescribed length. In our problem, a bias value
           is picked according to Fibonacci search. For each value of bias, training
           data are applied to the ESN, and the echo states are calculated. Then the
           corresponding optimal output weights and the objective function (MSE)
           are evaluated to pick the next bias value.
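
           The search loop just described can be sketched as below. The routine is a
           textbook Fibonacci search for the minimum of a unimodal function on an
           interval; the objective `stand_in_mse` is a hypothetical placeholder. In
           the ESN setting it would run the reservoir with the candidate bias, solve
           for the optimal output weights, and return the training MSE. The interval
           [0, 5] and the number of steps are also assumptions.

```python
def fibonacci_search(f, a, b, n=30):
    """Minimize a unimodal function f on [a, b] with a Fibonacci search of order n."""
    fib = [1, 1]
    while len(fib) <= n:
        fib.append(fib[-1] + fib[-2])
    # Two interior points placed symmetrically by Fibonacci ratios.
    x1 = a + fib[n - 2] / fib[n] * (b - a)
    x2 = a + fib[n - 1] / fib[n] * (b - a)
    f1, f2 = f(x1), f(x2)
    for k in range(1, n - 1):
        if f1 > f2:
            # Minimum lies in [x1, b]; the old x2 becomes the new x1.
            a = x1
            x1, f1 = x2, f2
            x2 = a + fib[n - k - 1] / fib[n - k] * (b - a)
            f2 = f(x2)
        else:
            # Minimum lies in [a, x2]; the old x1 becomes the new x2.
            b = x2
            x2, f2 = x1, f1
            x1 = a + fib[n - k - 2] / fib[n - k] * (b - a)
            f1 = f(x1)
    return 0.5 * (a + b)

def stand_in_mse(bias):
    """Hypothetical unimodal objective standing in for the training MSE."""
    return (bias - 2.7) ** 2 + 0.01

best_bias = fibonacci_search(stand_in_mse, 0.0, 5.0)
print(best_bias)
```

           Only one new function evaluation is needed per iteration, which matters
           here because each evaluation involves running the network on the training
           data.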
             Alternatively, gradient-based methods can be utilized to optimize the
           bias because of their simplicity and low computational cost. The system
           update equation with an external bias signal, b, is given by

               <<x(n+1)=f(W_in u(n+1)+W_in b+Wx(n))>>.

           The update equation for b is given by

                <<FORMULA>>

             Here, O is the MSE defined previously. This algorithm may suffer from
           problems similar to those observed in gradient-based training of recurrent
           networks. However, we observed that the performance surface is
           rather simple. Moreover, since the search parameter is one-dimensional,
           the gradient vector can assume only one of the two directions. Hence, im-
           precision in the gradient estimation should affect the speed of convergence
           but normally not change the correct gradient direction.
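
           To make the role of the bias concrete, the sketch below iterates the state
           update with an external bias and evaluates the spectral radius of the
           linearization at the resulting operating point. Several choices are
           assumptions made for illustration only: f is taken to be tanh, the
           reservoir is a simple ring (a scaled cyclic shift) rather than the
           ASE-ESN, the input is zero so that the state settles to a fixed point, and
           the function names are hypothetical. The bias update rule itself is not
           shown.

```python
import numpy as np

def run_esn(W, W_in, u, b=0.0):
    """Iterate x(n+1) = tanh(W_in*u(n+1) + W_in*b + W x(n)) from a zero state."""
    x = np.zeros(W.shape[0])
    states = np.zeros((len(u), W.shape[0]))
    for n, u_n in enumerate(u):
        x = np.tanh(W_in * u_n + W_in * b + W @ x)
        states[n] = x
    return states

def linearized_radius(W, x):
    """Spectral radius of the Jacobian diag(1 - x^2) W of the tanh update at a fixed point x."""
    jacobian = np.diag(1.0 - x ** 2) @ W
    return np.max(np.abs(np.linalg.eigvals(jacobian)))

n_states = 10
W = 0.9 * np.roll(np.eye(n_states), 1, axis=0)   # ring reservoir, spectral radius 0.9
rng = np.random.default_rng(0)
W_in = rng.choice([-1.0, 1.0], size=n_states)    # single input, weights +1 or -1
u = np.zeros(200)                                # zero input isolates the effect of b

rho_no_bias = linearized_radius(W, run_esn(W, W_in, u, b=0.0)[-1])
rho_with_bias = linearized_radius(W, run_esn(W, W_in, u, b=1.0)[-1])
print(rho_no_bias, rho_with_bias)
```

           With zero bias the state stays at the origin and the linearization is W
           itself; a nonzero bias moves every PE away from the steepest part of tanh,
           so the effective spectral radius in this toy setting drops below 0.9.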

           5 Experiments

           This section presents a variety of experiments in order to test the validity
           of the ESN design scheme proposed in the previous section.

             5.1 Short-Term Memory Capacity. This experiment compares the short-term
           memory (STM) capacity of ESNs with the same spectral radius using
           the framework presented in Jaeger (2002a). Consider an ESN with a single
           input signal, <<u(n)>>, optimally trained with the desired signal <<u(n−k)>>,
           for a given delay k. Denoting the optimal output signal y_k(n), the k-delay
           STM capacity of a network, MC_k, is defined as the squared correlation
           coefficient between <<u(n−k)>> and <<y_k(n)>> (Jaeger, 2002a). The STM
           capacity, MC, of the network is defined as <<FORMULA>>. STM capacity measures
           how accurately the delayed versions of the input signal are recovered with
           optimally trained output units. Jaeger (2002a) has shown that the memory
           capacity for recalling an independent and identically distributed (i.i.d.)
           input by an N-unit RNN with linear output units is bounded by N.
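
           The k-delay capacities can be estimated as in the sketch below, given a
           matrix of recorded states. It is an illustration under assumptions: the
           output weights are obtained by ordinary least squares (equation 2.4 is not
           reproduced here), a constant term is added to the regressors, the
           capacities are evaluated on the same data used for fitting rather than on
           a separate test signal, and the total is taken as the sum of MC_k over the
           delays considered. The toy "reservoir" is a 5-tap delay line, chosen
           because its capacities are known in advance.

```python
import numpy as np

def memory_capacity(states, u, max_delay, washout=0):
    """Return MC_k for k = 1..max_delay from states of shape (T, N) and input u of shape (T,)."""
    T = len(u)
    start = max(washout, max_delay)          # skip samples without a full delay history
    X = np.hstack([states[start:], np.ones((T - start, 1))])
    mc = np.zeros(max_delay)
    for k in range(1, max_delay + 1):
        target = u[start - k:T - k]          # u(n - k) aligned with the state at time n
        w, *_ = np.linalg.lstsq(X, target, rcond=None)
        y_k = X @ w
        mc[k - 1] = np.corrcoef(target, y_k)[0, 1] ** 2
    return mc

rng = np.random.default_rng(0)
u = rng.uniform(-0.5, 0.5, 2000)             # i.i.d. input on [-0.5, 0.5]
# Toy reservoir: column j holds u delayed by j + 1 samples.
delay_line = np.zeros((len(u), 5))
for j in range(5):
    delay_line[j + 1:, j] = u[:len(u) - j - 1]

mc = memory_capacity(delay_line, u, max_delay=10)
total_mc = mc.sum()
print(np.round(mc, 3), total_mc)
```

           For the delay line, the first five capacities are essentially 1 and the
           rest are near 0, so the total is close to the number of units, in line
           with the bound of N quoted above.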
             We use ESNs with 20 PEs and a single input unit. ESNs are driven
           by an i.i.d. random input signal, <<u(n)>>, that is uniformly distributed over
           [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions
           of the input, <<u(n−1),...,u(n−40)>>. We used four different ESNs: R-ESN,
           U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN
           used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47,
           −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a
           sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of
           U-ESN are uniformly distributed over [−1, 1] and scaled to obtain a spectral
           radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed
           with uniform poles. BASE-ESN has the same recurrent weight matrix as
           ASE-ESN and an adaptive bias at its input. In each ESN, the input weights
           are set to 0.1 or −0.1 with equal probability, and direct connections from the
           input to the output are allowed, whereas W_back is set to 0 (Jaeger, 2002a).
           The echo states are calculated using equation 2.1 for 200 samples of the
           input signal, and the ﬁrst 100 samples corresponding to initial transient
           are eliminated. Then the output weight matrix is calculated using equation
           2.4. For the BASE-ESN, the bias is trained for each task. All networks are
           run with a test input signal, and the corresponding output and MC_k are
           calculated. Figure 6 shows the k-delay STM capacity (averaged over 100
           trials) of each ESN for delays 1,...,40 for the test signal. The STM
           capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70,
           and 16.90, respectively. First, ESNs with uniform pole distribution
           (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the
           randomly generated ESN given in Jaeger (2002a), in spite of all having the
           same spectral radius. In fact, the STM capacity of ASE-ESN is close to the
           theoretical maximum value of N = 20. A closer look at the figure shows that
           R-ESN performs slightly better than ASE-ESN for delays less than 9. In fact,
           for small k, large ASE degrades the performance because the tasks do not
           need long memory depth. However, the drawback of high ASE for small k is
           recovered in BASE-ESN, which reduces the ASE to the appropriate level
           required for the task. Overall, the addition of the bias to the ASE-ESN
           increases the STM capacity from 16.70 to 16.90. U-ESN, on the other hand,
           has only slightly better STM capacity than R-ESN, even though its weights
           take many more distinct values than the three used by R-ESN. It is also
           significant to note that the MC will be very poor for an ESN with smaller
           spectral radius even with an adaptive bias, since the problem requires large
           ASE and bias can only reduce ASE. This experiment demonstrates the

                                       <<FIGURE>>

           Figure 6: The k-delay STM capacity of each ESN for delays 1,...,40 computed
           using the test signal. The results are averaged over 100 different realizations of
           each ESN type with the specifications given in the text for different W and W_in
           matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are
           13.09, 13.55, 16.70, and 16.90, respectively.


           suitability of maximizing ASE in tasks that require a substantial memory
           length.

             5.2 Binary Parity Check. The effect of the adaptive bias was marginal
           in the previous experiment since the nature of the problem required large
           ASE values. However, there are tasks in which the optimal solutions re-
           quire smaller ASE values and smaller spectral radius. Those are the tasks
           where the adaptive bias becomes a crucial design parameter in our design
           methodology.
             Consider an ESN with 100 internal units and a single input unit. The ESN is
           driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal
           is to train an ESN to generate the m-bit parity corresponding to the last m bits
           received, where m is 3,...,8. Similar to the previous experiments, we used
           the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly
           connected ESN where the entries of the W matrix are set to 0, 0.06, −0.06
           with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse
           connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN
           are designed with a spectral radius of 0.9. The input weights are set to 1 or −1
           with equal probability, and direct connections from the input to the output
           are allowed, whereas W_back is set to 0. The echo states are calculated using
           equation 2.1 for 1000 samples of the input signal, and the first 100 samples
           corresponding to the initial transient are eliminated. Then the output weight

                                         <<FIGURE>>

           Figure 7: The number of wrong decisions made by each ESN for m = 3,...,8
           in the binary parity check problem. The results are averaged over 100
           different realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and W_in
           matrices with the specifications given in the text. The total numbers of wrong
           decisions for m = 3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and
           699, respectively.

           matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias
           is trained for each task. The binary decision is made by a threshold detector
           that compares the output of the ESN to 0.5. Figure 7 shows the number of
           wrong decisions (averaged over 100 different realizations) made by each
           ESN for <<m=3,...,8>>.
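
           The target signal and the error count for this task can be sketched as
           follows. The helper names are hypothetical, and the sketch covers only the
           task definition (parity of the last m bits and a threshold detector at
           0.5), not the training of the networks.

```python
import numpy as np

def parity_targets(u, m):
    """Parity (0 or 1) of the last m bits, for every time step with a full window."""
    u = np.asarray(u, dtype=int)
    # Sliding sum over windows of length m; entry i covers u[i], ..., u[i + m - 1].
    window_sums = np.convolve(u, np.ones(m, dtype=int), mode="valid")
    return window_sums % 2

def count_wrong_decisions(y, target, threshold=0.5):
    """Threshold the network output at 0.5 and count disagreements with the target."""
    decisions = (np.asarray(y) > threshold).astype(int)
    return int(np.sum(decisions != np.asarray(target)))

u_demo = np.array([1, 1, 1, 0, 1])
targets_3bit = parity_targets(u_demo, 3)      # windows: 111, 110, 101
errors = count_wrong_decisions([0.9, 0.2, 0.7], targets_3bit)
print(targets_3bit, errors)
```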
             The total numbers of wrong decisions for <<m=3,...,8>> of R-ESN, ASE-
           ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs
           poorly since the nature of the problem requires a short time constant for
           fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the
           R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions.
           BASE-ESN performs a lot better than ASE-ESN and slightly better than
           the R-ESN since the adaptive bias reduces the spectral radius effectively.
           Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN,
           since the task requires access to longer input history, which compromises
           the need for fast response. Indeed, the bias in the BASE-ESN takes effect
           when there are errors (m > 4) and when the task benefits from smaller
           spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and
           2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide
           range of bias values that result in similar MSE values (between 0 and 3). In
           summary, this experiment clearly demonstrates the power of the bias signal
           to conﬁgure the ESN reservoir according to the mapping task.

             5.3 System Identification. This section presents a function approximation
           task where the aim is to identify a nonlinear dynamical system. The
           unknown system is deﬁned by the difference equation

               <<y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n))>>,

           where

                <<f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu)>>.

           The input to the system is chosen to be <<sin(2πn/25)>>.
             We used three different ESNs (R-ESN, ASE-ESN, and BASE-ESN) with
           30 internal units and a single input unit. The W matrix of each ESN is scaled
           such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN
           where the entries of the W matrix are set to 0, 0.35, −0.35 with probabilities 0.8,
           0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with
           equal probability, and direct connections from the input to the output are
           allowed, whereas W_back is set to 0. The optimal output weights are calculated
           using equation 2.4. The MSE values (averaged over 100 realizations) for
           R-ESN and ASE-ESN are 1.23 × 10^−5 and 1.83 × 10^−6, respectively. The
           addition of the adaptive bias to the ASE-ESN reduces the MSE value from
           1.83 × 10^−6 to 3.27 × 10^−9.
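
           For reference, the unknown system of this experiment can be simulated
           directly from the difference equation above. The sketch assumes zero
           initial conditions, y(0) = y(1) = 0, which the text does not state, and
           the function names are hypothetical.

```python
import numpy as np

def f(u):
    """Static nonlinearity applied to the input."""
    return (0.6 * np.sin(np.pi * u)
            + 0.3 * np.sin(3 * np.pi * u)
            + 0.1 * np.sin(5 * np.pi * u))

def simulate_system(n_samples):
    """Generate the input sin(2*pi*n/25) and the response of the difference equation."""
    n = np.arange(n_samples)
    u = np.sin(2 * np.pi * n / 25)
    y = np.zeros(n_samples)          # assumption: y(0) = y(1) = 0
    for k in range(1, n_samples - 1):
        y[k + 1] = 0.3 * y[k] + 0.6 * y[k - 1] + f(u[k])
    return u, y

u, y = simulate_system(500)
print(y[:5])
```

           The pair (u, y) produced this way would serve as the input and desired
           signals when training the output weights of each ESN.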

           6 Discussion

           The great appeal of echo state networks (ESNs) and liquid state machines
           (LSMs) is their ability to construct arbitrary mappings of signals with rich
           and time-varying temporal structures without requiring adaptation of the
           free parameters of the recurrent layer. The echo state condition allows the
           recurrent connections to be fixed with training limited to the linear output
           layer. However, the literature has not explained how to properly choose
           the recurrent parameters for system identification applications. Here, we
           provide an alternate framework that interprets the echo states as a set
           of functional bases formed by ﬁxed nonlinear combinations of the input.
           The linear readout at the output stage simply computes the projection of
           the desired output space onto this representation space. We further in-
           troduce an information-theoretic criterion, ASE, to better understand and
           evaluate the capability of a given ESN to construct such a representation
           layer. The average entropy of the distribution of the echo states quantifies
           the volume spanned by the bases. As such, this volume should be the largest
           to achieve the smallest correlation among the bases and be able to cope with
           arbitrary mappings. However, not all function approximation problems re-
           quire the same memory depth, which is coupled to the spectral radius. The
           effective spectral radius of an ESN can be optimized for the given problem
           with the help of an external bias signal that is adapted using the joint input-
           output space information. The interesting property of this method when
           applied to ESN built from sigmoidal nonlinearities is that it allows the ﬁne
           tuning of the system dynamics for a given problem with a single external
           adaptive bias input and without changing internal system parameters. In
           our opinion, the combination of the largest possible ASE and the adapta-
           tion of the spectral radius by the bias produces the most parsimonious pole
           location of the linearized ESN when no knowledge about the mapping is
           available to optimally locate the basis functionals. Moreover, the bias can be
           easily trained with either a line search method or a gradient-based method
           since it is one-dimensional. We have illustrated experimentally that the de-
           sign of the ESN using the maximization of ASE with the adaptation of the
           spectral radius by the bias has provided consistently better performance
           across tasks that require different memory depths. This means that this
           two-parameter design methodology is preferred to the spectral radius
           criterion proposed by Jaeger, and it is still easily incorporated in the ESN
           design.

             Experiments demonstrate that the ASE for ESN with uniform linearized
           poles is maximized when the spectral radius of the recurrent weight matrix
           approaches one (instability). It is interesting to relate this observation with
           the computational properties found in dynamical systems “at the edge of
           chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchﬁeld, 1993;
           Bertschinger & Natschläger, 2004). Langton stated that when cellular
           automata rules are evolved to perform a complex computation, evolution will
           tend to select rules with “critical” parameter values, which correlate with
           a phase transition between ordered and chaotic regimes. Recently, similar
           conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004).
           Langton’s interpretation of edge of chaos was questioned by Mitchell et al.
           (1993). Here, we provide a system-theoretic view and explain the computa-
           tional behavior with the diversity of dynamics achieved with linearizations
           that have poles close to the unit circle. According to our results, the spectral
           radius of the optimal ESN in function approximation is problem dependent,
           and in general it is impossible to forecast the computational performance
           as the system approaches instability (the spectral radius of the recurrent
           weight matrix approaches one). However, allowing the system to modu-
           late the spectral radius by either the output or internal biasing may allow
           a system close to instability to solve various problems requiring different
           spectral radii.

             Our emphasis here is mostly on ESNs without output feedback connec-
           tions. However, the proposed design methodology can also be applied to
           ESNs with output feedback. Both feedforward and feedback connections
           contribute to specify the bases to create the projection space. At the same
           time, there are applications where the output feedback contributes to the
           system dynamics in a different fashion. For example, it has been shown that
           a ﬁxed weight (fully trained) RNN with output feedback can implement a
           family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992).
           In meta-learning, the role of output feedback in the network is to bias the
           system to different regions of dynamics, providing multiple input-output
           mappings required (Santiago & Lendaris, 2004). However, results could not
           be replicated with ESNs (Prokhorov, 2005). We believe that more work has
           to be done on output feedback in the context of ESNs but also suspect that
           the echo state condition may be a restriction on the system dynamics for
           this type of problem.

             There are many interesting issues to be researched in this exciting new
           area. Besides serving as an evaluation tool, ASE may also be utilized to train
           the ESN's representation layer in an unsupervised fashion. In fact, extra
           weights linking the outputs of the recurrent states can easily be adapted
           with the stochastic information gradient (SIG) described in Erdogmus, Hild,
           and Principe (2003) to maximize output entropy. Output entropy maximization
           is a well-known
           metric to create independent components (Bell & Sejnowski, 1995), and
           here it means that the echo states will become as independent as possible.
           This would circumvent the linearization of the dynamical system to set the
           recurrent weights and would ﬁne-tune continuously in an unsupervised
           manner the parameters of the ESN among different inputs. However, it
           goes against the idea of a ﬁxed ESN reservoir.

             The reservoir of recurrent PEs can be thought of as a new form of a time-
           to-space mapping. Unlike the delay line that forms an embedding (Takens,
           1981), this mapping may have the advantage of filtering noise and producing
           representations with better SNRs to the peaks of the input, which is very
           appealing for signal processing and seems to be used in biology. However,
           further theoretical work is necessary in order to understand the embedding
           capabilities of ESNs. One of the disadvantages of the ESN correlated basis
           is in the design of the readout. Gradient-based algorithms will be very
           slow to converge (due to the large eigenvalue spread of modes), and even
           if recursive methods are used, their stability may be compromised by the
           condition number of the matrix. However, our recent results incorporating
           an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of
           solving this problem.

             Finally, we would like to briefly comment on the implications of these
           models to neurobiology and computational neuroscience. The work by
           Pouget and Sejnowski (1997) has shown that the available physiological
           data are consistent with the hypothesis that the response of a single neuron
           in the parietal cortex serves as a basis function generated by the sensory
           input in a nonlinear fashion. In other words, the neurons transform the
           sensory input into a format (representation space) such that the subsequent
           computation is simpliﬁed. Then, whenever a motor command (output of
           the biological system) needs to be generated, this simple computation to
           read out the neuronal activity is done. There is an intriguing similarity
           between the interpretation of the neuronal activity by Pouget and Sejnowski
           and our interpretation of echo states in ESN. We believe that similar ideas
           can be applied to improve the design of microcircuit implementations of
           LSMs. First, the framework of functional space interpretation (bases and
           projections) is also applicable to microcircuits. Second, the ASE measure
           may be directly utilized for LSM states because the states are normally low-
           pass-ﬁltered before the readout. However, the control of ASE by changing
           the liquid dynamics is unclear. Perhaps global control of thresholds or bias
           current will be able to accomplish bias control as in ESN with sigmoid
           PEs.


           Acknowledgments

           This work was partially supported by NSF ECS-0422718, NSF CNS-0540304,
           and ONR N00014-1-1-0405.


           References

           Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York:
             Springer.
           Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features,
             categorical perception, and probability learning: Some applications of a neural
             model. Psychological Review, 84, 413–451.
           Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach
             to blind separation and blind deconvolution. Neural Computation, 7(6),
             1129–1159.
           Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of
             chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
           Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American
             Journal of Physics, 14(1), 1–13.
           de Vries, B. (1991). Temporal processing with neural networks—the development of the
             gamma model. Unpublished doctoral dissertation, University of Florida.
           Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural
             network for system identification and control. IEEE Proceedings of Control Theory
             and Applications, 142(4), 307–314.
           Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
           Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation:
             Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
           Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for
             adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
           Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced
             multistream Kalman filter training for recurrent networks. In J. Suykens &
             J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques
             (pp. 29–53). Dordrecht,
             Netherlands: Kluwer.


           Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper
             Saddle River, NJ: Prentice Hall.
           Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice
             Hall.
           Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural
             Computation, 9(8), 1735–1780.
           Hopfield, J. (1984). Neurons with graded response have collective computational
             properties like those of two-state neurons. Proceedings of the National Academy of
             Sciences, 81, 3088–3092.
           Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer
             Mathematics, 5(1), 189–203.
           Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural
             networks (Tech. Rep. No. 148). Bremen: German National Research Center for
             Information Technology.
           Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152).
             Bremen: German National Research Center for Information Technology.
           Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL,
             EKF and the “echo state network” approach (Tech. Rep. No. 159). Bremen: German
             National Research Center for Information Technology.
           Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
             and saving energy in wireless communication. Science, 304(5667), 78–80.
           Jeffreys, H. (1946). An invariant form for the prior probability in estimation
             problems. Proceedings of the Royal Society of London, A 196, 453–461.
           Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
           Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit
             Theory, 1(3), 29–39.
           Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks
             for adaptive communication channel equalization. IEEE Transactions on Neural
             Networks, 5(2), 267–278.
           Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks.
             IEEE Transactions on Neural Networks, 6(5), 1000–1004.
           Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation
             theory (2nd ed.). New York: Springer-Verlag.
           Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
           Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the
             computational power and generalization capability of neural microcircuits. In
             L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing
             systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press.
           Maass, W., Natschlager, T., & Markram, H. (2002). Real-time computing without¨
             stable states: A new framework for neural computation based on perturbations.
             Neural Computation, 14(11), 2531–2560.
           Mitchell, M., Hraber, P., & Crutchﬁeld, J. (1993). Revisiting the edge of chaos:
             Evolving cellular automata to perform computations.Complex Systems, 7, 89–
             130.
           Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J.
             Mandell, & M. F. Shlesinger (Eds.),Dynamic patterns in complex systems(pp. 293–
             301). Singapore: World Scientiﬁc.           Analysis and Design of Echo State Networks 137


           Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex
             using basis functions.Journal of Cognitive Neuroscience, 9(2), 222–237.
           Principe, J. (2001). Dynamic neural networks and optimal signal processing. In
             Y. Hu & J. Hwang (Eds.),Neural networks for signal processing(Vol. 6-1, pp. 6–
             28). Boca Raton, FL: CRC Press.
           Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma ﬁlter—a new
             class of adaptive IIR ﬁlters with restricted feedback.IEEE Transactions on Signal
             Processing, 41(2), 649–656.
           Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin
             (Ed.),Unsupervised adaptive ﬁltering(pp. 265–319). Hoboken, NJ: Wiley.
           Prokhorov, D. (2005). Echo state networks: Appeal and challenges. InProc. of Inter-
             national Joint Conference on Neural Networks(pp. 1463–1466). Montreal, Canada.
           Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with ﬁxed
             weights in recurrent neural networks: An overview. InProc. of International Joint
             Conference on Neural Networks(pp. 2018–2022). Honolulu, Hawaii.
           Puskorius,G.V.,&Feldkamp,L.A.(1994).Neurocontrolofnonlineardynamicalsys-
             tems with Kalman ﬁlter trained recurrent networks.IEEE Transactions on Neural
             Networks, 5(2), 279–297.
           Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods ap-
             plied to on-vehicle idle speed control.Proceedings of IEEE, 84(10), 1407–1420.
           Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev,
             M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with
             echo state networks. In2005 IEEE International Conference on Acoustics, Speech, and
             Signal Processing. Philadelphia.
           Renyi, A. (1970).Probability theory. New York: Elsevier.
           Sanchez, J. C. (2004).From cortical neural spike trains to behavior: Modeling and analysis.
             Unpublished doctoral dissertation, University of Florida.
           Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction net-
             works: Reformulating ﬁxed weight neural networks. InProc. of International Joint
             Conference on Neural Networks(pp. 189–194). Budapest, Hungary.
           Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in
             multilayer perceptrons.IEEE Transactions on Neural Networks, 10(1), 10–18.
           Shannon,C.E.(1948).Amathematicaltheoryofcommunication.BellSystemTechnical
             Journal, 27, 623–656.
           Siegelmann, H. T. (1993).Foundations of recurrent neural networks. Unpublished doc-
             toral dissertation, Rutgers University.
           Siegelmann,H.T.,&Sontag,E.(1991).Turingcomputabilitywithneuralnets.Applied
             Mathematics Letters, 4(6), 77–80.
           Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
             Kalman algorithm. In D. S. Touretzky (Ed.),Advances in neural information process-
             ing systems, 1(pp. 133–140). San Mateo, CA: Morgan Kaufmann.
           Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S.
             Young (Eds.),Dynamical systems and turbulence(pp. 366–381). Berlin: Springer.
           Thogula, R. (2003).Information theoretic self-organization of multiple agents.Unpub-
             lished master’s thesis, University of Florida.
           Werbos, P. (1990). Backpropagation through time: What it does and how to do it.
             Proceedings of IEEE, 78(10), 1550–1560.           138 M. Ozturk, D. Xu, and J. Pr´ıncipe


           Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evalua-
             tion. In D. White & D. Sofge (Eds.),Handbook of intelligent control(pp. 65–89). New
             York: Van Nostrand Reinhold.
           Wilde, D. J. (1964).Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
           Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running
             fully recurrent neural networks.Neural Computation, 1, 270–280.
<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>
                         Bayesian Compression for Deep Learning

                        Christos Louizos, University of Amsterdam and TNO Intelligent Imaging, c.louizos@uva.nl
                        Karen Ullrich, University of Amsterdam, k.ullrich@uva.nl
                        Max Welling, University of Amsterdam and CIFAR, m.welling@uva.nl


                                             Abstract

                       Compression and computational efﬁciency in deep learning have become a problem
                       of great signiﬁcance. In this work, we argue that the most principled and effective
                       way to attack this problem is by adopting a Bayesian point of view, where through
                       sparsity inducing priors we prune large parts of the network. We introduce two
                       novelties in this paper: 1) we use hierarchical priors to prune nodes instead of
                       individual weights, and 2) we use the posterior uncertainties to determine the
                       optimal ﬁxed point precision to encode the weights. Both factors signiﬁcantly
                       contribute to achieving the state of the art in terms of compression rates, while
                       still staying competitive with methods designed to optimize for speed or energy
                       efﬁciency.


                 1 Introduction

                 While deep neural networks have become extremely successful in a wide range of applications,
                 often exceeding human performance, they remain difficult to apply in many real-world scenarios. For
                 instance, making billions of predictions per day comes with substantial energy costs given the energy
                 consumption of common Graphics Processing Units (GPUs). Also, real-time predictions are often
                 about a factor of 100 away in terms of speed from what deep NNs can deliver, and sending NNs with
                 millions of parameters through band-limited channels is still impractical. As a result, running them on
                 hardware-limited devices such as smartphones, robots or cars requires substantial improvements on
                 all of these issues. For all those reasons, compression and efficiency have become a topic of interest
                 in the deep learning community.
                 While all of these issues are certainly related, compression and performance optimizing procedures
                 might not always be aligned. As an illustration, consider the convolutional layers of AlexNet, which
                 account for only 4% of the parameters but 91% of the computation [68]. Compressing these layers
                 will not contribute much to the overall memory footprint.
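
                 To make this mismatch concrete, the following back-of-the-envelope sketch takes the 4% and 91%
                 shares quoted above as given. The 10x compression factor is a hypothetical value, and the
                 calculation assumes (optimistically) that computation shrinks in proportion to the layer size.

```python
def remaining_fraction(share, compression_factor):
    """Fraction of a total budget left after shrinking one component.

    share: fraction of the total owned by the compressed component.
    compression_factor: how many times smaller that component becomes.
    """
    return (1.0 - share) + share / compression_factor


# Shares quoted in the text for the convolutional layers of AlexNet.
CONV_PARAM_SHARE = 0.04
CONV_COMPUTE_SHARE = 0.91
FACTOR = 10.0  # hypothetical compression factor, for illustration only

params_left = remaining_fraction(CONV_PARAM_SHARE, FACTOR)
compute_left = remaining_fraction(CONV_COMPUTE_SHARE, FACTOR)
print(f"parameters remaining:  {params_left:.3f}")
print(f"computation remaining: {compute_left:.3f}")
```

                 Under these assumptions the parameter count barely moves while the computation drops sharply,
                 which is why the two objectives have to be considered separately.
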
                 There is a variety of approaches to address these problem settings. However, most methods share
                 the common strategy of reducing both the neural network structure and the effective fixed point
                 precision for each weight. A justification for the former is the finding that NNs suffer from significant
                 parameter redundancy [14]. Methods in this line of thought are network pruning, where unnecessary
                 connections are removed [40, 24, 21], and student-teacher learning, where a large network is
                 used to train a significantly smaller network [5, 27].
                 From a Bayesian perspective, network pruning and reducing bit precision for the weights are aligned
                 with achieving high accuracy, because Bayesian methods search for the optimal model structure
                 (which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters
                 through the bits-back argument [28] (which leads to removing insignificant bits). This relation is
                 made explicit in the MDL principle [20], which is known to be related to Bayesian inference.

                 In this paper we will use the variational Bayesian approximation for Bayesian inference which has
                 also been explicitly interpreted in terms of model compression [28]. By employing sparsity inducing
                 priors for hidden units (and not individual weights) we can prune neurons including all their ingoing
                 and outgoing weights. This avoids more complicated and inefﬁcient coding schemes needed for
                 pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the
                 variational posterior uncertainty to assess which bits are signiﬁcant and remove the ones which
                 ﬂuctuate too much under approximate posterior sampling. From this we derive the optimal ﬁxed
                 point precision per layer, which is still practical on chip.

                 2 Variational Bayes and Minimum Description Length

                 A fundamental idea in information theory is the minimum description length (MDL) principle [20].
                 It relates to compression directly in that it defines the best hypothesis to be the one that communicates
                 the sum of the model (complexity cost <<L^C>>) and the data misfit (error cost <<L^E>>) with the minimum
                 number of bits [59, 60]. It is well understood that variational inference can be reinterpreted from an
                 MDL point of view [56, 72, 28, 30, 19]. More specifically, assume that we are presented with a dataset
                 <<D>> that consists of N input-output pairs <<\{(x_n, y_n)\}_{n=1}^N>>. Let <<p(D|w) = \prod_{n=1}^N p(y_n | x_n, w)>>
                 be a parametric model, e.g. a deep neural network, that maps inputs x to their corresponding outputs
                 y using parameters w governed by a prior distribution <<p(w)>>. In this scenario, we wish to approximate
                 the intractable posterior distribution <<p(w|D) = p(D|w)p(w)/p(D)>> with a fixed-form approximate
                 posterior <<q_\phi(w)>> by optimizing the variational parameters <<\phi>> according to:

                             <<FORMULA>> 

                 where <<H(\cdot)>> denotes the entropy and <<L(\phi)>> is known as the evidence lower bound (ELBO) or negative
                 variational free energy. As indicated in eq. 1, <<L(\phi)>> naturally decomposes into a minimum cost for
                 communicating the targets <<\{y_n\}_{n=1}^N>> under the assumption that the sender and receiver agreed on a
                 prior <<p(w)>> and that the receiver knows the inputs <<\{x_n\}_{n=1}^N>> and the form of the parametric model.
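
                 For reference, one standard way of writing the decomposition described above is sketched below,
                 with the variational parameters written as \phi. The grouping into an error cost and a complexity
                 cost follows the description in the text; the exact notation is an assumption.

```latex
\mathcal{L}(\phi)
  = \underbrace{\mathbb{E}_{q_\phi(w)}\!\left[\log p(\mathcal{D} \mid w)\right]}_{\mathcal{L}^{E}}
  + \underbrace{\mathbb{E}_{q_\phi(w)}\!\left[\log p(w)\right]
      + \mathcal{H}\!\left(q_\phi(w)\right)}_{\mathcal{L}^{C}}
  = \mathbb{E}_{q_\phi(w)}\!\left[\log p(\mathcal{D} \mid w)\right]
  - \mathrm{KL}\!\left(q_\phi(w) \,\|\, p(w)\right)
```

                 The second equality uses the identity KL(q || p) = -E_q[log p] - H(q), so the complexity cost is
                 the negative KL-divergence from the prior to the approximate posterior.
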
                 By using sparsity inducing priors for groups of weights that feed into a neuron, the Bayesian mechanism
                 will start pruning hidden units that are not strictly necessary for prediction, thus achieving
                 compression. But there is also a second mechanism by which Bayes can help us compress. By
                 explicitly entertaining noisy weight encodings through <<q_\phi(w)>> we can benefit from the bits-back
                 argument [28, 30] due to the entropy term; this is in contrast to infinitely precise weights that lead to
                 <<H(\delta(w)) = -\infty>> (footnote 2). Nevertheless, in practice the data misfit term <<L^E>> is intractable for neural network
                 models under a noisy weight encoding, so Monte Carlo integration is usually employed as a solution.
                 Continuous <<q_\phi(w)>> allow for the reparametrization trick [36, 58]. Here, we replace sampling from
                 <<q_\phi(w)>> by a deterministic function of the variational parameters <<\phi>> and random samples from some
                 noise variables <<\epsilon>>:

                            <<FORMULA>>;        (2)

                 where <<w = f(\phi, \epsilon)>>. By applying this trick, we obtain unbiased stochastic gradients of the ELBO
                 with respect to the variational parameters <<\phi>>, thus resulting in a standard optimization problem that is
                 fit for stochastic gradient ascent. The efficiency of the gradient estimator resulting from eq. 2 can be
                 further improved for neural networks by utilizing local reparametrizations [37] (which we will use in
                 our experiments); they provide variance reduction in an efficient way by locally marginalizing the
                 weights at each layer and instead sampling the distribution of the pre-activations.
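
                 The reparametrization trick can be illustrated with a minimal sketch. It assumes a single Gaussian
                 weight w ~ N(mu, sigma^2) and a toy objective E[w^2]; both are hypothetical choices made so that
                 the Monte Carlo gradients can be checked against the closed form.

```python
import random


def reparam_gradients(mu, sigma, n_samples=20000, seed=0):
    """Estimate d/dmu and d/dsigma of E[w^2] for w ~ N(mu, sigma^2).

    Sampling is rewritten as w = f(phi, eps) = mu + sigma * eps with
    eps ~ N(0, 1), so the chain rule can be applied to every sample.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # noise, independent of (mu, sigma)
        w = mu + sigma * eps           # deterministic given phi and eps
        dloss_dw = 2.0 * w             # derivative of the toy objective w^2
        grad_mu += dloss_dw * 1.0      # dw/dmu = 1
        grad_sigma += dloss_dw * eps   # dw/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Analytically E[w^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

                 The estimates should land close to the exact values 3.0 and 1.0; the remaining error is Monte Carlo
                 noise that shrinks with the number of samples.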

                 Footnote 2: In practice this term (the entropy of infinitely precise weights) is a large constant determined by the weight precision.

                 3 Related Work

                 One of the earliest ideas and most direct approaches to tackle efficiency is pruning. Originally
                 introduced by [40], pruning has recently been demonstrated to be applicable to modern architectures
                 [25, 21]. It has been demonstrated that up to 99.5% of the parameters
                 can be pruned in common architectures. There have been quite a few encouraging results obtained
                 by (empirical) Bayesian approaches that employ weight pruning [19, 7, 52, 70, 51]. Nevertheless,
                 weight pruning is in general inefficient for compression since the matrix format of the weights is not
                 taken into consideration; therefore the Compressed Sparse Column (CSC) format has to be employed.
                 Moreover, note that in conventional CNNs most FLOPs are used by the convolution operation. Inspired
                 by this observation, several authors proposed pruning schemes that take these considerations into
                 account [73, 74] or even go as far as designing efficiency-aware architectures to begin with [32, 15, 31]. From
                 the Bayesian viewpoint, similar pruning schemes have been explored in [47, 53, 39, 34].
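
                 The overhead of unstructured weight pruning can be seen in a small sketch of the CSC layout. The
                 matrices are hypothetical, and counting every stored number as one entry of equal size is a
                 simplifying assumption.

```python
def to_csc(dense):
    """Convert a dense row-major matrix (list of lists) to CSC arrays."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    values, row_indices, col_pointers = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_indices.append(i)
        col_pointers.append(len(values))  # where the next column starts
    return values, row_indices, col_pointers


def storage_counts(dense):
    """Return (dense entry count, CSC entry count including index overhead)."""
    values, row_indices, col_pointers = to_csc(dense)
    dense_entries = len(dense) * len(dense[0])
    csc_entries = len(values) + len(row_indices) + len(col_pointers)
    return dense_entries, csc_entries


# Unstructured pruning: half of the weights are zero, scattered over the matrix.
scattered = [[1, 0, 2, 0],
             [0, 3, 0, 4],
             [5, 0, 6, 0],
             [0, 7, 0, 8]]
# Group pruning: the same number of surviving weights, but whole rows are
# removed, so what remains is a smaller dense matrix without index arrays.
group_pruned = [[1, 3, 2, 4],
                [5, 7, 6, 8]]

print(storage_counts(scattered))
print(len(group_pruned) * len(group_pruned[0]))
```

                 In this toy case the CSC representation of the half-empty matrix needs more entries than the dense
                 original, while removing whole rows halves the storage directly.
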
                 Given an optimal architecture, NNs can further be compressed by quantization. More precisely, there
                 are two common techniques. First, the set of accessible weights can be reduced drastically. As an
                 extreme example, [13, 48, 57, 76] and [11] trained NNs to use only binary or ternary weights with
                 floating point gradients. This approach, however, needs significantly more parameters than
                 its ordinary counterparts. Work by [18] explores various techniques beyond binary quantization:
                 k-means quantization, product quantization and residual quantization. Later studies extend this set to
                 optimal fixed point [44] and hashing quantization [10]. [25] apply k-means clustering and subsequent
                 center training. From a practical point of view, however, all of these are fairly impractical at
                 test time. For the computation of each feature map in a net, the original weight matrix must be
                 reconstructed from the indexes in the matrix and a codebook that contains all the original weights.
                 This is an expensive operation, which is why some studies propose a different approach than set
                 quantization. Precision quantization simply reduces the bit size per weight. This has a great advantage
                 over set quantization at inference time since feature maps can simply be computed with lower-precision
                 weights. Several studies show that this has little to no effect on network accuracy when using 16-bit
                 weights [49, 22, 12, 71, 9]. Somewhat orthogonal to the above discussion but certainly relevant are
                 approaches that customize the implementation of CNNs for hardware-limited devices [31, 4, 62].
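
                 The difference between the two techniques can be sketched as follows. The weights, the 3-entry
                 codebook and the number of fractional bits are hypothetical values chosen for illustration.

```python
def set_quantize(weights, codebook):
    """Map each weight to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(w - codebook[k]))
            for w in weights]


def reconstruct(indices, codebook):
    """Look the weights up again; required before set-quantized weights are usable."""
    return [codebook[k] for k in indices]


def precision_quantize(weights, frac_bits):
    """Round each weight to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    return [round(w * scale) / scale for w in weights]


weights = [0.12, -0.51, 0.49, 0.03]
codebook = [-0.5, 0.0, 0.5]  # hypothetical codebook

indices = set_quantize(weights, codebook)
print(indices, reconstruct(indices, codebook))
print(precision_quantize(weights, frac_bits=3))
```

                 Set quantization stores indices and needs the lookup in `reconstruct` before use, whereas the
                 precision-quantized weights can be used directly in lower-precision arithmetic.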


                 4 Bayesian compression with scale mixtures of normals


                 Consider the following prior over a parameter w where its scale z is governed by a distribution <<p(z)>>:


                                       <<z \sim p(z); \quad w \sim \mathcal{N}(w; 0, z^2)>>;                    (3)


                 with <<z^2>> serving as the variance of the zero-mean normal distribution over w. By treating the scales
                 of w as random variables we can recover marginal prior distributions over the parameters that have
                 heavier tails and more mass at zero; this subsequently biases the posterior distribution over w to
                 be sparse. This family of distributions is known as scale-mixtures of normals [6, 2] and it is quite
                 general, as a lot of well-known sparsity inducing distributions are special cases.
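
                 The effect of a random scale can be checked empirically. The sketch below assumes a half-Cauchy
                 mixing density with unit scale (one of the choices discussed later in this section) and compares
                 the resulting samples with a standard normal; the cut-offs 0.1 and 3.0 are arbitrary.

```python
import math
import random


def sample_scale_mixture(n, seed=0):
    """Draw w ~ N(0, z^2) with z ~ half-Cauchy(0, 1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # |tan(pi * (u - 0.5))| with u ~ U(0, 1) is a standard half-Cauchy draw.
        z = abs(math.tan(math.pi * (rng.random() - 0.5)))
        samples.append(rng.gauss(0.0, 1.0) * z)
    return samples


def sample_normal(n, seed=1):
    """Draw w ~ N(0, 1) as a fixed-scale baseline."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def mass_near_zero(samples, width=0.1):
    return sum(1 for w in samples if abs(w) < width) / len(samples)


def mass_in_tail(samples, cutoff=3.0):
    return sum(1 for w in samples if abs(w) > cutoff) / len(samples)


mixture = sample_scale_mixture(50000)
normal = sample_normal(50000)
print(mass_near_zero(mixture), mass_near_zero(normal))
print(mass_in_tail(mixture), mass_in_tail(normal))
```

                 The mixture places more samples both very close to zero and far out in the tails than the normal
                 with a fixed scale, which is the behaviour that encourages sparse posteriors.
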
                 One example of the aforementioned framework is the spike-and-slab distribution [50], the gold
                 standard for sparse Bayesian inference. Under the spike-and-slab, the mixing density of the scales is a
                 Bernoulli distribution, thus the marginal <<p(w)>> has a delta “spike” at zero and a continuous “slab” over
                 the real line. Unfortunately, this prior leads to computationally expensive inference since we have
                 to explore a space of <<2^M>> models, where M is the number of model parameters. Dropout [29, 67],
                 one of the most popular regularization techniques for neural networks, can be interpreted as positing a
                 spike-and-slab distribution over the weights where the variance of the “slab” is zero [17, 45]. Another
                 example is the Laplace distribution, which arises by considering <<p(z^2) = \mathrm{Exp}(\lambda)>>. The mode of
                 the posterior distribution under a Laplace prior is known as the Lasso [69] estimator and has been
                 previously used for sparsifying neural networks in [73, 61]. While computationally simple, the
                 Lasso estimator is prone to “shrinking” large signals [8] and only provides point estimates of
                 the parameters. As a result it does not provide uncertainty estimates, it can potentially overfit and,
                 according to the bits-back argument, is inefficient for compression.
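
                 The shrinkage of large signals can be seen in the simplest possible setting. The sketch below
                 assumes a single scalar observation x with a squared-error fit, for which the Lasso estimate is the
                 soft-thresholding operator; the penalty weight 1.0 is a hypothetical value.

```python
def soft_threshold(x, lam):
    """Scalar Lasso solution: argmin over w of 0.5 * (w - x)**2 + lam * abs(w)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


for signal in (0.3, 2.0, 10.0):
    print(signal, soft_threshold(signal, lam=1.0))
```

                 Small signals are set exactly to zero, but every surviving signal, however large, is pulled towards
                 zero by the full penalty weight.
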
                 For these reasons, in this paper we will tackle the problem of compression and efficiency in neural
                 networks by adopting a Bayesian treatment and inferring an approximate posterior distribution over
                 the parameters under a scale mixture prior. We will consider two choices for the prior over the scales
                 <<p(z)>>: the hyperparameter-free log-uniform prior [16, 37] and the half-Cauchy prior, which results in
                 a horseshoe [8] distribution. Both of these distributions correspond to a continuous relaxation of the
                 spike-and-slab prior and we provide a brief discussion of their shrinkage properties in Appendix C.

                 4.1 Reparametrizing variational dropout for group sparsity

                 One potential choice for <<p(z)>> is the improper log-uniform prior [37], <<p(z) \propto |z|^{-1}>>. It turns out that
                 we can recover the log-uniform prior over the weights w if we marginalize over the scales z:

                                              <<p(w) \propto \int \frac{1}{|z|} \mathcal{N}(w | 0, z^2) \, dz = \frac{1}{|w|}>>                (4)
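
                 The marginalization can be verified with a short calculation. The sketch below integrates over
                 z > 0 only (the negative half contributes the same amount by symmetry) and uses the substitution
                 u = w^2 / (2 z^2); it is a worked check rather than the derivation given in the cited references.

```latex
\int_{0}^{\infty} \frac{1}{z}\,\mathcal{N}(w \mid 0, z^{2})\,dz
  = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}\,z^{2}}
      \exp\!\left(-\frac{w^{2}}{2z^{2}}\right) dz
  = \frac{1}{2\sqrt{\pi}\,|w|} \int_{0}^{\infty} u^{-1/2} e^{-u}\,du
  = \frac{\Gamma(1/2)}{2\sqrt{\pi}\,|w|}
  = \frac{1}{2|w|}
  \;\propto\; \frac{1}{|w|}
```

                 Adding the equal contribution from z < 0 gives exactly 1/|w|, i.e. a log-uniform density over w.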
                 
                 This alternative parametrization of the log-uniform prior is known in the statistics literature as the
                 normal-Jeffreys prior and has been introduced by [16]. This formulation allows us to “couple” the
                 scales of weights that belong to the same group (e.g. neuron or feature map), by simply sharing the
                 corresponding scale variable z in the joint prior (footnote 3):

                                              <<p(W, z) \propto \prod_{i}^{A} \frac{1}{|z_i|} \prod_{i,j}^{A,B} \mathcal{N}(w_{ij} | 0, z_i^2)>>;                  (5)
   
                 where W is the weight matrix of a fully connected neural network layer with A being the dimen-
                 sionality of the input and B the dimensionality of the output. Now consider performing variational
                 inference with a joint approximate posterior parametrized as follows:

                                             <<FORMULA>>;                  (6) 
                                       
                 where <<\alpha_i>> is the dropout rate [67, 37, 51] of the given group. As explained in [37, 51], the multiplicative
                 parametrization of the approximate posterior over z suffers from high-variance gradients; therefore
                 we will follow [51] and re-parametrize it in terms of <<\sigma^2_{z_i} = \mu^2_{z_i} \alpha_i>>, hence optimize w.r.t. <<\sigma^2_{z_i}>>.
                 The lower bound under this prior and approximate posterior becomes:

                                              <<FORMULA>>                    (7)

                 Under this particular variational posterior parametrization the negative KL-divergence from the
                 conditional prior <<p(W|z)>> to the approximate posterior <<q_\phi(W|z)>> is independent of z:

                                                                        <<FORMULA>>       (8)

                 This independence can be better understood if we consider a non-centered parametrization of the
                 prior [55]. More specifically, consider reparametrizing the weights as <<\tilde{w}_{ij} = w_{ij} / z_i>>; this will then result
                 in <<p(W|z)p(z) = p(\tilde{W})p(z)>>, where <<p(\tilde{W}) = \prod_{i,j} \mathcal{N}(\tilde{w}_{ij} | 0, 1)>> and <<W = \mathrm{diag}(z)\tilde{W}>>. Now if
                 we perform variational inference under the <<p(\tilde{W})p(z)>> prior with an approximate posterior that has
                 the form of <<q_\phi(\tilde{W}, z) = q_\phi(\tilde{W}) q_\phi(z)>>, with <<q_\phi(\tilde{W}) = \prod_{i,j} \mathcal{N}(\tilde{w}_{ij} | \mu_{ij}, \sigma^2_{ij})>>, then we see that we
                 arrive at the same expressions for the negative KL-divergence from the prior to the approximate
                 posterior. Finally, the negative KL-divergence from the normal-Jeffreys scale prior <<p(z)>> to the
                 Gaussian variational posterior <<q_\phi(z)>> depends only on the “implied” dropout rate, <<\alpha_i = \sigma^2_{z_i} / \mu^2_{z_i}>>, and
                 takes the following form [51]:

                                               <<FORMULA>>;                  (9)
                                          
                 where <<FORMULA>> are the sigmoid and softplus functions respectively (footnote 4) and <<k_1 = 0.63576>>,
                 <<k_2 = 1.87320>>, <<k_3 = 1.48695>>. We can now prune entire groups of parameters by simply specifying a
                 threshold for the variational dropout rate of the corresponding group, e.g. <<FORMULA>>. It should be
                 mentioned that this prior parametrization readily allows for a more flexible marginal posterior
                 over the weights as we now have a compound distribution, <<FORMULA>>; this
                 is in contrast to the original parametrization and the Gaussian approximations employed by [37, 51].
                 Furthermore, this approach generalizes the low-variance additive parametrization of variational
                 dropout proposed for weight sparsity in [51] to group sparsity (which was left as an open question
                 in [51]) in a principled way.

                 Footnote 3: Strictly speaking, the result of eq. 4 only holds when each weight has its own scale and not when that scale is
                 shared across multiple weights. Nevertheless, in practice we obtain a prior that behaves in a similar way, i.e. it
                 biases the variational posterior to be sparse.

                                                <<FORMULA>>
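
                 The approximation of the negative KL-divergence for the scales can be sketched in code. Only the
                 three constants are taken from the text; the functional form below is the approximation commonly
                 used in the sparse variational dropout literature and is assumed here to correspond to eq. 9. The
                 function names are hypothetical.

```python
import math

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # constants quoted in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


def log_alpha(mu_z, sigma2_z):
    """Implied log dropout rate of one group, log(sigma_z^2 / mu_z^2)."""
    return math.log(sigma2_z) - math.log(mu_z ** 2)


def neg_kl_approx(log_a):
    """Assumed approximation of the negative KL term for one group."""
    return K1 * sigmoid(K2 + K3 * log_a) - 0.5 * softplus(-log_a) - K1


for mu_z, sigma2_z in ((1.0, 0.01), (1.0, 1.0), (0.01, 1.0)):
    la = log_alpha(mu_z, sigma2_z)
    print(round(la, 3), round(neg_kl_approx(la), 4))
```

                 Under this form the term is always non-positive and approaches zero as the implied dropout rate
                 grows, so the prior rewards groups whose scale is dominated by noise.
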
                 At test time, in order to have a single feedforward pass, we replace the distribution over W at each
                 layer with a single weight matrix, the masked variational posterior mean:

                                                 <<FORMULA>>;                         (10)

                 where m is a binary mask determined according to the group variational dropout rate and <<M_W>> are
                 the means of <<q_\phi(\tilde{W})>>. We further use the variational posterior marginal variances (footnote 5) for this particular
                 posterior approximation:

                                                <<FORMULA>>;                           (11)

                 to assess the bit precision of each weight in the weight matrix. More specifically, we employed the
                 mean variance across the weight matrix <<\hat{W}>> to compute the unit round-off necessary to represent the
                 weights. This method will give us the number of significant bits, and by adding 3 exponent bits and 1 sign
                 bit we arrive at the final bit precision for the entire weight matrix <<\hat{W}>> (footnote 6). We provide more details in
                 Appendix B.
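
                 Group pruning and the masked posterior mean can be sketched as follows. The sketch assumes that
                 a group is pruned when log(sigma_z^2 / mu_z^2) reaches a threshold and that the test-time matrix
                 scales row i of the mean weights by mask[i] * mu_z[i]; the threshold, the posterior values and the
                 function names are hypothetical.

```python
import math


def prune_mask(mu_z, sigma2_z, threshold):
    """Return 1 for groups to keep and 0 for groups whose log dropout rate is too high."""
    mask = []
    for mu, s2 in zip(mu_z, sigma2_z):
        log_alpha = math.log(s2) - math.log(mu ** 2)
        mask.append(0 if log_alpha >= threshold else 1)
    return mask


def masked_mean_weights(mask, mu_z, mean_w):
    """Scale row i of the mean weight matrix by mask[i] * mu_z[i]."""
    return [[m * mu * w for w in row] for m, mu, row in zip(mask, mu_z, mean_w)]


mu_z = [1.0, 0.01, 0.8]           # posterior means of the group scales
sigma2_z = [0.01, 0.5, 0.02]      # posterior variances of the group scales
mean_w = [[0.5, -0.2], [0.3, 0.1], [-1.0, 2.0]]

mask = prune_mask(mu_z, sigma2_z, threshold=3.0)  # hypothetical threshold
print(mask)
print(masked_mean_weights(mask, mu_z, mean_w))
```

                 The second group has a scale that is almost pure noise, so its whole row is zeroed and the
                 corresponding neuron can be dropped from the architecture.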

                 4.2 Group horseshoe with half-Cauchy scale priors

                 Another choice for p(z) is a proper half-Cauchy distribution: <<FORMULA>>; it
                 induces a horseshoe prior [8] distribution over the weights, which is a well known sparsity inducing
                 prior in the statistics literature. More formally, the prior hierarchy over the weights is expressed as
                 (in a non-centered parametrization):

                                                  <<FORMULA>>;                           (12)

                 where <<\tau_0>> is the free parameter that can be tuned for specific desiderata. The idea behind the horseshoe
                 is that of “global-local” shrinkage: the global scale variable s pulls all of the variables towards
                 zero, whereas the heavy-tailed local variables <<z_i>> can compensate and allow some weights to escape.
                 Instead of directly working with the half-Cauchy priors, we will employ a decomposition of the
                 half-Cauchy that relies upon (inverse) gamma distributions [54], as this will allow us to compute
                 the negative KL-divergence from the scale prior <<p(z)>> to an approximate log-normal scale posterior
                 <<q_\phi(z)>> in closed form (the derivation is given in Appendix D). More specifically, we have that the
                 half-Cauchy prior can be expressed in a non-centered parametrization as:

                                                    <<FORMULA>>;                       (13)

                 where <<IG(\cdot, \cdot)>>, <<G(\cdot, \cdot)>> correspond to the inverse Gamma and Gamma distributions in the scale
                 parametrization, and z follows a half-Cauchy distribution with scale k. Therefore we will re-express
                 the whole hierarchy as:

                                                  <<FORMULA>>;                           (14)
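
                 The gamma and inverse gamma decomposition of the half-Cauchy can be checked by simulation. The
                 sketch below assumes the form z = sqrt(alpha * beta) with alpha ~ Gamma(0.5, scale k^2) and
                 beta ~ InvGamma(0.5, scale 1), and compares it with direct half-Cauchy draws; the scale k = 2 and
                 the function names are hypothetical.

```python
import math
import random
import statistics


def half_cauchy_via_gammas(k, n, seed=0):
    """Sample z = sqrt(alpha * beta) from the assumed gamma decomposition."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        alpha = rng.gammavariate(0.5, k ** 2)    # shape 0.5, scale k^2
        beta = 1.0 / rng.gammavariate(0.5, 1.0)  # reciprocal of Gamma(0.5, 1)
        samples.append(math.sqrt(alpha * beta))
    return samples


def half_cauchy_direct(k, n, seed=1):
    """Sample a half-Cauchy with scale k by inverting its CDF."""
    rng = random.Random(seed)
    return [k * math.tan(0.5 * math.pi * rng.random()) for _ in range(n)]


k = 2.0
via_gammas = half_cauchy_via_gammas(k, 20000)
direct = half_cauchy_direct(k, 20000)
# The median of a half-Cauchy with scale k equals k.
print(statistics.median(via_gammas), statistics.median(direct))
```

                 The median is used for the comparison because the half-Cauchy has no finite mean; both sample
                 medians should be close to k.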

                 It should be mentioned that the improper log-uniform prior is the limiting case of the horseshoe prior
                 when the shapes of the (inverse) Gamma hyperpriors on <<FORMULA>> go to zero [8]. In fact, several well
                 known shrinkage priors can be expressed in this form by altering the shapes of the (inverse) Gamma
                 hyperpriors [3]. For the variational posterior we will employ the following mean ﬁeld approximation:

                                                <<FORMULA>>.

                 Footnote: Note that using mean-field variational approximations (which we chose for simplicity)
                 can potentially underestimate the variance and thus lead to higher bit precisions for the weights. We leave the
                 exploration of more involved posteriors for future work.

                 where <<LN(\cdot, \cdot)>> is a log-normal distribution. It should be mentioned that a similar form of
                 non-centered variational inference for the horseshoe has also been successfully employed for undirected
                 models in [33]. Notice that we can also apply local reparametrizations [37] when we are sampling
                 <<\sqrt{\tilde{\alpha}_i \tilde{\beta}_i}>> and <<\sqrt{s_a s_b}>> by exploiting properties of the log-normal distribution (footnote 7), thus forming the
                 implied:

                                                    <<FORMULA>>                           (17)
                    
                 As a threshold rule for group pruning we will use the negative log-mode (footnote 8) of the local log-normal r.v.
                 <<z_i = s \tilde{z}_i>>, i.e. prune when <<(\sigma^2_{z_i} - \mu_{z_i}) \geq t>>, with <<\mu_{z_i} = \mu_{\tilde{z}_i} + \mu_s>> and <<\sigma^2_{z_i} = \sigma^2_{\tilde{z}_i} + \sigma^2_s>>. This ignores
                 dependencies among the <<z_i>> elements induced by the common scale <<s>>, but nonetheless we found
                 that it works well in practice. Similarly to the group normal-Jeffreys prior, we will replace the
                 distribution over W at each layer with the masked variational posterior mean during test time:

                                                       <<FORMULA>>;                        (19)

                 where m is a binary mask determined according to the aforementioned threshold, M_W are the means
                 of q(W̃), and μ, σ² are the means and variances of the local log-normals over <<FORMULA>>. Furthermore,
                 similarly to the group normal-Jeffreys approach, we will use the variational posterior marginal
                 variances:
                                                      <<FORMULA>>;                           (20)

                 to compute the ﬁnal bit precision for the entire weight matrix W.
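The threshold rule and the test-time masking described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the global scale and the local scales are treated as independent log-normals (the text ignores their dependencies), so their product is log-normal with summed means and variances, and it uses the fact that the mode of LN(μ, σ²) is exp(μ − σ²). The function names, the row-per-group weight layout, the numbers and the threshold are all hypothetical.

```python
import math

def local_scale_params(mu_s, var_s, mu_zt, var_zt):
    """Log-normal parameters of z_i = s * z~_i for independent log-normal factors."""
    return mu_s + mu_zt, var_s + var_zt

def negative_log_mode(mu, var):
    """The mode of LN(mu, var) is exp(mu - var), so -log(mode) = var - mu."""
    return var - mu

def prune_mask(mus, variances, threshold):
    """1 keeps a group, 0 prunes it (pruned when the negative log-mode reaches the threshold)."""
    return [0 if negative_log_mode(m, v) >= threshold else 1
            for m, v in zip(mus, variances)]

def masked_mean_weights(mask, mus, variances, mean_w):
    """Scale row i of the weight means by m_i * E[z_i], with E[z_i] = exp(mu + var / 2)."""
    return [[m * math.exp(mu + 0.5 * var) * w for w in row]
            for m, mu, var, row in zip(mask, mus, variances, mean_w)]

# Hypothetical posterior parameters: one global scale and three groups (rows of W).
mu_s, var_s = -1.0, 0.1
local = [(0.5, 0.1), (-4.0, 2.0), (0.8, 0.05)]
params = [local_scale_params(mu_s, var_s, m, v) for m, v in local]
mus = [p[0] for p in params]
variances = [p[1] for p in params]

mask = prune_mask(mus, variances, threshold=3.0)   # threshold picked by hand
mean_w = [[0.2, -0.1], [0.3, 0.4], [-0.5, 0.6]]
w_test = masked_mean_weights(mask, mus, variances, mean_w)
```

With these made-up numbers the second group has a negative log-mode of 7.1 and is pruned, while the other two (0.7 and 0.35) are kept.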

                 5 Experiments

                 We validated the compression and speed-up capabilities of our models on the well-known architectures
                 of LeNet-300-100 [41] and LeNet-5-Caffe⁹ on MNIST [42] and, similarly to [51], VGG [63]¹⁰ on
                 CIFAR 10 [38]. The groups of parameters were constructed by coupling the scale variables for each
                 filter for the convolutional layers and for each input neuron for the fully connected layers. We provide
                 the algorithms that describe the forward pass using local reparametrizations for fully connected
                 and convolutional layers with each of the employed approximate posteriors in Appendix F. For the
                 horseshoe prior we set the scale τ_0 of the global half-Cauchy prior to a reasonably small value, e.g.
                 τ_0 = 1e−5. This further increases the prior mass at zero, which is essential for sparse estimation
                 and compression. We also found that constraining the standard deviations as described in [46] and
                 "warm-up" [65] help in avoiding bad local optima of the variational objective. Further details about
                 the experimental setup can be found in Appendix A. Determining the threshold for pruning can
                 easily be done by manual inspection, as there are usually two well-separated clusters (signal and noise).
                 We provide a sample visualization in Appendix E.
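When the two clusters are indeed well separated, the manual inspection can be mimicked by a simple heuristic: sort the per-group scores and place the threshold in the middle of the widest gap. This largest-gap rule is an illustrative stand-in for manual inspection, not the procedure used in the experiments, and the scores below are made up.

```python
def largest_gap_threshold(scores):
    """Return the midpoint of the widest gap between consecutive sorted scores."""
    ordered = sorted(scores)
    if len(ordered) < 2:
        raise ValueError("need at least two scores")
    # Each entry is (gap width, midpoint of the gap); max() picks the widest gap.
    gaps = [(b - a, (a + b) / 2.0) for a, b in zip(ordered, ordered[1:])]
    return max(gaps)[1]

# Hypothetical negative log-modes: four "signal" groups and three "noise" groups.
scores = [0.3, 0.5, 0.4, 6.8, 7.1, 7.4, 0.6]
threshold = largest_gap_threshold(scores)
pruned = [s >= threshold for s in scores]
```

The heuristic fails silently when the scores do not form two clusters, so a visual check of the kind the text recommends remains advisable.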

                 5.1 Architecture learning & bit precisions

                 We first demonstrate the group sparsity capabilities of our methods by illustrating the learned
                 architectures in Table 1, along with the inferred bit precision per layer. As we can observe, our
                 methods infer significantly smaller architectures for LeNet-300-100 and LeNet-5-Caffe, compared
                 to Sparse Variational Dropout, Generalized Dropout and Group Lasso. Interestingly, we observe
                 that for the VGG network almost all of the big 512-feature-map layers are drastically reduced to around
                 10 feature maps, whereas the initial layers are mostly kept intact. Furthermore, all of the Bayesian
                 methods considered require far fewer than the standard 32 bits per layer to represent the weights,
                 sometimes even allowing for 5-bit precisions.

                    7 The product of log-normal r.v.s is another log-normal and a power of a log-normal r.v. is another log-normal.
                    8 Empirically, it slightly better separates the scales compared to the negative log-mean <<FORMULA>>.
                    9 https://github.com/BVLC/caffe/tree/master/examples/mnist
                    10 The adapted CIFAR 10 version described at http://torch.ch/blog/2015/07/30/cifar.html.

                 Table 1: Learned architectures with Sparse VD [51], Generalized Dropout (GD) [66] and Group
                 Lasso (GL) [73]. Bayesian Compression (BC) with group normal-Jeffreys (BC-GNJ) and group
                 horseshoe (BC-GHS) priors correspond to the proposed models. We show the number of neurons left
                 after pruning along with the average bit precisions for the weights at each layer.

                                        <<TABLE>>

                 5.2 Compression Rates

                 For the actual compression task we compare our method to current work in three different scenarios:
                 (i) compression achieved only by pruning, where for non-group methods we use the CSC format
                 to store parameters; (ii) compression based on the former but with reduced bit precision per layer
                 (only for the weights); and (iii) the maximum compression rate as proposed by [25]. We believe
                 these to be relevant scenarios because (i) can be applied with already existing frameworks such as
                 Tensorflow [1], and (ii) is a practical scheme given that upcoming GPUs and frameworks will be designed
                 to work with low and mixed precision arithmetic [43, 23]. For (iii), we perform k-means clustering on
                 the weights with k = 32 and consequently store a weight index that points to a codebook of available
                 weights. Note that the latter achieves the highest compression rate but is fairly impractical at
                 test time, since the original matrix needs to be restored for each layer.

                 Table 2: Compression results for our methods. "DC" corresponds to the Deep Compression method
                 introduced in [25], "DNS" to the method of [21] and "SWS" to the Soft-Weight Sharing of [70].
                 Numbers marked with * are best case guesses.

                            <<TABLE>>

                 As we can observe in Table 2, our methods are competitive with the state of the art for
                 LeNet-300-100 while offering significantly better compression rates on the LeNet-5-Caffe
                 architecture, without any loss in accuracy. Note that group sparsity and weight sparsity can be
                 combined so as to further prune some weights when a particular group is not removed; thus we can
                 potentially further boost compression performance on, e.g., LeNet-300-100.

                 For the VGG network we observe that training from a random initialization yielded consistently
                 lower accuracy (around 1%-2% lower) compared to initializing the means of the approximate
                 posterior from a pretrained network, similarly to [51]; thus we only report the latter results¹¹.
                 After initialization we trained the VGG network regularly for 200 epochs using Adam with
                 the default hyperparameters. We observe a small drop in accuracy for the final models when using
                 the deterministic version of the network for prediction, but averaging across multiple
                 samples restores the original accuracy. Note that in general we can maintain the original accuracy on
                 VGG without sampling by simply finetuning with a small learning rate, as done in [51]. This will
                 still induce (less) sparsity, but unfortunately it does not lead to good compression, as the bit precision
                 remains very high because the marginal variances of the weights are not appropriately increased.
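The codebook idea behind scenario (iii) can be illustrated with a small sketch: cluster the weights of a layer with k-means, then store only a per-weight index into the codebook. The code below is plain Lloyd's algorithm on scalars with k = 32, so each index needs log2(32) = 5 bits. It is an illustration under simplifying assumptions, not the Deep Compression pipeline of [25]; the helper name and the Gaussian "weights" are hypothetical.

```python
import math
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on scalars; returns (codebook, indices)."""
    rng = random.Random(seed)
    codebook = rng.sample(values, k)          # initial centers drawn from the data

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - codebook[c]))

    for _ in range(iters):
        indices = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, i in zip(values, indices) if i == c]
            if members:                        # an empty cluster keeps its old center
                codebook[c] = sum(members) / len(members)
    indices = [nearest(v) for v in values]     # final assignment to the updated centers
    return codebook, indices

rng = random.Random(1)
weights = [rng.gauss(0.0, 0.1) for _ in range(1000)]   # made-up layer weights
codebook, indices = kmeans_1d(weights, k=32)

bits_per_index = math.ceil(math.log2(len(codebook)))    # 5 bits for k = 32
restored = [codebook[i] for i in indices]               # decoding step needed at test time
mean_abs_error = sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)
```

The `restored` line is the step the text calls impractical: the dense weight matrix has to be rebuilt from the indices for every layer before it can be used.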

                 5.3 Speed and energy consumption

                 We demonstrate that our method is competitive with [73], denoted as GL, a method that explicitly
                 prunes convolutional kernels to reduce compute time. We measure the time and energy consumption
                 of one forward pass of a mini-batch with batch size 8192 through LeNet-5-Caffe. We average over 10^4
                 forward passes, and all experiments were run with Tensorflow 1.0.1, CUDA 8.0 and the respective cuDNN.
                 We use either 16 CPUs running in parallel (CPU) or a Titan X (GPU). Note that we only use the pruned
                 architecture, since lower bit precision would further increase the speed-up but is not implementable in
                 any common framework. Further, all methods we compare to in the latter experiments would barely
                 show any improvement, since they do not learn to prune groups but only individual parameters. We
                 present our results in Figure 1. As expected, the largest effect on the speed-up is caused by GPU usage.
                 However, both our models and the best competing models reach a speed-up factor of around 8x. We
                 can further save about 3x in energy costs by using our architecture instead of the original one on a
                 GPU. For larger networks the speed-up is even higher: for the VGG experiments with batch size 256
                 we obtain a speed-up factor of 51x.
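The measurement protocol (average the wall-clock time of many forward passes, then report the ratio to a baseline) can be sketched generically. The two functions timed below are stand-ins for forward passes of the original and the pruned network, not real models, and all names are hypothetical.

```python
import time

def average_seconds(fn, repeats):
    """Mean wall-clock time of fn() over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

def speed_up(baseline_seconds, candidate_seconds):
    """Speed-up factor of the candidate relative to the baseline."""
    return baseline_seconds / candidate_seconds

# Stand-ins for a forward pass through the original and the pruned architecture.
def dense_pass():
    return sum(i * i for i in range(20000))

def pruned_pass():
    return sum(i * i for i in range(2000))

t_dense = average_seconds(dense_pass, repeats=100)
t_pruned = average_seconds(pruned_pass, repeats=100)
factor = speed_up(t_dense, t_pruned)
```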

                                                <<FIGURE>>

                 Figure 1: Left: Average time a batch of 8192 samples takes to pass through LeNet-5-Caffe. Numbers on
                 top of the bars represent the speed-up factor relative to the CPU implementation of the original network.
                 Right: Energy consumption of the GPU for the same process (when run on GPU).

                 6 Conclusion

                 We introduced Bayesian compression, a way to tackle efﬁciency and compression in deep neural
                 networks in a uniﬁed and principled way. Our proposed methods allow for theoretically principled
                 compression of neural networks, improved energy efﬁciency with reduced computation while naturally
                 learning the bit precisions for each weight. This serves as a strong argument in favor of Bayesian
                 methods for neural networks, when we are concerned with compression and speed up.

                   11 We also tried to ﬁnetune the same network with Sparse VD, but unfortunately it increased the error
                 considerably (around 3% extra error), therefore we do not report those results.

                   Acknowledgments
                   We would like to thank Dmitry Molchanov, Dmitry Vetrov, Klamer Schutte and Dennis Koelma for
                   valuable discussions and feedback. This research was supported by TNO, NWO and Google.


                   References
                    [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
                        M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv
                        preprint arXiv:1603.04467, 2016.
                    [2] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical
                        Society. Series B (Methodological), pages 99–102, 1974.
                    [3] A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of Gaussians. In Advances in Neural
                        Information Processing Systems, pages 523–531, 2011.
                    [4] E. Azarkhish, D. Rossi, I. Loi, and L. Benini. Neurostream: Scalable and energy efficient deep learning
                        with smart memory cubes. arXiv preprint arXiv:1701.06420, 2017.
                    [5] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing
                        Systems, pages 2654–2662, 2014.
                    [6] E. Beale, C. Mallows, et al. Scale mixing of symmetric distributions with zero means. The Annals of
                        Mathematical Statistics, 30(4):1145–1151, 1959.
                    [7] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks.
                        Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11
                        July 2015, 2015.
                    [8] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika,
                        97(2):465–480, 2010.
                    [9] S. Chai, A. Raghavan, D. Zhang, M. Amer, and T. Shields. Low precision neural networks using subband
                        decomposition. arXiv preprint arXiv:1703.08595, 2017.
                   [10] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural
                        networks. arXiv preprint arXiv:1506.04449, 2015.
                   [11] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations
                        constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
                   [12] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision
                        multiplications. arXiv preprint arXiv:1412.7024, 2014.
                   [13] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary
                        weights during propagations. In Advances in Neural Information Processing Systems, pages 3105–3113,
                        2015.
                   [14] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in
                        Neural Information Processing Systems, pages 2148–2156, 2013.
                   [15] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference
                        complexity. arXiv preprint arXiv:1703.08651, 2017.
                   [16] M. A. Figueiredo. Adaptive sparseness using Jeffreys' prior. Advances in Neural Information Processing
                        Systems, 1:697–704, 2002.
                   [17] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep
                        learning. ICML, 2016.
                   [18] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector
                        quantization. ICLR, 2015.
                   [19] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information
                        Processing Systems, pages 2348–2356, 2011.
                   [20] P. D. Grünwald. The minimum description length principle. MIT Press, 2007.
                   [21] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural
                        Information Processing Systems, pages 1379–1387, 2016.
                   [22] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical
                        precision. CoRR, abs/1502.02551, 392, 2015.
                   [23] P. Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks. Master's thesis,
                        University of California, 2016.
                   [24] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks.
                        In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
                   [25] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning,
                        trained quantization and Huffman coding. ICLR, 2016.
                   [26] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on
                        ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages
                        1026–1034, 2015.
                   [27] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint
                        arXiv:1503.02531, 2015.
                   [28] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length
                        of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages
                        5–13. ACM, 1993.
                   [29] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
                        networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
                   [30] A. Honkela and H. Valpola. Variational learning and bits-back coding: an information-theoretic view to
                        Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 2004.
                   [31] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam.
                        Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint
                        arXiv:1704.04861, 2017.
                   [32] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: AlexNet-level
                        accuracy with 50x fewer parameters and <0.5MB model size. ICLR, 2017.
                   [33] J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint
                        arXiv:1602.03807, 2016.
                   [34] T. Karaletsos and G. Rätsch. Automatic relevance determination for deep generative models. arXiv preprint
                        arXiv:1505.07765, 2015.
                   [35] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning
                        Representations (ICLR), San Diego, 2015.
                   [36] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning
                        Representations (ICLR), 2014.
                   [37] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick.
                        Advances in Neural Information Processing Systems, 2015.
                   [38] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
                   [39] N. D. Lawrence. Note relevance determination. In Neural Nets WIRN Vietri-01, pages 128–133. Springer,
                        2002.
                   [40] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS,
                        volume 2, pages 598–605, 1989.
                   [41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
                        Proceedings of the IEEE, 86(11):2278–2324, 1998.
                   [42] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.
                   [43] D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks.
                        Workshop ICML, 2016.
                   [44] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks.
                        arXiv preprint arXiv:1511.06393, 2015.
                   [45] C. Louizos. Smart regularization of deep architectures. Master's thesis, University of Amsterdam, 2015.
                   [46] C. Louizos and M. Welling. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.
                        ArXiv e-prints, Mar. 2017.
                   [47] D. J. MacKay. Probable networks and plausible predictions—a review of practical Bayesian methods for
                        supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
                   [48] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with
                        fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
                   [49] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to
                        weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981, 2016.
                   [50] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the
                        American Statistical Association, 83(404):1023–1032, 1988.
                   [51] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. arXiv
                        preprint arXiv:1701.05369, 2017.
                   [52] E. Nalisnick, A. Anandkumar, and P. Smyth. A scale mixture perspective of multiplicative noise in neural
                        networks. arXiv preprint arXiv:1506.03208, 2015.
                   [53] R. M. Neal. Bayesian learning for neural networks. PhD thesis, Citeseer, 1995.
                   [54] S. E. Neville, J. T. Ormerod, M. Wand, et al. Mean field variational Bayes for continuous sparse signal
                        shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113–1151, 2014.
                   [55] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of
                        hierarchical models. Statistical Science, pages 59–73, 2007.
                   [56] C. Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019,
                        1987.
                   [57] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: ImageNet classification using binary
                        convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer,
                        2016.
                   [58] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in
                        deep generative models. In Proceedings of the 31th International Conference on Machine Learning, ICML
                        2014, Beijing, China, 21-26 June 2014, pages 1278–1286, 2014.
                   [59] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
                   [60] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, pages 1080–1100, 1986.
                   [61] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural
                        networks. arXiv preprint arXiv:1607.00485, 2016.
                   [62] S. Shi and X. Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units.
                        arXiv preprint arXiv:1704.07724, 2017.
                   [63] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
                        ICLR, 2015.
                   [64] M. Sites. IEEE standard for floating-point arithmetic. 2008.
                   [65] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders.
                        arXiv preprint arXiv:1602.02282, 2016.
                   [66] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
                   [67] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to
                        prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958,
                        2014.
                   [68] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and
                        survey. arXiv preprint arXiv:1703.09039, 2017.
                   [69] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
                        Series B (Methodological), pages 267–288, 1996.
                   [70] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. ICLR, 2017.
                   [71] G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision
                        and sparsity. arXiv preprint arXiv:1610.00324, 2016.
                   [72] C. S. Wallace. Classification by minimum-message-length inference. In International Conference on
                        Computing and Information, pages 72–81. Springer, 1990.
                   [73] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In
                        Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
                   [74] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using
                        energy-aware pruning. CVPR, 2017.
                   [75] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
                   [76] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. ICLR, 2017.


                   Appendix

                   A. Detailed experimental setup

                 We implemented our methods in Tensorflow [1] and optimized the variational parameters using
                 Adam [35] with the default hyperparameters. The means of the conditional Gaussian <<q_φ(W|z)>>
                 were initialized with the scheme proposed in [26], whereas the logs of the standard deviations were
                 initialized by sampling from N(−9, 1e−4). The parameters of q_φ(z) were initialized such that the
                 overall mean of z is ≈ 1 and the overall variance is very low (1e−8); this ensures that all of the
                 groups are active during the initial training iterations.

                 Table 3: Floating point formats

                                    <<TABLE>>

                 As for the standard deviation constraints: for the LeNet-300-100 architecture we constrained the
                 standard deviation of the first layer to be ≤ 0.2, whereas for LeNet-5-Caffe we constrained
                 the standard deviation of the first layer to be ≤ 0.5. The remaining standard deviations were left
                 unconstrained. For the VGG network we constrained the standard deviations of the 64 and 128
                 feature map layers to be ≤ 0.1 and the standard deviations of the 256 feature map layers to be ≤ 0.2,
                 and left the rest of the standard deviations unconstrained. We also found it beneficial to incorporate
                 "warm-up" [65], i.e., we annealed the negative KL-divergence from the prior to the approximate
                 posterior with a linear schedule for the first 100 epochs. We initialized the means of the approximate
                 posterior with the weights and biases obtained from a VGG network trained with batch normalization
                 and dropout on CIFAR 10. For our method we disabled batch normalization during training.
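The warm-up schedule can be written down directly. The sketch below assumes the KL weight rises linearly from 0 at the start of training to 1 after 100 epochs; the text only states that the schedule is linear over the first 100 epochs, so the start value and the function names are assumptions.

```python
def kl_weight(epoch, warmup_epochs=100):
    """Linear warm-up: the weight on the KL term grows from 0 to 1, then stays at 1."""
    return min(1.0, epoch / float(warmup_epochs))

def annealed_bound(expected_log_lik, neg_kl, epoch):
    """Variational lower bound with an annealed negative-KL term (to be maximized)."""
    return expected_log_lik + kl_weight(epoch) * neg_kl

# Example: halfway through the warm-up only half of the KL penalty is applied.
halfway = annealed_bound(expected_log_lik=-10.0, neg_kl=-4.0, epoch=50)
```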
                 As for preprocessing the data: for MNIST the only preprocessing we did was to rescale the digits to
                 lie in the [−1, 1] range, and for CIFAR 10 we used the preprocessed dataset provided by [75].
                 Furthermore, note that by pruning a given filter at a particular convolutional layer we can also
                 prune the parameters corresponding to that feature map in the next layer. This similarly holds for
                 fully connected layers: if we drop a given input neuron, then the weights corresponding to that node
                 in the previous layer can also be pruned.
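For two consecutive fully connected layers, this propagation of pruning amounts to deleting a row in one weight matrix and the matching column in the other. The following sketch assumes weights stored as lists of rows with shape (inputs × outputs); the function names and the numbers are hypothetical.

```python
def prune_hidden_units(w1, b1, w2, keep):
    """Remove the hidden units not listed in `keep` from two consecutive dense layers.

    w1: n_in x n_hidden, b1: n_hidden, w2: n_hidden x n_out (lists of rows).
    Dropping hidden unit j as an input of the second layer removes row j of w2,
    and with it column j of w1 and entry j of b1, which produced that unit.
    """
    w1_p = [[row[j] for j in keep] for row in w1]
    b1_p = [b1[j] for j in keep]
    w2_p = [w2[j] for j in keep]
    return w1_p, b1_p, w2_p

def dense(x, w, b):
    """Affine layer: y_j = sum_i x_i * w[i][j] + b_j."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

x = [1.0, 2.0]
w1 = [[1.0, 5.0, -1.0], [0.5, 7.0, 2.0]]
b1 = [0.0, 1.0, 0.5]
w2 = [[1.0, 2.0], [0.0, 0.0], [3.0, -1.0]]   # hidden unit 1 is masked out (zero row)
b2 = [0.1, 0.2]

full = dense(dense(x, w1, b1), w2, b2)
w1_p, b1_p, w2_p = prune_hidden_units(w1, b1, w2, keep=[0, 2])
small = dense(dense(x, w1_p, b1_p), w2_p, b2)
```

Both networks produce the same output, but the pruned one stores 10 weights and biases instead of 15 (not counting `b2`).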

                 B. Standards for Floating-Point Arithmetic

                 Floating-point values eventually need to be represented in a binary basis in a computer. The most
                 common standard today is the IEEE 754-2008 convention [64]. It defines x-bit base-2 formats,
                 officially referred to as binaryx, with x ∈ {16, 32, 64, 128}. The formats are also widely known as
                 half, single, double and quadruple precision floats, respectively, and are used in almost all programming
                 languages as a standard. The format considers 3 kinds of bits: one sign bit, w exponent bits and p
                 precision bits.

                                            <<FIGURE>>

                             Figure 2: A symbolic representation of the binaryx format [64].


                 The sign bit determines the sign of the number to be represented. The exponent E is a w-bit signed
                 integer, e.g. for single precision w = 8 and thus E ∈ [−127, 128]. In practice, the exponent range
                 is smaller since the first and the last number are reserved for special numbers. The true significand or
                 mantissa includes t bits on the right of the binary point. There is an implicit leading bit with value
                 one. A value is consequently decomposed as follows

                                    <<FORMULA>>                          (21)
                                                
                                    <<FORMULA>>                          (22)

                 In Table 3, we summarize common and less common floating point formats.
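As a concrete illustration of this decomposition, the sketch below splits a single precision (binary32) number into its sign bit, unbiased exponent and significand with the implicit leading one, using Python's standard `struct` module. It handles normal numbers only; the function names are hypothetical.

```python
import struct

def decompose_binary32(value):
    """Split a normal binary32 number into (sign, unbiased exponent, significand)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF        # w = 8 exponent bits
    fraction = bits & 0x7FFFFF                   # 23 bits right of the binary point
    if biased_exponent in (0, 0xFF):
        raise ValueError("zero, subnormals, infinities and NaNs use the reserved exponents")
    significand = 1.0 + fraction / float(1 << 23)   # implicit leading bit with value one
    return sign, biased_exponent - 127, significand

def recompose(sign, exponent, significand):
    """Inverse of the decomposition: (-1)^sign * significand * 2^exponent."""
    return (-1.0) ** sign * significand * 2.0 ** exponent

s, e, m = decompose_binary32(-6.5)   # -6.5 = -1.625 * 2**2
```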

                 It is, however, possible to design a self-defined format. There are 3 important quantities
                 when choosing the right specification: overflow, underflow and unit round-off, also known as machine
                 precision. Each one can be computed knowing the number of exponent and significand bits. In
                 our work, for example, we consider a format that uses significantly fewer exponent bits, since network
                 parameters usually lie in [−10, 10]. We set the unit round-off equal to the precision and thus
                 can compute the significand bits necessary to represent a specific weight.
                 Beyond designing a tailored floating point format for deep learning, recent work has also explored the
                 possibility of deep learning with mixed formats [43, 23]. For example, the activations could have
                 high precision while the weights have low precision.
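The three quantities can be computed for any IEEE-style layout. The sketch below assumes the usual conventions (exponent bias 2^(w−1) − 1, the all-zeros and all-ones exponents reserved, round-to-nearest), and it inverts the unit round-off to obtain the number of fraction bits needed for a target precision. How the target is derived from the posterior is not restated here, so the target value passed in is simply an input; all names are hypothetical.

```python
import math

def format_limits(exponent_bits, fraction_bits):
    """Overflow threshold, smallest normal number and unit round-off of an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    e_max, e_min = bias, 1 - bias                  # reserved exponents excluded
    largest = (2.0 - 2.0 ** -fraction_bits) * 2.0 ** e_max
    smallest_normal = 2.0 ** e_min
    unit_roundoff = 2.0 ** -(fraction_bits + 1)    # round-to-nearest
    return largest, smallest_normal, unit_roundoff

def fraction_bits_for(unit_roundoff):
    """Smallest number of fraction bits whose unit round-off does not exceed the target."""
    return max(0, math.ceil(-math.log2(unit_roundoff)) - 1)

single = format_limits(8, 23)     # binary32
half = format_limits(5, 10)       # binary16
custom = format_limits(4, 3)      # hypothetical small format; largest value is 240
bits_needed = fraction_bits_for(0.1)   # 3 fraction bits give a unit round-off of 0.0625
```

A 4-bit exponent already covers the [−10, 10] range with room to spare, which is the motivation given above for spending fewer bits on the exponent.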

                 C. Shrinkage properties of the normal-Jeffreys and horseshoe priors

                            <<FIGURE>>

                 Figure 3: Comparison of the behavior of the log-uniform / normal-Jeffreys (NJ) prior and the
                 horseshoe (HS) prior (where s = 1). Both priors behave similarly at zero, but the normal-Jeffreys has
                 an extremely heavy tail (thus making it non-normalizable).

                 In this section we provide some insights about the behavior of each of the priors we employ by
                 following the excellent analysis of [8]. We can perform a change of variables and express the scale
                 mixture distribution of eq. 3 in the main paper in terms of a shrinkage coefficient,

                                                     <<FORMULA>>                  (23) 

                 It is easy to observe that eq. 23 corresponds to a continuous relaxation of the spike-and-slab prior:
                 when <<κ = 0>> we have that <<FORMULA>>, i.e. no shrinkage/regularization for w, when
                 <<κ = 1>> we have that <<FORMULA>>, i.e. w is exactly zero, and when <<κ = 1/2>> we have that
                 <<FORMULA>>. Now by examining the implied prior on the shrinkage coefficient κ for both
                 the log-uniform and the horseshoe priors we can better study their behavior. As explained in [8],
                 the half-Cauchy prior on z corresponds to a beta prior on the shrinkage coefficient, <<FORMULA>>,
                 whereas the normal-Jeffreys / log-uniform prior on z corresponds to <<p(κ) = B(ε, ε)>> with <<FORMULA>>.
                 The densities of both of these distributions can be seen in Figure 3b. As we can observe, the log-
                 uniform prior posits a distribution that concentrates almost all of its mass at either κ ≈ 0 or κ ≈ 1,
                 essentially either pruning the parameter or keeping it close to the maximum likelihood estimate due
                 to <<FORMULA>>. In contrast, the horseshoe prior maintains enough probability mass for
                 the in-between values of κ and thus can, potentially, offer better regularization and generalization.
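The horseshoe claim can be checked numerically. The sketch below assumes the standard shrinkage coefficient from the horseshoe analysis in [8], written κ here and computed as κ = 1/(1 + z²); this is an assumption about the form of eq. 23, which is not legible in this copy. It samples z from a standard half-Cauchy and compares the resulting κ with the Beta(1/2, 1/2) law, whose CDF is (2/π)·arcsin(√κ). All names are hypothetical.

```python
import math
import random

def shrinkage(z):
    """Shrinkage coefficient of a scale z (assumed form: 1 / (1 + z^2))."""
    return 1.0 / (1.0 + z * z)

def sample_half_cauchy(rng, scale=1.0):
    """Inverse-CDF sample from a half-Cauchy distribution."""
    return scale * abs(math.tan(math.pi * (rng.random() - 0.5)))

def beta_half_half_cdf(k):
    """CDF of Beta(1/2, 1/2), also known as the arcsine distribution."""
    return 2.0 / math.pi * math.asin(math.sqrt(k))

rng = random.Random(0)
kappas = [shrinkage(sample_half_cauchy(rng)) for _ in range(100000)]

mean_kappa = sum(kappas) / len(kappas)
# Empirical mass near the two extremes and in the middle of [0, 1].
near_edges = sum(k < 0.1 or k > 0.9 for k in kappas) / len(kappas)
middle = sum(0.4 <= k <= 0.6 for k in kappas) / len(kappas)

expected_edges = beta_half_half_cdf(0.1) + (1.0 - beta_half_half_cdf(0.9))
expected_middle = beta_half_half_cdf(0.6) - beta_half_half_cdf(0.4)
```

Under these assumptions roughly 41% of the mass lies within 0.1 of the two extremes, while about 13% remains in the central band [0.4, 0.6], which is the in-between mass the text refers to.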

                 D. Negative KL-divergences for log-normal approximating posteriors

                 Let <<FORMULA>> be a log-normal approximating posterior. Here we will derive the negative
                 KL-divergences to q(z) from inverse gamma, gamma and half-normal distributions.
                 Let p(z) be an inverse gamma distribution, i.e. <<p(z) = IG(α, β)>>. The negative KL-divergence can
                 be expressed as follows:
                     
                              <<FORMULA>>         (24)


                The second term is the entropy of the log-normal distribution which has the following form:

                                    <<FORMULA>>         (25)

                 The ﬁrst term is the negative cross-entropy of the log-normal approximate posterior from the inverse-
                 Gamma prior:
                                  <<FORMULA>>        (26)

                                  <<FORMULA>>        (27)

                 Since the natural logarithm of a log-normal distribution <<FORMULA>> follows a normal distribution
                 <<FORMULA>> we have that <<FORMULA>>. Furthermore we have that <<FORMULA>> then <<FORMULA>>, therefore
                 <<FORMULA>>. Putting everything together we have that: 

                                  <<FORMULA>>         (28) 

                 Therefore the negative KL-divergence is:

                                          <<FORMULA>>                  (29)
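The derivation above can be checked numerically. The following is a minimal sketch, assuming the inverse gamma density is parameterized as p(z) = β^α / Γ(α) · z^(−α−1) · exp(−β/z) and the approximate posterior is q(z) = LN(μ, σ²); the function names are illustrative and not taken from the paper. It uses the two expectations from the text, E_q[log z] = μ and E_q[1/z] = exp(−μ + σ²/2), and compares the closed form against Gauss-Hermite quadrature.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def neg_kl_lognormal_invgamma(mu, sigma, alpha, beta):
    """Closed-form -KL(q || p) with q = LN(mu, sigma^2), p = IG(alpha, beta) (assumed parameterization)."""
    # Entropy of the log-normal distribution.
    entropy = mu + math.log(sigma) + 0.5 * math.log(2.0 * math.pi * math.e)
    # Negative cross-entropy E_q[log p(z)], using E_q[log z] = mu and
    # E_q[1/z] = exp(-mu + sigma^2 / 2).
    neg_cross_entropy = (alpha * math.log(beta) - math.lgamma(alpha)
                         - (alpha + 1.0) * mu
                         - beta * math.exp(-mu + 0.5 * sigma ** 2))
    return neg_cross_entropy + entropy


def neg_kl_by_quadrature(mu, sigma, alpha, beta, n_nodes=60):
    """Same quantity as E_q[log p(z) - log q(z)], integrated over x ~ N(0, 1) with z = exp(mu + sigma * x)."""
    x, w = hermegauss(n_nodes)             # nodes/weights for the weight exp(-x^2 / 2)
    w = w / math.sqrt(2.0 * math.pi)       # normalize to a standard normal expectation
    log_z = mu + sigma * x
    log_p = (alpha * math.log(beta) - math.lgamma(alpha)
             - (alpha + 1.0) * log_z - beta * np.exp(-log_z))
    log_q = -log_z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * x ** 2
    return float(np.sum(w * (log_p - log_q)))


closed = neg_kl_lognormal_invgamma(mu=0.1, sigma=0.5, alpha=2.0, beta=1.5)
numeric = neg_kl_by_quadrature(mu=0.1, sigma=0.5, alpha=2.0, beta=1.5)
print(closed, numeric)  # the two values should agree closely
```

If the paper's inverse gamma uses a different parameterization, only the `neg_cross_entropy` term needs to change.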

                 Now let p(z) be a Gamma prior, i.e. <<p(z) =G( ; )>>. We have that the negative cross-entropy
                 changes to:
                                  <<FORMULA>>        (30)

                                <<FORMULA>>      (31)
                                                            
                                <<FORMULA>>        (32)

                 Therefore the negative KL-divergence is:

                                          <<FORMULA>>                   (33)

                 Now, by employing the aforementioned we can express the negative KL-divergence from
                <<FORMULA>> to <<FORMULA>> as follows:

                                              <<FORMULA>>

                 with the KL-divergence for the weight distribution <<q  (W~)>> given by eq.8 in the main paper.

                            E. Visualizations

                                        <<FIGURE>>

                 Figure 4: Distribution of the thresholds for the Sparse Variational Dropout 4a, Bayesian Compression
                 with group normal-Jeffreys (BC-GNJ) 4b and group Horseshoe (BC-GHS) 4c priors for the three
                 layer LeNet-300-100 architecture. It is easily observed that there are usually two well separable
                 groups with BC-GNJ and BC-GHS, thus making the choice for the threshold easy. Smaller values
                 indicate signal whereas larger values indicate noise (i.e. useless groups).

                                        <<FIGURE>>

                 Figure 5: Distribution of the bit precisions for the Sparse Variational Dropout 5a, Bayesian
                 Compression with group normal-Jeffreys (BC-GNJ) 5b and group Horseshoe (BC-GHS) 5c priors for the
                 three layer LeNet-300-100 architecture. All of the methods usually require far fewer than 32 bits for
                 the weights.

                 F. Algorithms for the feedforward pass

                 Algorithms 1, 2, 3, 4 describe the forward pass using local reparametrizations for fully connected and
                 convolutional layers with the approximate posteriors for the Bayesian Compression (BC) with group
                 normal-Jeffreys (BC-GNJ) and group Horseshoe (BC-GHS) priors employed in the experiments. For
                 the fully connected layers we couple the scales for each input neuron, whereas for the convolutional
                 layers we couple the scales for each output feature map. M_w, Σ_w are the means and variances of each layer,
                 H is a minibatch of activations of size K. For the first layer we have that H = X, where X is the
                 minibatch of inputs. For the convolutional layers N_f is the number of convolutional filters, ∗ is the
                 convolution operator and we assume the [batch, height, width, feature maps] convention.
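As a rough illustration of the local reparametrization idea for a fully connected layer, the sketch below samples the pre-activations directly from their implied Gaussian instead of sampling a weight matrix. It assumes a factorized Gaussian posterior over the weights with means `M_w` and variances `S_w`, and it takes the per-input-neuron scales `z` as an already sampled input; how `z` is sampled is specified by the algorithm listings and is not reproduced here. All names are illustrative.

```python
import numpy as np


def fc_local_reparam(H, M_w, S_w, z, rng):
    """Sample pre-activations of a fully connected layer.

    H:   (K, d_in) minibatch of activations
    M_w: (d_in, d_out) weight means
    S_w: (d_in, d_out) weight variances
    z:   (K, d_in) sampled multiplicative scales, one per input neuron (assumed given)
    """
    H_hat = H * z                          # apply the coupled scale to each input neuron
    M_h = H_hat @ M_w                      # mean of the pre-activations
    V_h = (H_hat ** 2) @ S_w               # variance of the pre-activations
    eps = rng.standard_normal(M_h.shape)   # one noise draw per pre-activation
    return M_h + np.sqrt(V_h) * eps


rng = np.random.default_rng(0)
H = rng.standard_normal((5, 3))
M_w = rng.standard_normal((3, 4))
S_w = np.full((3, 4), 0.01)
z = np.ones((5, 3))
print(fc_local_reparam(H, M_w, S_w, z, rng).shape)
```

Sampling at the level of pre-activations keeps the noise independent across the minibatch, which is the usual motivation for this trick.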

                   Algorithm 1 Fully connected BC-GNJ layer h. 
                   
                            <<ALGORITHM>>
                   
                   Algorithm 2 Convolutional BC-GNJ layer h.
                
                            <<ALGORITHM>>

                 Algorithm 3 Fully connected BC-GHS layer h.
                 
                            <<ALGORITHM>>
                 
                 Algorithm 4 Convolutional BC-GHS layer h.

                            <<ALGORITHM>>           

<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>
Channel Pruning for Accelerating Very Deep Neural Networks 
Yihui He*  Xiangyu Zhang  Jian Sun  
Xi'an Jiaotong University  Megvii Inc.  Megvii Inc.  
Xi'an, 710049, China  Beijing, 100190, China  Beijing, 100190, China  
heyihui@stu.xjtu.edu.cn  zhangxiangyu@megvii.com  sunjian@megvii.com  

Abstract 
In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5× speed-up along with only a 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet and Xception, and suffers only 1.4% and 1.0% accuracy loss under 2× speed-up respectively, which is significant. 
1. Introduction 
Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one. 
Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, feature map width (number of channels) cannot be reduced, which makes it difficult to decompose the 1 × 1 convolutional layers favored by modern networks (e.g., GoogLeNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks 

<<FIGURE>>

Figure 1. Structured simplification methods that accelerate CNNs: 
(a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces number of channels in each layer (focus of this paper). 
a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required. 
Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited. 
In this paper, we propose a new inference-time approach for channel pruning, utilizing inter-channel redundancy. Inspired by the improvement of tensor factorization through feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing reconstruction error on its output feature maps, as shown in Fig. 2. We solve this

<<FIGURE>>

Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation that two channels are pruned for feature map B. Thus corresponding channels of filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C, k_h × k_w: kernel size. 
minimization problem by two alternating steps: channel selection and feature map reconstruction. In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternate between the two steps. Further, we approximate the network layer-by-layer, with accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]). 
For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms the previous state of the art. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4% and 1.0% accuracy loss respectively. 

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22]. 
Optimized implementation based methods [35, 47, 27, 4] accelerate convolution, with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity. 
Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers up to 50×. However, in practice, the actual speed-up may depend heavily on the implementation. 
Tensor factorization [22, 28, 13, 24] decomposes weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into a combination of 3 × 3 and 1 × 1 layers, driven by feature map redundancy. 
Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely explored. 
Inference-time channel pruning is challenging, as reported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, and mainly operate on the fully connected layers. For data-free approaches [31, 3], results for the speed-up ratio (e.g., 5×) have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible to work on very deep models and large datasets. From our observation, [31] is sometimes even worse than the naive solution (Sec. 4.1.1). 

3. Approach 

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks. 

3.1. Formulation 

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction. We need to reconstruct the following feature maps using the selected channels. 
Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternate between the two steps. 
Formally, to prune a feature map with c channels, we consider applying n × c × k_h × k_w convolutional filters W on <<FORMULA>> input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and k_h, k_w are the kernel size. For simple representation, the bias term is not included in our formulation. To prune the input channels from c to desired <<FORMULA>>, while minimizing reconstruction error, we formulate our problem as follows: 

<<FORMULA>>       (1)

‖·‖_F is the Frobenius norm. X_i is the <<FORMULA>> matrix sliced from the ith channel of the input volumes X, i = 1, ..., c. W_i is the n × k_h k_w filter weights sliced from the ith channel of W. β is the coefficient vector of length c for channel selection, and β_i is the ith entry of β. Notice that, if β_i = 0, X_i will no longer be useful and can be safely pruned from the feature map. W_i can also be removed.
Optimization. Solving this minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the ℓ_0 to ℓ_1 regularization: 

<<FORMULA>>       (2)

λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add a constraint ∀i ‖W_i‖_F = 1 to this formulation, which avoids the trivial solution. 
Now we solve this problem in two steps. First, we fix W and solve for β for channel selection. Second, we fix β and solve for W to minimize the reconstruction error. 
(i) The subproblem of β. In this case, W is fixed. We solve for β for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection. 

<<FORMULA>>       (3) 
Here Z_i = X_i W_i^⊤ (size N × n). We will ignore the ith channel if β_i = 0. 
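The channel-selection subproblem can be sketched in a few lines. The code below is a hedged illustration, not the paper's implementation (which relies on scikit-learn solvers): it assumes the objective (1/(2N))·‖Y − Σ_i β_i Z_i‖_F² + λ‖β‖_1 with Z_i = X_i W_i^⊤, flattens each Z_i into one column of a design matrix, and runs plain coordinate descent with soft-thresholding. The array layout and function name are assumptions made for the example.

```python
import numpy as np


def lasso_channel_selection(X, W, Y, lam, n_iter=200):
    """Solve for beta with W fixed.

    X: (c, N, k) per-channel input patches, k = k_h * k_w
    W: (c, n, k) per-channel filter slices
    Y: (N, n) target responses
    """
    c, N, _ = X.shape
    # Z_i = X_i W_i^T has shape (N, n); each one becomes a column of A.
    A = np.stack([(X[i] @ W[i].T).ravel() for i in range(c)], axis=1)
    y = Y.ravel()
    beta = np.zeros(c)
    col_sq = (A ** 2).sum(axis=0) / N
    resid = y - A @ beta
    for _ in range(n_iter):
        for j in range(c):
            if col_sq[j] == 0.0:
                continue
            resid += A[:, j] * beta[j]              # remove channel j's contribution
            rho = A[:, j] @ resid / N
            # Soft-thresholding update for coordinate j.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= A[:, j] * beta[j]              # add the updated contribution back
    return beta


rng = np.random.default_rng(0)
c, N, n, k = 6, 200, 4, 9
X = rng.standard_normal((c, N, k))
W = rng.standard_normal((c, n, k))
Y = X[0] @ W[0].T + X[3] @ W[3].T                   # only channels 0 and 3 matter
print(np.round(lasso_channel_selection(X, W, Y, lam=1.0), 3))
```

Channels whose coefficient is driven exactly to zero are the ones that would be pruned.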
(ii) The subproblem of W. In this case, β is fixed. We utilize the selected channels to minimize reconstruction error. We can find the optimized solution by least squares: 

<<FORMULA>>. (4)

Here <<FORMULA>> (size N × c k_h k_w). W′ is the n × c k_h k_w reshaped W, <<FORMULA>>. After obtaining the result W′, it is reshaped back to W. Then we assign <<FORMULA>>, so that the constraint <<FORMULA>> is satisfied.
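The reconstruction step amounts to an ordinary least-squares fit on the kept channels. The sketch below assumes the same array layout as a per-channel formulation (X of shape (c, N, k), Y of shape (N, n)) and assumes the final rescaling is β_i ← β_i‖W_i‖_F, W_i ← W_i/‖W_i‖_F, which is one way to satisfy the unit-norm constraint without changing the reconstruction; the function name is illustrative.

```python
import numpy as np


def reconstruct_filters(X, Y, beta):
    """Least-squares refit of W with beta fixed (at least one beta_i must be nonzero).

    X: (c, N, k), Y: (N, n), beta: (c,). Returns W of shape (c, n, k) and the rescaled beta.
    """
    c, N, k = X.shape
    keep = np.flatnonzero(beta)
    # X' = [beta_i X_i] over the kept channels, shape (N, len(keep) * k).
    X_prime = np.concatenate([beta[i] * X[i] for i in keep], axis=1)
    W_prime, *_ = np.linalg.lstsq(X_prime, Y, rcond=None)    # (len(keep) * k, n)
    W = np.zeros((c, Y.shape[1], k))
    new_beta = np.zeros(c)
    for slot, i in enumerate(keep):
        W_i = W_prime[slot * k:(slot + 1) * k].T              # back to (n, k)
        norm = np.linalg.norm(W_i)
        new_beta[i] = beta[i] * norm                          # absorb the norm into beta_i
        W[i] = W_i / norm if norm > 0 else W_i                # enforce unit Frobenius norm
    return W, new_beta


rng = np.random.default_rng(1)
X = rng.standard_normal((4, 50, 2))
W_true = rng.standard_normal((4, 3, 2))
beta = np.array([0.7, 0.0, 1.3, 0.0])
Y = sum(beta[i] * X[i] @ W_true[i].T for i in range(4))
W, new_beta = reconstruct_filters(X, Y, beta)
print(np.round(new_beta, 3))
```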
We alternately optimize (i) and (ii). In the beginning, W is initialized from the trained model, <<FORMULA>>, namely no penalty, and ‖β‖_0 = c. We gradually increase <<FORMULA>>. For each
change of <<FORMULA>>, we iterate these two steps until ‖β‖_0 is stable. 

Once <<FORMULA>> is satisfied, we obtain the final solution W from <<FORMULA>>. In practice, we found that the two-step iteration is time consuming. So we apply (i) multiple times, 

<<FORMULA>>

until <<FORMULA>> is satisfied. Then we apply (ii) just once, to obtain 

<<FORMULA>>

the final result. From our observation, this result is comparable with that of the two-step iteration. Therefore, in the following experiments, we adopt this approach for efficiency. 
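The simplified schedule can be sketched as a small driver loop around any solver for step (i). The geometric growth of the penalty, the starting value, and the function names below are assumptions for illustration; the text only states that the penalty is increased gradually until the number of nonzero coefficients reaches the target. The toy solver uses the fact that, for orthonormal columns, LASSO reduces to soft-thresholding.

```python
import numpy as np


def increase_penalty_until_sparse(solve_beta, c_target, lam0=1e-3, growth=2.0, max_steps=60):
    """Repeat step (i) with a growing penalty until at most c_target channels remain.

    solve_beta: callable lam -> beta (any LASSO solver for the channel-selection subproblem)
    """
    lam = 0.0
    beta = solve_beta(lam)                  # no penalty at the start
    next_lam = lam0
    for _ in range(max_steps):
        if np.count_nonzero(beta) <= c_target:
            break
        lam = next_lam
        beta = solve_beta(lam)
        next_lam *= growth                  # assumed geometric schedule
    return beta, lam


# Toy stand-in for step (i): soft-thresholding of fixed per-channel scores.
scores = np.array([3.0, 0.5, 2.0, 0.1])


def toy_solver(lam):
    return np.sign(scores) * np.maximum(np.abs(scores) - lam, 0.0)


beta, lam = increase_penalty_until_sparse(toy_solver, c_target=2, lam0=0.05)
print(np.flatnonzero(beta), lam)
```

After this loop, the least-squares reconstruction of step (ii) would be applied once on the surviving channels.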
Discussion: Some recent works [48, 1, 17] (though training based) also introduce ℓ1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduced sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach could be applied at inference time. 

3.2. Whole Model Pruning 
Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This could be formalized as: 

<<FORMULA>> (5)

Different from Eqn. 1, Y is replaced by Y′, which is from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning. 
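The sequential scheme can be illustrated on a small stack of fully connected layers, used here as a stand-in for convolutions on sampled patches. In this sketch the kept channels are given as fixed masks rather than chosen by LASSO, and the fit targets the pre-activation output of the original model, since the optimization does not involve the nonlinearity. Function and variable names are illustrative assumptions.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def prune_layers_sequentially(weights, keep_masks, X0):
    """Layer-by-layer least-squares refit with accumulated error taken into account.

    weights[l]:    (d_in, d_out) weights of layer l in the original model
    keep_masks[l]: boolean mask over the input channels of layer l
    X0:            (N, d_0) sampled inputs
    """
    orig_feat, cur_feat = X0, X0
    new_weights = []
    for W, keep in zip(weights, keep_masks):
        target = orig_feat @ W                  # output of the un-pruned model
        inputs = cur_feat[:, keep]              # inputs from the current, already pruned model
        W_new, *_ = np.linalg.lstsq(inputs, target, rcond=None)
        new_weights.append(W_new)
        orig_feat = relu(target)
        cur_feat = relu(inputs @ W_new)
    return new_weights, orig_feat, cur_feat


rng = np.random.default_rng(0)
X0 = rng.standard_normal((64, 8))
weights = [rng.standard_normal((8, 6)), rng.standard_normal((6, 5))]
masks = [np.array([1, 1, 0, 1, 0, 1, 1, 1], bool), np.array([1, 0, 1, 1, 1, 0], bool)]
new_w, orig_out, pruned_out = prune_layers_sequentially(weights, masks, X0)
print(np.linalg.norm(orig_out - pruned_out) / np.linalg.norm(orig_out))
```

Because each layer is fitted against the original model's output while reading the pruned model's input, errors introduced by earlier layers are partly compensated by later ones.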

3.3. Pruning Multi-Branch Networks 
The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into shortcut and residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers which have spatial sizes of 1 × 1, 3 × 3, 1 × 1, Fig. 3, left). Layers other than the first and last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows. 

<<FIGURE>>

Figure 3. Illustration of multi-branch enhancement for residual block. Left: original residual block. Right: pruned residual block with enhancement, cx denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width could be reduced. As for the last layer, rather than approximate Y2 , we try to approximate <<Y1+Y2>> directly (Sec. 3.3 Last layer of residual branch). 
Last layer of residual branch: As shown in Fig. 3, the output layer of a residual block consists of two inputs: feature maps Y1 and Y2 from the shortcut and residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1, Y2 are the original feature maps before pruning. Y2 could be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 cannot be recovered directly. To compensate for this error, the optimization goal of the last layer is changed from Y2 to Y1 − Y1′ + Y2, which does not change our optimization. Here, Y1′ is the current feature map after the previous layers are pruned. When pruning, volumes should be sampled correspondingly from these two branches. 
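The effect of the compensated target can be seen in a small least-squares experiment. The sketch below treats the last layer of the residual branch as a linear map on sampled inputs X and compares fitting the plain target Y2 with fitting the compensated target Y1 − Y1′ + Y2; the data are synthetic and all names are illustrative assumptions.

```python
import numpy as np


def fit_last_layer(X, Y1, Y1_pruned, Y2, compensate=True):
    """Least-squares fit of the last residual-branch layer.

    With compensate=True the target is Y1 - Y1' + Y2, so the branch also absorbs
    the error that pruning earlier blocks left on the parameter-free shortcut.
    """
    target = Y1 - Y1_pruned + Y2 if compensate else Y2
    W, *_ = np.linalg.lstsq(X, target, rcond=None)
    return W


def block_error(X, W, Y1, Y1_pruned, Y2):
    """Error of the block output (current shortcut + refitted branch) against Y1 + Y2."""
    return np.linalg.norm((Y1_pruned + X @ W) - (Y1 + Y2))


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y1 = rng.standard_normal((100, 3))
Y1_pruned = Y1 + 0.3 * rng.standard_normal((100, 3))   # shortcut after earlier pruning
Y2 = rng.standard_normal((100, 3))
W_comp = fit_last_layer(X, Y1, Y1_pruned, Y2, compensate=True)
W_plain = fit_last_layer(X, Y1, Y1_pruned, Y2, compensate=False)
print(block_error(X, W_comp, Y1, Y1_pruned, Y2), block_error(X, W_plain, Y1, Y1_pruned, Y2))
```

By construction the compensated fit minimizes exactly the block-output error, so it can never be worse than the plain fit on the sampled data.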
First layer of residual branch: As illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. Differently, we sample the selected channels on the shared feature maps to construct a new input for the later convolution, as shown in Fig. 3 (right). The computational cost of this operation can be ignored. More importantly, after introducing feature map sampling, the convolution is still "regular". 
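Feature map sampling is essentially a channel gather placed in front of the first convolution. The sketch below assumes the [batch, height, width, channels] layout and uses a 1 × 1 convolution written as a matrix product; it checks that convolving the sampled channels with the pruned filter gives the same result as convolving the full feature map with the pruned channels zeroed out, while the shared feature map itself is left untouched for the shortcut. Names are illustrative.

```python
import numpy as np


def sample_channels(feature_map, keep_idx):
    """Gather the selected channels; feature_map has shape (batch, height, width, channels)."""
    return np.take(feature_map, keep_idx, axis=-1)


def conv1x1(feature_map, W):
    """1 x 1 convolution as a matrix product; W has shape (c_in, c_out)."""
    return feature_map @ W


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4, 6))       # shared input of the residual block
W = rng.standard_normal((6, 3))             # first convolution of the residual branch
keep = np.array([0, 2, 5])                  # channels selected for this branch

branch_out = conv1x1(sample_channels(x, keep), W[keep])   # dense conv on fewer channels
W_masked = np.zeros_like(W)
W_masked[keep] = W[keep]
print(np.allclose(branch_out, conv1x1(x, W_masked)))
```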
Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply our Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original one. From our experiments, it improves top-5 accuracy by 0.5% for 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it outputs "irregular" convolutional layers, which need special library implementation support. We do not adopt it in the following experiments. 

4. Experiment 

We evaluate our approach for the popular VGG Nets [43], ResNet [18], Xception [7] on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11]. 
For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for solvers implementation. For channel pruning, we found that it is enough to extract 5000 images, and 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224 × 224 pixels. We could gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e-5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224 × 224 and mirror. 
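Merging Batch Normalization into the preceding convolution is a standard inference-time transformation. The sketch below assumes filters of shape (n, c, k_h, k_w), per-output-channel BN statistics, and the usual BN formula γ·(y − mean)/sqrt(var + eps) + β; the function name and the eps default are assumptions, not details from the paper.

```python
import numpy as np


def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias that reproduce conv followed by BN at inference time."""
    scale = gamma / np.sqrt(var + eps)                    # one factor per output channel
    W_folded = W * scale[:, None, None, None]
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded


rng = np.random.default_rng(0)
n, c, kh, kw = 4, 3, 3, 3
W = rng.standard_normal((n, c, kh, kw))
b = rng.standard_normal(n)
gamma, beta, mean = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
var = rng.random(n) + 0.5
patch = rng.standard_normal((c, kh, kw))                  # one receptive field

conv_out = np.tensordot(W, patch, axes=3) + b             # response of each filter
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
print(np.allclose(np.tensordot(W_f, patch, axes=3) + b_f, bn_out))
```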

4.1. Experiments with VGG-16 

VGG-16 [43] is a 16-layer single-path convolutional neural network, with 13 convolutional layers. It is widely used in recognition, detection and segmentation, etc. Single view top-5 accuracy for VGG-16 is 89.9%¹. 

4.1.1 Single Layer Pruning 

In this subsection, we evaluate single layer acceleration performance using our algorithm in Sec. 3.1. For better understanding, we compare our algorithm with two naive channel selection strategies. first k selects the first k channels. max response selects channels based on corresponding filters that have high absolute weights sum [31]. For fair comparison, we obtain the feature map indexes selected by each of them, then perform reconstruction (Sec. 3.1 (ii)). We hope that this could demonstrate the importance of channel selection. Performance is measured by the increase of error after a certain layer is pruned without fine-tuning, shown in Fig. 4. 
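The two naive baselines are easy to state in code. The sketch below is one reading of the description: `first_k` keeps the first k channel indexes, and `max_response` ranks each channel by the sum of absolute weights of the filter that produces it, which is assumed here to be the filter of the previous layer with shape (c, c_prev, k_h, k_w). Function names and the weight layout are assumptions.

```python
import numpy as np


def first_k(num_channels, k):
    """Keep the first k channels."""
    return np.arange(min(k, num_channels))


def max_response(W_prev, k):
    """Keep the k channels whose producing filters have the largest absolute weight sum.

    W_prev: (c, c_prev, k_h, k_w) filters of the layer that outputs the c channels.
    """
    scores = np.abs(W_prev).reshape(W_prev.shape[0], -1).sum(axis=1)
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


W_prev = np.array([1.0, -5.0, 3.0, -4.0]).reshape(4, 1, 1, 1)
print(first_k(4, 2), max_response(W_prev, 2))
```

Either index set can then be passed to the least-squares reconstruction step, which is how the baselines are compared on equal footing.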
As expected, error increases as speed-up ratio increases. Our approach is consistently better than other approaches in different convolutional layers under different speed-up ratios. Unexpectedly, sometimes max response is even worse than first k. We argue that max response ignores correlations between different filters. Filters with large absolute weight may have strong correlation. Thus selection based on filter weights is less meaningful. Correlation on feature maps is worth exploiting. We can find that channel selection 
¹ http://www.vlfeat.org/matconvnet/pretrained/ 

<<FIGURE>>

Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify the importance of channel selection referred in Sec. 3.1, we considered two naive baselines. first k selects the first k feature maps. max response selects channels based on absolute sum of corresponding weights filter [31]. Our approach is consistently better (smaller is better). 

<<TABLE>>

Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better). 
affects reconstruction error a lot. Therefore, it is important for channel pruning. 
Also notice that channel pruning gradually becomes hard, from shallower to deeper layers. It indicates that shallower layers have much more redundancy, which is consistent with [52]. We could prune more aggressively on shallower layers in whole model acceleration. 


4.1.2 Whole Model Pruning 
Shown in Table 1, whole model acceleration results under 2×, 4×, 5× are demonstrated. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively for shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1:1.5. conv5_x are not pruned, since they only contribute 9% of the computation in total and are not redundant. 
After fine-tuning, we could reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22], without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and the absolute speed-up ratio on GPU is much higher (Table 3). 
Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52]. As demonstrated in Table 2, 

<<TABLE>>

Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better). 
our 3 cardinalities acceleration (spatial, channel factorization, and channel pruning, denoted by 3C) outperforms the previous state of the art. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: <<FORMULA>>. 
We apply spatial factorization, channel factorization, and our channel pruning together sequentially layer-by-layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality. 


4.1.3 Comparisons of Absolute Performance 
We further evaluate the absolute performance of acceleration on GPU. Results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN 5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged over 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead. They could not gain much absolute speed-up. Though our approach also encountered performance degradation, it generalizes better on GPU than other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because current libraries and hardware prefer a single large convolution to several small ones. 

4.1.4 Comparisons with Training from Scratch 
Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with its from-scratch counterparts. To be fair, we evaluated both the from-scratch counterpart and a normal setting network that has the same computational complexity and the same architecture. 
Shown in Table 4, we observed that it is difficult for from-scratch counterparts to reach competitive accuracy. Our model outperforms the from-scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] showing that the model could be easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers. 
For from scratch (uniformed), the number of filters in each layer is reduced by half (e.g., reduce conv1_1 from 64 to 32). We can observe that normal setting networks of the same complexity could not reach the same accuracy either. This consolidates our idea that there is much redundancy in networks while training. However, redundancy can be removed at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches. 
Notice that there is a 0.6% gap between the from-scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even for a thinner one. Future research could leverage our approach for thin model exploration. 

4.1.5 Acceleration for Detection 
VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16 for Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainable images and 5k test images. The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN. 
The actual running time of Faster R-CNN is 220ms / image. The convolutional layers contribute about 64%. We got an actual time of 94ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop for our 2× model, which is not harmful in practice. 

4.2. Experiments with Residual Architecture Nets 
For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, 1 × 1 convolution is favored, which could hardly be factorized. 

4.2.1 ResNet Pruning 
ResNet complexity uniformly drops on each residual block. Guided by single layer experiments (Sec. 4.1.1), we still prefer reducing shallower layers more heavily than deeper ones. 
Following similar setting as Filter pruning [31], we keep 70% channels for sensitive residual blocks (res5 and blocks close to the position where spatial size 

<<TABLE>>

Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better). 

<<TABLE>>

Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms scratch trained counterparts (smaller is better). 

<<TABLE>>

Table 5. Acceleration for Faster R-CNN detection. 
  
<<TABLE>>

Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with multi-branch enhancement (Sec. 3.3, smaller is better). 
change, e.g. res3a, res3d). As for other blocks, we keep 30% channels. With multi-branch enhancement, we prune branch2a more aggressively within each residual block. The remaining channels ratios for branch2a, branch2b, branch2c is 2:4:3 (e.g., given 30%, we keep 40%, 80%, 60% respectively). 
We evaluate the performance of the multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which could broadcast to every layer after it. In addition, the large input feature map width at the entry of each residual block is well reduced by our feature map sampling. 
 
<<TABLE>>

Table 7. Comparisons for Xception-50, under 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches. Most structured simplification methods are not effective on the Xception architecture (smaller is better). 


4.2.2 Xception Pruning 
Since computational complexity becomes important in model design, separable convolution has received much attention [49, 7]. Xception [7] is already spatially optimized and tensor factorization on 1 × 1 convolutional layers is destructive. Thanks to our approach, it could still be accelerated with graceful degradation. For ease of comparison, we adopt Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and complexity of 4450 MFLOPs. 
We apply the multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Maybe because the Xception block is unstable, Batch Normalization layers must be maintained during pruning. Otherwise it becomes nontrivial to fine-tune the pruned model. 
Shown in Table 7, after fine-tuning, we only suffer a 1.0% increase of error under 2×. Filter pruning [31] could also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%. After training for 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as the results for VGG-16, since modern networks tend to have less redundancy by design. 

<<TABLE>>

Table 8. 2× speed-up comparisons for ResNet-56 on CIFAR-10. The baseline accuracy is 92.8% (one view). We outperform previous approaches and the counterpart trained from scratch (smaller is better). 


4.2.3 Experiments on CIFAR-10 
Even though our approach is designed for large datasets, it generalizes well to small datasets. We perform experiments on the CIFAR-10 dataset [25], which is favored by many acceleration studies. It consists of 50k images for training and 10k for testing in 10 classes. 
We reproduce ResNet-56, which has an accuracy of 92.8% (as a reference, the official ResNet-56 [18] has an accuracy of 93.0%). For 2× acceleration, we follow a setting similar to Sec. 4.2.1 (keeping the final stage unchanged, where the spatial size is 8×8). As shown in Table 8, without fine-tuning, our approach is competitive with the model trained from scratch under 2× speed-up. After fine-tuning, our result is significantly better than Filter pruning [31] and the model trained from scratch. 

5. Conclusion 
To conclude, current deep CNNs are accurate but have high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep networks. The reduced CNNs are inference-efficient networks that maintain accuracy and require only off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on ImageNet, CIFAR-10 and PASCAL VOC. 
In the future, we plan to incorporate our approach into training time, instead of inference time only, which may also accelerate the training procedure. 

References 
[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262–2270, 2016. 1, 2, 3, 6 
[2] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015. 2 
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016. 1, 2 
[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016. 2 
[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995. 3 
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. 6 
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016. 1, 2, 3, 4, 6, 7 
[8] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016. 1, 2 
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4 
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014. 2 
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. 4, 6 
[12] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015. 2 
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014. 2 
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016. 2 
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. Eie: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE Press, 2016. 2 
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015. 2 
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015. 1, 2, 3 
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 1, 2, 3, 4, 6, 8 
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016. 2 

[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 6 
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4 
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014. 1, 2, 5, 6, 7 
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 4, 6 
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015. 2 
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. 4, 8 
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2, 3 
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015. 2 
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014. 2 
[29] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015. 2 
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 2, 3 
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016. 1, 2, 4, 5, 6, 7, 8 
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015. 2 
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. 6 
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015. 2 
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013. 2 
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010. 4 
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53, 2008. 6 
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 4 
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015. 2 
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016. 2 
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015. 6 
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. 6 
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3, 4, 5, 6 
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015. 2 
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 1, 3, 6 
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. 3 
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014. 1, 2 
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances In Neural Information Processing Systems, pages 2074–2082, 2016. 1, 2, 3 
[49] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016. 7 
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365–2369, 2013. 2 
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016. 2 
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016. 1, 2, 3, 5, 6, 7 
<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>
                                    Convex Neural Networks

                      Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
                                         Dept. IRO, Université de Montréal
                             P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
                              {bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca

                                                Abstract
                           Convexity has recently received a lot of attention in the machine learning
                           community, and the lack of convexity has been seen as a major disad-
                           vantage of many learning algorithms, such as multi-layer artiﬁcial neural
                           networks. We show that training multi-layer neural networks in which the
                           number of hidden units is learned can be viewed as a convex optimization
                           problem. This problem involves an inﬁnite number of variables, but can be
                           solved by incrementally inserting a hidden unit at a time, each time ﬁnding
                           a linear classiﬁer that minimizes a weighted sum of errors.

                     1 Introduction
                     The objective of this paper is not to present yet another learning algorithm, but rather to point
                     to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund
                     and Schapire, 1997) and convex optimization. Its main contributions concern the mathematical
                     analysis of an algorithm that is similar to previously proposed incremental NNs, with
                     L1 regularization on the output weights. This analysis helps to understand the underlying
                     convex optimization problem that one is trying to solve.
                     This paper was motivated by the unproven conjecture (based on anecdotal experience) that
                     when the number of hidden units is “large”, the resulting average error is rather insensitive to
                     the random initialization of the NN parameters. One way to justify this assertion is that to re-
                     ally stay stuck in a local minimum, one must have second derivatives positive simultaneously
                     in all directions. When the number of hidden units is large, it seems implausible for none of
                     them to offer a descent direction. Although this paper does not prove or disprove the above
                     conjecture, in trying to do so we found an interesting characterization of the optimization
                     problem for NNs as a convex program if the output loss function is convex in the NN out-
                     put and if the output layer weights are regularized by a convex penalty. More speciﬁcally,
                     if the regularization is the L1 norm of the output layer weights, then we show that a “rea-
                     sonable” solution exists, involving a ﬁnite number of hidden units (no more than the number
                     of examples, and in practice typically much less). We present a theoretical algorithm that
                     is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted
                     one at a time. Each insertion requires solving a weighted classiﬁcation problem, very much
                     like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason
                     et al., 2000; Friedman, 2001).
                     Neural Networks, Gradient Boosting, and Column Generation
                     Denote by $\tilde{x} \in \mathbb{R}^{d+1}$ the extension of vector $x \in \mathbb{R}^d$ with one element with value 1. What
                     we call “Neural Network” (NN) here is a predictor for supervised learning of the form
                     <<FORMULA>> where $x$ is an input vector, <<h_i(x)>> is obtained from a linear discriminant
                     function $h_i$ <<FORMULA>> with e.g. <<s(a) = sign(a)>>, or <<s(a) = tanh(a)>> or
                     <<s(a) = 1/(1+e^{-a})>>. A learning algorithm must specify how to select $m$, the $w_i$'s and the $v_i$'s.
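As an illustration of the kind of predictor described above, here is a minimal sketch. It assumes the elided formulas denote a weighted sum of hidden units, y^(x) = sum_i w_i h_i(x), with h_i(x) = s(v_i · x~); the function names `hidden_unit` and `predict` are hypothetical and not taken from the paper.

```python
import math

def sign(a):
    # Hard threshold; the value at a == 0 is an arbitrary convention here.
    return 1.0 if a >= 0 else -1.0

def hidden_unit(v, x, s=math.tanh):
    """Assumed form h(x) = s(v . x~), where x~ is x extended with a constant 1."""
    x_ext = list(x) + [1.0]
    a = sum(vj * xj for vj, xj in zip(v, x_ext))
    return s(a)

def predict(w, V, x, s=math.tanh):
    """Assumed form y^(x) = sum_i w_i * h_i(x) for the hidden units in V."""
    return sum(wi * hidden_unit(vi, x, s) for wi, vi in zip(w, V))

# Two hidden units on a 2-d input; the last entry of each v acts as a bias.
V = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
w = [0.5, -2.0]
y_hat = predict(w, V, [3.0, 2.0], s=sign)
```

In this sketch, learning would amount to choosing the number of rows of `V`, the rows themselves, and the entries of `w`.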

                     The classical solution (Rumelhart, Hinton and Williams, 1986) involves (a) selecting a loss
                     function $Q(\hat{y}, y)$ that specifies how to penalize for mismatches between $\hat{y}(x)$ and the observed
                     $y$'s (target output or target class), (b) optionally selecting a regularization penalty that
                     favors “small” parameters, and (c) choosing a method to approximately minimize the sum of
                     the losses on the training data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ plus the regularization penalty.
                     Note that in this formulation, an output non-linearity can still be used, by inserting it in the
                     loss function $Q$. Examples of such loss functions are the quadratic loss $\|\hat{y} - y\|^2$, the hinge
                     loss <<FORMULA>> (used in SVMs), the cross-entropy loss <<FORMULA>>
                     (used in logistic regression), and the exponential loss <<FORMULA>> (used in Boosting).
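The loss functions named above can be sketched as follows. The exact expressions are elided in the text, so the forms below are the common textbook versions for targets y in {-1, +1} and should be read as assumptions, not as the paper's own definitions.

```python
import math

def quadratic_loss(y_hat, y):
    # Squared difference between prediction and target.
    return (y_hat - y) ** 2

def hinge_loss(y_hat, y):
    # Zero once the margin y * y_hat reaches 1 (textbook SVM form).
    return max(0.0, 1.0 - y * y_hat)

def logistic_loss(y_hat, y):
    # Cross-entropy of a logistic model, written in margin form.
    return math.log(1.0 + math.exp(-y * y_hat))

def exponential_loss(y_hat, y):
    # Exponential of the negative margin (textbook Boosting form).
    return math.exp(-y * y_hat)
```

All four are convex in `y_hat` for a fixed `y`, which is the property the later analysis relies on.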
                     Gradient Boosting has been introduced in (Friedman, 2001) and (Mason et al., 2000) as a
                     non-parametric greedy-stagewise supervised learning algorithm in which one adds a function
                     at a time to the current solution <<y^(x)>>, in a steepest-descent fashion, to form an additive model
                     as above but with the functions hi typically taken in other kinds of sets of functions, such as
                     those obtained with decision trees. In a stagewise approach, when the (m+1)-th basis <<FORMULA>> is added, 
                     only <<w_m+1>> is optimized (by a line search), like in matching pursuit algorithms. Such
                     a greedy-stagewise approach is also at the basis of Boosting algorithms (Freund and Schapire,
                     1997), which are usually applied using decision trees as bases and $Q$ the exponential loss.
                     It may be difficult to minimize exactly for $w_{m+1}$ and $h_{m+1}$ when the previous bases and
                     weights are fixed, so (Friedman, 2001) proposes to “follow the gradient” in function space,
                     i.e., look for a base learner $h_{m+1}$ that is best correlated with the gradient of the average
                     loss on the <<FORMULA>> (that would be the residue <<FORMULA>> in the case of the square loss). The
                     algorithm analyzed here also involves maximizing the correlation between $Q'$ (the derivative
                     of $Q$ with respect to its first argument, evaluated on the training predictions) and the next
                     basis $h_{m+1}$. However, we follow a “stepwise”, less greedy, approach, in which all the output
                     weights are optimized at each step, in order to obtain convergence guarantees.
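The "follow the gradient" selection step can be sketched as below. This is a minimal illustration, assuming a squared loss with a factor 1/2 (so that the negative gradient is the residual y - y^) and a small finite list of candidate base learners; `pick_base_learner` and the three stumps are hypothetical names introduced for the example.

```python
def pick_base_learner(candidates, xs, ys, y_hats):
    """Return the candidate most correlated with the residuals y - y_hat."""
    residuals = [y - yh for y, yh in zip(ys, y_hats)]

    def score(h):
        # Correlation (unnormalized) between h's outputs and the residuals.
        return sum(r * h(x) for r, x in zip(residuals, xs))

    return max(candidates, key=score)

def stump_low(x):
    return 1.0 if x > 0.5 else -1.0

def stump_mid(x):
    return 1.0 if x > 1.5 else -1.0

def stump_mid_neg(x):
    return -stump_mid(x)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1.0, -1.0, 1.0, 1.0]
y_hats = [0.0, 0.0, 0.0, 0.0]  # current model predicts 0 everywhere
best = pick_base_learner([stump_low, stump_mid, stump_mid_neg], xs, ys, y_hats)
```

Here `stump_mid` reproduces the targets exactly, so it has the largest correlation with the residuals.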
                     Our approach adapts the Column Generation principle (Chvátal, 1983), a decomposition
                     technique initially proposed for solving linear programs with many variables and few con-
                     straints. In this framework, active variables, or “columns”, are only generated as they are
                     required to decrease the objective. In several implementations, the column-generation sub-
                     problem is frequently a combinatorial problem for which efﬁcient algorithms are available.
                     In our case, the subproblem corresponds to determining an “optimal” linear classiﬁer.

                     2 Core Ideas
                     Informally, consider the set $\mathcal{H}$ of all possible hidden unit functions (i.e., of all possible hidden
                     unit weight vectors $v_i$). Imagine a NN that has all the elements in this set as hidden units. We
                     might want to impose precision limitations on those weights to obtain either a countable or
                     even a ﬁnite set. For such a NN, we only need to learn the output weights. If we end up with
                     a ﬁnite number of non-zero output weights, we will have at the end an ordinary feedforward
                     NN. This can be achieved by using a regularization penalty on the output weights that yields
                     sparse solutions, such as the L1 penalty. If in addition the loss function is convex in the output
                     layer weights (which is the case of squared error, hinge loss, $\epsilon$-tube regression loss, and
                     logistic or softmax cross-entropy), then it is easy to show that the overall training criterion
                     is convex in the parameters (which are now only the output weights). The only problem is
                     that there are as many variables in this convex program as there are elements in the set H,
                     which may be very large (possibly inﬁnite). However, we ﬁnd that with L1 regularization,
                     a ﬁnite solution is obtained, and that such a solution can be obtained by greedily inserting
                     one hidden unit at a time. Furthermore, it is theoretically possible to check that the global
                     optimum has been reached.

                     Definition 2.1. Let $\mathcal{H}$ be a set of functions from an input space $X$ to $\mathbb{R}$. Elements of $\mathcal{H}$
                     can be understood as “hidden units” in a NN. Let $\mathcal{W}$ be the Hilbert space of functions from
                     $\mathcal{H}$ to $\mathbb{R}$, with an inner product denoted by <<FORMULA>>. An element of $\mathcal{W}$ can be
                     understood as the output weights vector in a neural network. Let $h(x): \mathcal{H} \to \mathbb{R}$ be the function
                     that maps any element $h_i$ of $\mathcal{H}$ to $h_i(x)$. $h(x)$ can be understood as the vector of activations
                     of hidden units when input $x$ is observed. Let $w \in \mathcal{W}$ represent a parameter (the output
                     weights). The NN prediction is denoted <<FORMULA>>. Let $Q: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a
                     cost function convex in its first argument that takes a scalar prediction $\hat{y}(x)$ and a scalar
                     target value $y$ and returns a scalar cost. This is the cost to be minimized on example pair
                     $(x, y)$. Let <<FORMULA>> be the training set. Let <<FORMULA>> be a convex
                     regularization functional (written $\Omega$ here) that penalizes for the choice of more “complex” parameters (e.g.,
                     <<FORMULA>> according to a 1-norm in $\mathcal{W}$, if $\mathcal{H}$ is countable). We define the convex NN
                     criterion $C(\mathcal{H}, Q, \Omega, D, w)$ with parameter $w$ as follows:

                                  <<FORMULA>>          (1)
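A minimal sketch of how a criterion of this kind could be evaluated, assuming it is the regularization penalty plus the sum of the per-example costs (the equation itself is elided above). A finite list of hidden units stands in for the set of all hidden units, and the hinge loss and L1 penalty are example choices; all function names are hypothetical.

```python
def convex_nn_criterion(w, H, data, loss, penalty):
    """Assumed form: penalty(w) + sum over examples of loss(w . h(x), y)."""
    total = penalty(w)
    for x, y in data:
        y_hat = sum(wi * h(x) for wi, h in zip(w, H))
        total += loss(y_hat, y)
    return total

def hinge(y_hat, y):
    return max(0.0, 1.0 - y * y_hat)

def l1_penalty(w, lam=0.1):
    # lam is an illustrative regularization weight.
    return lam * sum(abs(wi) for wi in w)

def step(x):
    return 1.0 if x > 0 else -1.0

def const(x):
    return 1.0

H = [step, const]                      # two fixed hidden units
data = [(-1.0, -1.0), (2.0, 1.0)]      # (x, y) pairs
c = convex_nn_criterion([0.5, 0.0], H, data, hinge, l1_penalty)
```

Because the hidden units are fixed, `w` enters only through a linear map followed by convex functions, which is why the criterion is convex in `w`.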
 
                     The following is a trivial lemma, but it is conceptually very important as it is the basis for the
                     rest of the analysis in this paper.

                     Lemma 2.2. The convex NN cost <<FORMULA>> is a convex function of $w$.
                     Proof. <<FORMULA>> is convex in $w$ and the regularization functional $\Omega$ is convex in $w$, by the above construction. $C$
                     is additive in <<FORMULA>> and additive in $\Omega$. Hence $C$ is convex in $w$.
                     Note that there are no constraints in this convex optimization program, so that at the global
                     minimum all the partial derivatives of $C$ with respect to elements of $w$ vanish.
                     Let $|\mathcal{H}|$ be the cardinality of the set $\mathcal{H}$. If it is not finite, it is not obvious that an optimal
                     solution can be achieved in finitely many iterations.

                     Lemma 2.2 says that training NNs from a very large class (with one or more hidden layers)
                     can be seen as convex optimization problems, usually in a very high dimensional space, as
                     long as we allow the number of hidden units to be selected by the learning algorithm.
                     By choosing a regularizer that promotes sparse solutions, we obtain a solution that has a
                     ﬁnite number of “active” hidden units (non-zero entries in the output weights vector w).
                     This assertion is proven below, in Theorem 3.1, for the case of the hinge loss.
                     However, even if the solution involves a ﬁnite number of active hidden units, the convex
                     optimization problem could still be computationally intractable because of the large number
                     of variables involved. One approach to this problem is to apply the principles already suc-
                     cessfully embedded in Gradient Boosting, but more speciﬁcally in Column Generation (an
                     optimization technique for very large scale linear programs), i.e., add one hidden unit at a
                     time in an incremental fashion. The important ingredient here is a way to know that we
                     have reached the global optimum, without having to actually visit all the possible
                     hidden units. We show that this can be achieved as long as we can solve the sub-problem
                     of ﬁnding a linear classiﬁer that minimizes the weighted sum of classiﬁcation errors. This
                     can be done exactly only on low dimensional data sets but can be well approached using
                     weighted linear SVMs, weighted logistic regression, or Perceptron-type algorithms.
                     Another idea (not followed up here) would be to consider first a smaller set $\mathcal{H}_1$, for which
                     the convex problem can be solved in polynomial time, and whose solution can theoretically
                     be selected as initialization for minimizing the criterion <<FORMULA>>, with <<FORMULA>>,
                     and where $\mathcal{H}_2$ may have infinite cardinality (countable or not). In this way we could show
                     that we can ﬁnd a solution whose cost satisﬁes <<FORMULA>>,
                     i.e., is at least as good as the solution of a more restricted convex optimization problem. The
                     second minimization can be performed with a local descent algorithm, without the necessity
                     to guarantee that the global optimum will be found.

                     3 Finite Number of Hidden Neurons
                     In this section we consider the special case with <<FORMULA>> the hinge loss,
                     and L1 regularization, and we show that the global optimum of the convex cost involves at
                     most $n+1$ hidden neurons, using an approach already exploited in (Rätsch, Demiriz and
                     Bennett, 2002) for L1-loss regression Boosting with L1 regularization of output weights.
                     The training criterion is <<FORMULA>>. Let us rewrite this cost function as the
                     constrained optimization problem:
                     
                                          <<FORMULA>>      (C1)
                  
                                          <<FORMULA>>      (C2)
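For orientation, a common way to write an L1-regularized hinge-loss criterion as a constrained problem uses one slack variable per example. The form below is a sketch under stated assumptions, not a reproduction of the elided program: it assumes the hinge loss is $\max(0, 1 - y\,\hat{y})$, that a constant $K > 0$ multiplies the 1-norm, and the slack variables $\xi_t$ are notation introduced here.

```latex
\begin{aligned}
\min_{w,\,\xi}\quad & K\,\|w\|_1 + \sum_{t=1}^{n} \xi_t \\
\text{s.t.}\quad & \xi_t \ge 1 - y_t\, w \cdot h(x_t), \qquad t = 1,\ldots,n, \\
                 & \xi_t \ge 0, \qquad t = 1,\ldots,n.
\end{aligned}
```

At the optimum each $\xi_t$ equals $\max(0, 1 - y_t\, w \cdot h(x_t))$, so the objective coincides with the penalized sum of hinge losses.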

                     Using a standard technique, the above program can be recast as a linear program. Defining
                     <<FORMULA>> the vector of Lagrangian multipliers for the constraints $C_1$, its dual
                     problem $(P)$ takes the form (in the case of a finite number $J$ of base learners):
                     
                                          <<FORMULA>>
                          
                     In the case of a finite number $J$ of base learners, <<FORMULA>>. If
                     the number of hidden units is uncountable, then $I$ is a closed bounded interval of $\mathbb{R}$.
                     Such an optimization problem satisfies all the conditions needed for using Theorem 4.2
                     from (Hettich and Kortanek, 1993). Indeed:
                     • <<FORMULA>> it is compact (as a closed bounded interval of <<FORMULA>> is a concave function
                       (it is even a linear function);
                     • <<FORMULA>> is convex in the multipliers (it is actually linear in them);
                     • <<FORMULA>> (therefore finite) (the value of $(P)$ is the largest value of $F$ satisfying the constraints);
                     • for every set of $n+1$ points <<FORMULA>>, there exists a multiplier vector such that <<FORMULA>> for
                       <<FORMULA>> (one can take <<FORMULA>> since $K>0$).

                     Then, from Theorem 4.2 from (Hettich and Kortanek, 1993), the following theorem holds:
                     Theorem 3.1. The solution of $(P)$ can be attained with constraints $C'_2$ and only $n+1$ constraints $C'_1$
                     (i.e., there exists a subset of $n+1$ constraints $C'_1$ giving rise to the same maximum
                     as when using the whole set of constraints). Therefore, the primal problem associated is the
                     minimization of the cost function of a NN with $n+1$ hidden neurons.

                     4 Incremental Convex NN Algorithm
                     In this section we present a stepwise algorithm to optimize a NN, and show that there is a criterion
                     that allows us to verify whether the global optimum has been reached. This is a specialization
                     of minimizing <<FORMULA>>, with <<FORMULA>> and where <<FORMULA>>
                     is the set of soft or hard linear classifiers (depending on choice of $s(\cdot)$).

                                        Algorithm ConvexNN( D, Q,  , s)

                                                <<ALGORITHM>>
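The algorithm listing is elided above. The sketch below illustrates the general stepwise idea described in the text (insert one hidden unit at a time, re-optimize all output weights, stop when no remaining unit can lower the criterion) under loudly stated simplifications: a squared loss 0.5*(y^ - y)^2 is swapped in for convenience, the candidate hidden units come from a small finite `pool` instead of an exact weighted-classification search, and output weights may be signed. All names are hypothetical; this is not the paper's ConvexNN procedure.

```python
def soft_threshold(z, lam):
    # Proximal step for the L1 penalty.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_output_weights(cols, ys, lam, w, sweeps=200):
    """Coordinate descent on lam*||w||_1 + 0.5*sum_t (y_hat_t - y_t)^2, in place."""
    n = len(ys)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Residual with unit j's own contribution left out.
            partial = [ys[t] - sum(w[k] * cols[k][t]
                                   for k in range(len(cols)) if k != j)
                       for t in range(n)]
            rho = sum(col[t] * partial[t] for t in range(n))
            w[j] = soft_threshold(rho, lam) / sum(c * c for c in col)
    return w

def incremental_sketch(pool, xs, ys, lam, tol=1e-8):
    active, cols, w = [], [], []
    while True:
        preds = [sum(wj * col[t] for wj, col in zip(w, cols))
                 for t in range(len(xs))]
        grad = [p - y for p, y in zip(preds, ys)]  # derivative of the squared loss
        best, best_corr = None, lam + tol
        for i, h in enumerate(pool):
            if i in active:
                continue
            corr = abs(sum(g * h(x) for g, x in zip(grad, xs)))
            if corr > best_corr:
                best, best_corr = i, corr
        if best is None:
            return active, w  # no inactive unit's gradient exceeds the L1 threshold
        active.append(best)
        cols.append([pool[best](x) for x in xs])
        w.append(0.0)
        fit_output_weights(cols, ys, lam, w)

pool = [lambda x, c=c: 1.0 if x > c else -1.0 for c in (0.5, 1.5, 2.5)]
pool.append(lambda x: 1.0)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1.0, -1.0, 1.0, 1.0]
active, w = incremental_sketch(pool, xs, ys, lam=0.5)
```

The stopping test mirrors the L1 optimality condition: a zero-weight unit can be left out when the absolute gradient of the loss term with respect to its weight does not exceed `lam`.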
                     
                     Theorem 4.1. Algorithm ConvexNN stops when it reaches the global optimum of

                                      <<FORMULA>>.

                     Proof. Let $w$ be the output weights vector when the algorithm stops. Because the set of
                     hidden units $\mathcal{H}$ we consider is such that when $h$ is in $\mathcal{H}$, $-h$ is also in $\mathcal{H}$, we can assume
                     all weights to be non-negative. By contradiction, if $w' \neq w$ is the global optimum, with
                     $C(w') < C(w)$, then, since $C$ is convex in the output weights, for any $\epsilon \in (0,1)$, we have
                     <<FORMULA>>. For
                     $\epsilon$ small enough, we can assume all weights in $w$ that are strictly positive to be also strictly
                     positive in $w_\epsilon$. Let us denote by $I_p$ the set of strictly positive weights in $w$ (and $w_\epsilon$), by
                     $I_z$ the set of weights set to zero in $w$ but to a non-zero value in $w_\epsilon$, and by $\delta_k$ the difference
                     $w_{\epsilon,k} - w_k$ in the weight of hidden unit $h_k$ between $w$ and $w_\epsilon$. We can assume $\delta_j < 0$ for
                     $j \in I_z$, because instead of setting a small positive weight to $h_j$, one can decrease the weight
                     of $-h_j$ by the same amount, which will give either the same cost, or possibly a lower one
                     when the weight of <<FORMULA>> is positive. With $o(\epsilon)$ denoting a quantity such that $\epsilon^{-1} o(\epsilon) \to 0$
                     when $\epsilon \to 0$, the difference $\Delta_\epsilon(w) = C(w_\epsilon) - C(w)$ can now be written:

                                       <<FORMULA>>

                     since for $i \in I_p$, thanks to step (7) of the algorithm, we have $\frac{\partial C}{\partial w_i}(w) = 0$. Thus the
                     inequality <<FORMULA>> rewrites into <<FORMULA>>
                     which, when $\epsilon \to 0$, yields (note that <<FORMULA>> does not depend on $\epsilon$ since $\delta_j$ is linear in $\epsilon$):

                                      <<FORMULA>>             (2)

                     $h_i$ being the optimal classifier chosen in step (5a) or (5c), all hidden units <<FORMULA>> verify <<FORMULA>>

                                       <<FORMULA>>

                     <<FORMULA>> , contradicting eq. 2.

                     (Mason et al., 2000) prove a related global convergence result for the AnyBoost algorithm,
                     a non-parametric Boosting algorithm that is also similar to Gradient Boosting (Friedman,
                     2001). Again, this requires solving as a sub-problem an exact minimization to ﬁnd a function
                     $h_i \in \mathcal{H}$ that is maximally correlated with the gradient $Q'$ on the output. We now show a
                     simple procedure to select a hyperplane with the best weighted classiﬁcation error.
                     Exact Minimization                     
                     In step (5a) we are required to ﬁnd a linear classiﬁer that minimizes the weighted sum of
                     classiﬁcation errors. Unfortunately, this is an NP-hard problem (w.r.t. d, see theorem 4
                     in (Marcotte and Savard, 1992)). However, an exact solution can be easily found in O(n^3)
                     computations for d = 2 inputs.

                     Proposition 4.2. Finding a linear classifier that minimizes the weighted sum of classification
                     errors can be achieved in O(n^3) steps when the input dimension is d = 2.
                     Proof. We want to <<FORMULA>> with respect to u and b, the c's being
                     in <<FORMULA>>. Consider u fixed and sort the x_i's according to their dot product with u and denote r
                     the function which maps i to r(i) such that x_{r(i)} is in i-th position in the sort. Depending on
                     the value of b, we will have n+1 possible sums, respectively <<FORMULA>>,
                     <<FORMULA>>. It is obvious that those sums only depend on the order of the products <<FORMULA>>,
                     <<FORMULA>>. When u varies smoothly on the unit circle, as the dot product is a continuous
                     function of its arguments, the changes in the order of the dot products will occur only when
                     there is a pair (i, j) such that <<FORMULA>>. Therefore, there are at most as many order
                     changes as there are pairs of different points, i.e., <<FORMULA>>. In the case of d = 2, we
                     can enumerate all the different angles for which there is a change, namely a_1, ..., a_z with
                     <<FORMULA>>. We then need to test at least one <<FORMULA>> for each interval a_i <
                     <<FORMULA>>, and also one u for <<FORMULA>>, which makes a total of <<FORMULA>> possibilities. □
                     It is possible to generalize this result in higher dimensions, and as shown in (Marcotte and
                     Savard, 1992), one can achieve O(log(n) n^d) time.

                     Algorithm 1 Optimal linear classiﬁer search
                    
                                       <<ALGORITHM>>
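                     The enumeration argument in the proof of Proposition 4.2 can be sketched in code. The
                     following is a minimal illustration, not a transcription of Algorithm 1: the function name
                     and the choice of labels in {-1, +1} with non-negative costs are assumptions, and because
                     each candidate direction is re-sorted from scratch the sketch runs in O(n^3 log n) rather
                     than the O(n^3) of the proposition. It assumes at least one input point.

```python
import math
from itertools import combinations


def best_linear_classifier_2d(points, labels, costs):
    """Search for (u, b) minimizing the total cost of misclassified 2-D points.

    Prediction is +1 when u.x + b > 0 and -1 otherwise.
    Returns (weighted_error, u, b).
    """
    n = len(points)
    # Directions at which two points have equal projections: the only
    # places where the sorted order of the dot products can change.
    critical = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # identical points never swap order
        theta = math.atan2(dx, -dy) % math.pi  # u is perpendicular to x_i - x_j
        critical.extend((theta, theta + math.pi))
    angles = sorted(set(critical))
    if angles:
        # One test direction strictly inside each interval between changes.
        wrapped = angles[1:] + [angles[0] + 2.0 * math.pi]
        candidates = [(a + b) / 2.0 for a, b in zip(angles, wrapped)]
    else:
        candidates = [0.0]

    best = (math.inf, None, None)
    for theta in candidates:
        u = (math.cos(theta), math.sin(theta))
        proj = [u[0] * x + u[1] * y for x, y in points]
        order = sorted(range(n), key=proj.__getitem__)
        # k = 0: every point is predicted +1, so all negatives are errors.
        err = sum(c for c, lab in zip(costs, labels) if lab < 0)
        for k in range(n + 1):
            if k > 0:
                i = order[k - 1]  # this point now falls on the -1 side
                err += costs[i] if labels[i] > 0 else -costs[i]
            if err < best[0]:
                if k == 0:
                    t = proj[order[0]] - 1.0
                elif k == n:
                    t = proj[order[-1]] + 1.0
                else:
                    t = (proj[order[k - 1]] + proj[order[k]]) / 2.0
                best = (err, u, -t)
    return best
```

                     For a fixed direction, the n+1 thresholds are scanned with a running error total, which
                     mirrors the observation that the sums depend only on the order of the dot products.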

                     Approximate Minimization

                     For data in higher dimensions, the exact minimization scheme to ﬁnd the optimal linear
                     classiﬁer is not practical. Therefore it is interesting to consider approximate schemes for
                     obtaining a linear classiﬁer with weighted costs. Popular schemes for doing so are the linear
                     SVM (i.e., linear classiﬁer with hinge loss), the logistic regression classiﬁer, and variants of
                     the Perceptron algorithm. In that case, step (5c) of the algorithm is not an exact minimization,
                     and one cannot guarantee that the global optimum will be reached. However, it might be
                     reasonable to believe that ﬁnding a linear classiﬁer by minimizing a weighted hinge loss
                     should yield solutions close to the exact minimization. Unfortunately, this is not generally
                     true, as we have found out on a simple toy data set described below. On the other hand,
                     if in step (7) one performs an optimization not only of the output weights w_j (j ≤ i) but
                     also of the corresponding weight vectors v_j, then the algorithm finds a solution close to the
                     global optimum (we could only verify this on 2-D data sets, where the exact solution can be
                     computed easily). It means that at the end of each stage, one first performs a few training
                     iterations of the whole NN (for the hidden units j ≤ i) with an ordinary gradient descent
                     mechanism (we used conjugate gradients but stochastic gradient descent would work too),
                     optimizing the w_j's and the v_j's, and then one fixes the v_j's and obtains the optimal w_j's for
                     these v_j's (using a convex optimization procedure). In our experiments we used a quadratic
                     Q, for which the optimization of the output weights can be done with a neural network, using
                     the outputs of the hidden layer as inputs.

                     Let us consider now a bit more carefully what it means to tune the v_j's in step (7). Indeed,
                     changing the weight vector v_j of a selected hidden neuron to decrease the cost is equivalent
                     to a change in the output weights w's. More precisely, consider the step in which the
                     value of v_j becomes v′_j. This is equivalent to the following operation on the w's, when w_j
                     is the corresponding output weight value: the output weight associated with the value v_j of
                     a hidden neuron is set to 0, and the output weight associated with the value v′_j of a hidden
                     neuron is set to w_j. This corresponds to an exchange between two variables in the convex
                     program. We are justified to take any such step as long as it allows us to decrease the cost
                     C(w). The fact that we are simultaneously making such exchanges on all the hidden units
                     when we tune the v_j's allows us to move faster towards the global optimum.
                     Extension to multiple outputs
                     The multiple outputs case is more involved than the single-output case because it is not
                     enough to check the condition <<FORMULA>>. Consider a new hidden neuron whose output is
                     h_i when the input is x_i. Let us also denote <<FORMULA>> the vector of output weights
                     between the new hidden neuron and the <<FORMULA>> output neurons. The gradient with respect to
                     the j-th of these output weights is <<FORMULA>> with <<FORMULA>> the value of the j-th output
                     neuron with input <<FORMULA>>. This means that if, for a given j, we have <<FORMULA>>, moving
                     the j-th output weight away from 0 can only increase the cost. Therefore, the right quantity
                     to consider is <<FORMULA>>.
                     We must therefore find <<FORMULA>>. As before, this sub-problem is not convex, but it is not
                     as obvious how to approximate it by a convex problem. The stopping criterion becomes: if there is no j
                     such that <<FORMULA>>, then all weights must remain equal to 0 and a global minimum is reached.

                     Experimental Results
                     We performed experiments on the 2-D double moon toy dataset (as used in (Delalleau, Bengio
                     and Le Roux, 2005)), to be able to compare with the exact version of the algorithm. In
                     these experiments, <<FORMULA>>. The set-up is the following:

                     • Select a new linear classifier, either (a) the optimal one or (b) an approximate one using
                       logistic regression.
                     • Optimize the output weights using a convex optimizer.
                     • In case (b), tune both input and output weights by conjugate gradient descent on C and
                       finally re-optimize the output weights using LASSO regression.
                     • Optionally, remove neurons whose output weight has been set to 0.

                     For 100 training examples, the approximate algorithm yielded an average penalized
                     squared error (with a penalty coefficient of 1) of 17.11 (over 10 runs), an average test
                     classification error of 3.68% and an average number of neurons of 5.5. The exact algorithm
                     yielded a penalized squared error of 8.09, an average test classification error of 5.3%, and
                     required 3 hidden neurons. A penalty coefficient of 1 was nearly optimal for the exact
                     algorithm whereas a smaller penalty further improved the test classification error of the
                     approximate algorithm. Besides, when running the approximate algorithm for a long time, it
                     converges to a solution whose quadratic error is extremely close to that of the exact algorithm.

                     5 Conclusion
                     We have shown that training a NN can be seen as a convex optimization problem, and have
                     analyzed an algorithm that can exactly or approximately solve this problem. We have shown
                     that the solution with the hinge loss involved a number of non-zero weights bounded by
                     the number of examples, and much smaller in practice. We have shown that there exists a
                     stopping criterion to verify if the global optimum has been reached, but it involves solving a
                     sub-learning problem involving a linear classiﬁer with weighted errors, which can be computationally 
                     hard if the exact solution is sought, but can be easily implemented for toy data
                     sets (in low dimension), for comparing exact and approximate solutions.
                     The above experimental results are in agreement with our initial conjecture: when there are
                     many hidden units we are much less likely to stall in the optimization procedure, because
                     there are many more ways to descend on the convex cost C(w). They also suggest, based
                     on experiments in which we can compare with the exact sub-problem minimization, that
                     applying Algorithm ConvexNN with an approximate minimization for adding each hidden
                     unit while continuing to tune the previous hidden units tends to lead to fast convergence
                     to the global minimum. What can get us stuck in a “local minimum” (in the traditional sense,
                     i.e., of optimizing w’s and v’s together) is simply the inability to ﬁnd a new hidden unit
                     weight vector that can improve the total cost (ﬁt and regularization term) even if there
                     exists one.

                     Note that as a side-effect of the results presented here, we have a simple way to train neural
                     networks with hard-threshold hidden units, since increasing <<FORMULA>> can be either achieved
                     exactly (at great price) or approximately (e.g. by using a cross-entropy
                     or hinge loss on the corresponding linear classifier).

                     Acknowledgments

                     The authors thank the following for support: NSERC, MITACS, and the Canada Research
                     Chairs. They are also grateful for the feedback and stimulating exchanges with Sam Roweis,
                     Nathan Srebro, and Aaron Courville.

                     References

                     Chvátal, V. (1983). Linear Programming. W.H. Freeman.
                     Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction
                        in semi-supervised learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of
                        AISTATS'2005, pages 96–103.
                     Freund, Y. and Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an
                        application to boosting. Journal of Computer and System Science, 55(1):119–139.
                     Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of
                        Statistics, 29:1180.
                     Hettich, R. and Kortanek, K. (1993). Semi-infinite programming: theory, methods, and applications.
                        SIAM Review, 35(3):380–429.
                     Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für
                        Operations Research (Theory), 36:517–545.
                     Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Boosting algorithms as gradient descent.
                        In Advances in Neural Information Processing Systems 12, pages 512–518.
                     Rätsch, G., Demiriz, A., and Bennett, K. P. (2002). Sparse regression ensembles in infinite and finite
                        hypothesis spaces. Machine Learning.
                     Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating
                        errors. Nature, 323:533–536.
<<END>> <<END>> <<END>>


<<START>> <<START>> <<START>>                  
                  DEEP COMPRESSION: COMPRESSING DEEP NEURAL
                  NETWORKS WITH PRUNING, TRAINED QUANTIZATION
                  AND HUFFMAN CODING


                  Song Han
                  Stanford University, Stanford, CA 94305, USA
                  songhan@stanford.edu

                  Huizi Mao
                  Tsinghua University, Beijing, 100084, China
                  mhz12@mails.tsinghua.edu.cn

                  William J. Dally
                  Stanford University, Stanford, CA 94305, USA
                  NVIDIA, Santa Clara, CA 95050, USA
                  dally@stanford.edu


                                              ABSTRACT

                       Neural networks are both computationally intensive and memory intensive, making
                       them difficult to deploy on embedded systems with limited hardware resources. To
                       address this limitation, we introduce “deep compression”, a three-stage pipeline of
                       pruning, trained quantization and Huffman coding, whose stages work together to reduce
                       the storage requirement of neural networks by 35× to 49× without affecting their
                       accuracy. Our method first prunes the network by learning only the important
                       connections. Next, we quantize the weights to enforce weight sharing; finally, we
                       apply Huffman coding. After the first two steps we retrain the network to fine-tune
                       the remaining connections and the quantized centroids. Pruning reduces the
                       number of connections by 9× to 13×; quantization then reduces the number of
                       bits that represent each connection from 32 to 5. On the ImageNet dataset, our
                       method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB,
                       without loss of accuracy. Our method reduced the size of VGG-16 by 49×, from
                       552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model
                       into on-chip SRAM cache rather than off-chip DRAM memory. Our compression
                       method also facilitates the use of complex neural networks in mobile applications
                       where application size and download bandwidth are constrained. Benchmarked on
                       CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup
                       and 3× to 7× better energy efficiency.


                  1 INTRODUCTION

                 Deep neural networks have evolved into the state-of-the-art technique for computer vision tasks
                 (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014). Though these neural networks are very
                 powerful, the large number of weights consumes considerable storage and memory bandwidth. For
                 example, the AlexNet Caffemodel is over 200MB, and the VGG-16 Caffemodel is over 500MB
                 (BVLC). This makes it difficult to deploy deep neural networks on mobile systems.
                 First, for many mobile-ﬁrst companies such as Baidu and Facebook, various apps are updated via
                 different app stores, and they are very sensitive to the size of the binary ﬁles. For example, App
                 Store has the restriction “apps above 100 MB will not download until you connect to Wi-Fi”. As a
                 result, a feature that increases the binary size by 100MB will receive much more scrutiny than one
                 that increases it by 10MB. Although having deep neural networks running on mobile has many great

                                                 <<FIGURE>>

                 Figure 1: The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning
                 reduces the number of weights by 10×, while quantization further improves the compression rate
                 to between 27× and 31×. Huffman coding gives more compression: between 35× and 49×. The
                 compression rate already includes the meta-data for the sparse representation. The compression
                 scheme doesn’t incur any accuracy loss.

                 features such as better privacy, less network bandwidth and real time processing, the large storage
                 overhead prevents deep neural networks from being incorporated into mobile apps.
                 The second issue is energy consumption. Running large neural networks requires a lot of memory
                 bandwidth to fetch the weights and a lot of computation to do dot products, which in turn consumes
                 considerable energy. Mobile devices are battery constrained, making power-hungry applications such
                 as deep neural networks hard to deploy.
                 Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32-bit
                 floating point add consumes 0.9pJ, a 32-bit SRAM cache access takes 5pJ, while a 32-bit DRAM
                 memory access takes 640pJ, which is roughly 3 orders of magnitude more than an add operation.
                 Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses.
                 Running a 1 billion connection neural network, for example, at 20fps would require
                 (20Hz)(1G)(640pJ) = 12.8W just for DRAM access, well beyond the power envelope of a typical
                 mobile device.
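                 The DRAM power estimate above can be reproduced with a few lines of arithmetic. This is a
                 minimal sketch; the function name is hypothetical, and it assumes, as the figure in the text
                 implies, one 32-bit DRAM access per connection per frame.

```python
def dram_power_watts(frames_per_second, connections, joules_per_access):
    """Power drawn by DRAM weight fetches, assuming one access per connection per frame."""
    return frames_per_second * connections * joules_per_access


# Figures quoted in the text: 20 fps, 1 billion connections, 640 pJ per DRAM access.
power = dram_power_watts(20, 1e9, 640e-12)
print(f"{power:.1f} W")
```

                 Substituting the 5pJ SRAM access cost for the 640pJ DRAM cost in the same expression shows
                 why fitting the weights on chip matters for the energy budget.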
                 Our goal is to reduce the storage and energy required to run inference on such large networks so they
                 can be deployed on mobile devices. To achieve this goal, we present “deep compression”: a three-
                 stage pipeline (Figure 1) to reduce the storage required by neural networks in a manner that preserves
                 the original accuracy. First, we prune the network by removing the redundant connections, keeping
                 only the most informative connections. Next, the weights are quantized so that multiple connections
                 share the same weight, thus only the codebook (effective weights) and the indices need to be stored.
                 Finally, we apply Huffman coding to take advantage of the biased distribution of effective weights.
                 Our main insight is that pruning and trained quantization are able to compress the network without
                 interfering with each other, thus leading to a surprisingly high compression rate. It makes the
                 required storage so small (a few megabytes) that all weights can be cached on chip instead of going
                 to off-chip DRAM, which is energy consuming. Based on “deep compression”, the EIE hardware
                 accelerator (Han et al., 2016) was later proposed; it works on the compressed model, achieving
                 significant speedup and energy efficiency improvement.

                  2 NETWORK PRUNING

                 Network pruning has been widely studied to compress CNN models. In early work, network pruning
                 proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989;
                 Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently Han et al. (2015) pruned
                 state-of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on
                 the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we
                 prune the small-weight connections: all connections with weights below a threshold are removed
                 from the network. Finally, we retrain the network to learn the final weights for the remaining sparse
                 connections. Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16
                 models, respectively.
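                 The prune-then-retrain step can be illustrated with a small sketch. It is not the Caffe
                 implementation used in the experiments: the function names are hypothetical, "below a
                 threshold" is assumed to refer to the absolute value of a weight, and a plain SGD step
                 stands in for the actual retraining procedure.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out connections whose magnitude is below threshold; return (pruned, mask)."""
    mask = [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights, grads, mask, lr):
    """One retraining step in which pruned connections receive no update."""
    return [[w - lr * g * m for w, g, m in zip(wrow, grow, mrow)]
            for wrow, grow, mrow in zip(weights, grads, mask)]


weights = [[0.5, -0.05], [0.01, -0.8]]
pruned, mask = prune_by_magnitude(weights, threshold=0.1)
grads = [[1.0, 1.0], [1.0, 1.0]]  # placeholder gradients for illustration
retrained = masked_sgd_step(pruned, grads, mask, lr=0.1)
```

                 Keeping the mask alongside the weights is what lets retraining adjust the surviving
                 connections while the removed ones stay at zero.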

                                                <<FIGURE>>

                 Figure 2: Representing the matrix sparsity with relative index. Padding ﬁller zero to prevent overﬂow.

                                                <<FIGURE>>

                     Figure 3: Weight sharing by scalar quantization (top) and centroids ﬁne-tuning (bottom).


                 We store the sparse structure that results from pruning using compressed sparse row (CSR) or
                 compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number
                 of non-zero elements and n is the number of rows or columns.
                 To compress further, we store the index difference instead of the absolute position, and encode this
                 difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger
                 than the bound, we use the zero padding solution shown in Figure 2: when the difference exceeds
                 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
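                 The relative-index scheme with filler zeros can be sketched as follows. The function names
                 and the sample values are hypothetical, the bound of 8 follows the 3-bit example in the
                 text, and measuring the first difference from position 0 is an assumption of this sketch.

```python
def encode_relative(dense, max_span=8):
    """Encode non-zeros as (index difference, value) pairs, padding long gaps with filler zeros."""
    entries = []
    prev = 0  # assumed reference position for the first difference
    for idx, value in enumerate(dense):
        if value == 0:
            continue
        gap = idx - prev
        while gap > max_span:
            entries.append((max_span, 0.0))  # filler zero keeps the difference within the bound
            prev += max_span
            gap -= max_span
        entries.append((gap, value))
        prev = idx
    return entries


def decode_relative(entries, length):
    """Rebuild the dense vector from (index difference, value) pairs."""
    dense = [0.0] * length
    pos = 0
    for gap, value in entries:
        pos += gap
        dense[pos] = value
    return dense


dense = [0.0] * 16
dense[1], dense[4], dense[15] = 3.4, 0.9, 1.7
encoded = encode_relative(dense)
```

                 Here the gap of 11 between positions 4 and 15 exceeds the bound, so one filler entry is
                 emitted and the remaining difference of 3 is stored with the actual value.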

                  3 TRAINED QUANTIZATION AND WEIGHT SHARING

                 Network quantization and weight sharing further compresses the pruned network by reducing the
                 number of bits required to represent each weight. We limit the number of effective weights we need to
                 store by having multiple connections share the same weight, and then ﬁne-tune those shared weights.
                 Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4
                 output neurons, the weight is a 4x4 matrix. On the top left is the 4x4 weight matrix, and on the
                 bottom left is the 4x4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors),
                 all the weights in the same bin share the same value, thus for each weight, we then need to store only
                 a small index into a table of shared weights. During update, all the gradients are grouped by the color
                 and summed together, multiplied by the learning rate and subtracted from the shared centroids from
                 last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each
                 CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
                 To calculate the compression rate, given k clusters, we only need log_2(k) bits to encode the index. In
                 general, for a network with n connections and each connection is represented with b bits, constraining
                 the connections to have only k shared weights will result in a compression rate of:

                                                  <<FORMULA>>                                   (1)

                 For example, Figure 3 shows the weights of a single-layer neural network with four input units and
                 four output units. There are 4×4 = 16 weights originally but there are only 4 shared weights: similar
                 weights are grouped together to share the same value. Originally we need to store 16 weights, each
                 of 32 bits; now we need to store only 4 effective weights (blue, green, red and orange), each of 32
                 bits, together with 16 2-bit indices, giving a compression rate of <<FORMULA>>

                                                    <<FIGURE>>

                 Figure 4: Left: Three different methods for centroids initialization. Right: Distribution of weights
                 (blue) and distribution of codebook before (green cross) and after fine-tuning (red dot).
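                 The compression-rate calculation described in this section can be checked numerically. The
                 sketch below follows the accounting in the text (n indices of log_2(k) bits plus a codebook
                 of k weights of b bits each); the function name is hypothetical, and rounding the index
                 width up to a whole number of bits is an assumption for values of k that are not powers of two.

```python
import math


def compression_rate(n, b, k):
    """Original size divided by size after weight sharing.

    n: number of connections, b: bits per original weight, k: number of shared weights.
    """
    index_bits = math.ceil(math.log2(k))  # bits per stored index (rounded up, an assumption)
    original_bits = n * b
    shared_bits = n * index_bits + k * b  # indices plus codebook
    return original_bits / shared_bits


# The 4x4 layer from the example: 16 weights of 32 bits, 4 shared weights.
rate = compression_rate(n=16, b=32, k=4)
print(rate)
```

                 For this layer the codebook (4 × 32 bits) outweighs the indices (16 × 2 bits), so the rate is
                 modest; with many more connections than clusters the index term dominates instead.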

                  3.1 WEIGHT SHARING

                 We use k-means clustering to identify the shared weights for each layer of a trained network, so that
                 all the weights that fall into the same cluster will share the same weight. Weights are not shared across
                 layers. We partition n original weights <<FORMULA>> into k clusters <<FORMULA>>,
                 n ≫ k, so as to minimize the within-cluster sum of squares (WCSS):

                                               <<FORMULA>>                      (2)

                 Different from HashNet (Chen et al., 2015), where weight sharing is determined by a hash function
                 before the network sees any training data, our method determines weight sharing after a network is
                 fully trained, so that the shared weights approximate the original network.
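                 One-dimensional k-means over the weights of a single layer can be sketched as below. This
                 is plain Lloyd's iteration written for illustration, with hypothetical function names and
                 a fixed iteration count; it is not the clustering code used in the experiments.

```python
def assign_clusters(weights, centroids):
    """Index of the nearest centroid for each weight."""
    return [min(range(len(centroids)), key=lambda c: (w - centroids[c]) ** 2)
            for w in weights]


def kmeans_1d(weights, init_centroids, iters=20):
    """Lloyd's algorithm on scalars; returns (centroids, assignments)."""
    centroids = list(init_centroids)
    for _ in range(iters):
        assignments = assign_clusters(weights, centroids)
        for c in range(len(centroids)):
            members = [w for w, a in zip(weights, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, assign_clusters(weights, centroids)


def wcss(weights, centroids, assignments):
    """Within-cluster sum of squares, the objective being minimized."""
    return sum((w - centroids[a]) ** 2 for w, a in zip(weights, assignments))


layer_weights = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
centroids, assignments = kmeans_1d(layer_weights, init_centroids=[-0.5, 0.5])
```

                 After clustering, each weight is replaced by its centroid, so only the assignments and the
                 k centroid values need to be stored for the layer.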

                  3.2 INITIALIZATION OF SHARED WEIGHTS

                 Centroid initialization impacts the quality of clustering and thus affects the network’s prediction
                 accuracy. We examine three initialization methods: Forgy (random), density-based, and linear
                 initialization. In Figure 4 we plotted the original weights’ distribution of the conv3 layer in AlexNet
                 (CDF in blue, PDF in red). The weights form a bimodal distribution after network pruning. On the
                 bottom it plots the effective weights (centroids) with 3 different initialization methods (shown in blue,
                 red and yellow). In this example, there are 13 clusters.
                 Forgy (random) initialization randomly chooses k observations from the data set and uses these as
                 the initial centroids. The initialized centroids are shown in yellow. Since there are two peaks in the
                 bimodal distribution, the Forgy method tends to concentrate around those two peaks.
                 Density-based initialization linearly spaces the CDF of the weights in the y-axis, then finds the
                 horizontal intersection with the CDF, and finally finds the vertical intersection on the x-axis, which
                 becomes a centroid, as shown in blue dots. This method makes the centroids denser around the two
                 peaks, but more scattered than the Forgy method.
                 Linear initialization linearly spaces the centroids between the [min, max] of the original weights.
                 This initialization method is invariant to the distribution of the weights and is the most scattered
                 compared with the former two methods.
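                 The three initialization methods can be sketched side by side. The function names are
                 hypothetical, and the density-based variant assumes the CDF is sampled at the midpoints of
                 k equal-probability bins, a detail the text does not specify.

```python
import random


def forgy_init(weights, k, seed=0):
    """Pick k observed weights at random as the initial centroids."""
    return random.Random(seed).sample(weights, k)


def density_init(weights, k):
    """Space the CDF evenly on the y-axis and read off the matching weights."""
    ordered = sorted(weights)
    n = len(ordered)
    # Bin-midpoint quantiles are an assumption of this sketch.
    return [ordered[min(n - 1, int((i + 0.5) / k * n))] for i in range(k)]


def linear_init(weights, k):
    """Space the centroids evenly between the minimum and maximum weight."""
    lo, hi = min(weights), max(weights)
    if k == 1:
        return [(lo + hi) / 2.0]
    return [lo + i * (hi - lo) / (k - 1) for i in range(k)]


sample_weights = [-1.0, -0.9, -0.8, 0.0, 0.8, 0.9, 1.0, 1.1]
forgy = forgy_init(sample_weights, 4)
dense_centroids = density_init(sample_weights, 4)
linear = linear_init([-2.0, 0.0, 2.0], 5)
```

                 Only the linear variant is guaranteed to place centroids at the extremes of the weight
                 range, regardless of how few weights lie there.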
                 Larger weights play a more important role than smaller weights (Han et al., 2015), but there are fewer
                 of these large weights. Thus for both Forgy initialization and density-based initialization, very few
                 centroids have large absolute value, which results in poor representation of these few large weights.
                 Linear initialization does not suffer from this problem. The experiment section compares the accuracy
                 of different initialization methods after clustering and fine-tuning, showing that linear initialization
                 works best.

                                                  <<FIGURE>>

                      Figure 5: Distribution for weight (Left) and index (Right). The distribution is biased.

                  3.3 FEED-FORWARD AND BACK-PROPAGATION

                 The centroids of the one-dimensional k-means clustering are the shared weights. There is one level
                 of indirection during feed forward phase and back-propagation phase looking up the weight table.
                 An index into the shared weight table is stored for each connection. During back-propagation, the
                 gradient for each shared weight is calculated and used to update the shared weight. This procedure is
                 shown in Figure 3.
                 We denote the loss by L, the weight in the ith column and jth row by W_ij, the centroid index of
                 element W_ij by I_ij, and the kth centroid of the layer by C_k. By using the indicator function 1(·), the
                 gradient of the centroids is calculated as:

                                                   <<FORMULA>>               (3)
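                 The grouping of gradients by centroid index can be written out directly. This is a minimal
                 sketch with hypothetical function names: the gradient of each centroid is the sum of the
                 gradients of all weights assigned to it, and the centroid is then moved by a plain gradient
                 step scaled by the learning rate, as described in Section 3.

```python
def centroid_gradients(weight_grads, indices, k):
    """Sum the per-weight gradients over all weights that share each centroid."""
    grads = [0.0] * k
    for grad_row, index_row in zip(weight_grads, indices):
        for g, c in zip(grad_row, index_row):
            grads[c] += g  # the indicator selects the centroid with index I_ij
    return grads


def update_centroids(centroids, weight_grads, indices, lr):
    """Subtract the learning-rate-scaled summed gradient from each shared weight."""
    grads = centroid_gradients(weight_grads, indices, len(centroids))
    return [c - lr * g for c, g in zip(centroids, grads)]


weight_grads = [[1.0, 2.0], [3.0, 4.0]]  # dL/dW_ij for a 2x2 layer (illustrative values)
indices = [[0, 1], [1, 0]]               # I_ij: centroid index of each weight
summed = centroid_gradients(weight_grads, indices, k=2)
new_centroids = update_centroids([1.0, -1.0], weight_grads, indices, lr=0.5)
```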
 
                  4 HUFFMAN CODING

                 A Huffman code is an optimal prefix code commonly used for lossless data compression (Van Leeuwen,
                 1976). It uses variable-length codewords to encode source symbols. The table is derived from the
                 occurrence probability for each symbol. More common symbols are represented with fewer bits.
                 Figure 5 shows the probability distribution of quantized weights and the sparse matrix index of the
                 last fully connected layer in AlexNet. Both distributions are biased: most of the quantized weights are
                 distributed around the two peaks; the sparse matrix index difference is rarely above 20. Experiments
                 show that Huffman coding these non-uniformly distributed values saves 20% to 30% of network
                 storage.
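                 The effect of a biased symbol distribution on code length can be seen in a small sketch
                 built on Python's heapq module. The function names and the toy symbol stream are
                 hypothetical; the sketch computes only codeword lengths, which is enough to compare storage
                 against a fixed-width encoding.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Map each distinct symbol to the length in bits of its Huffman codeword."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in d1.items()}
        merged.update({s: depth + 1 for s, depth in d2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


def total_bits(symbols):
    """Total encoded size in bits under the Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


stream = "aaaaabbcd"  # a biased stream of 4 distinct quantization indices
lengths = huffman_code_lengths(stream)
huffman_size = total_bits(stream)
fixed_size = 2 * len(stream)  # 2 bits per symbol for 4 distinct symbols
```

                 The most frequent symbol receives a 1-bit codeword, so the stream shrinks relative to the
                 fixed 2-bit encoding; the more biased the distribution, the larger the saving.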

                  5 EXPERIMENTS

                 We pruned, quantized, and Huffman encoded four networks: two on MNIST and two on ImageNet
                 data-sets. The network parameters and accuracy¹ before and after pruning are shown in Table 1. The
                 compression pipeline reduces network storage by 35× to 49× across different networks without loss
                 of accuracy. The total size of AlexNet decreased from 240MB to 6.9MB, which is small enough to
                 be put into on-chip SRAM, eliminating the need to store the model in energy-consuming DRAM
                 memory.

                 Training is performed with the Caffe framework (Jia et al., 2014). Pruning is implemented by adding
                 a mask to the blobs to mask out the update of the pruned connections. Quantization and weight
                 sharing are implemented by maintaining a codebook structure that stores the shared weight, and
                 group-by-index after calculating the gradient of each layer. Each shared weight is updated with all
                 the gradients that fall into that bucket. Huffman coding doesn’t require training and is implemented
                 ofﬂine after all the ﬁne-tuning is ﬁnished.

                  5.1 LENET-300-100 AND LENET-5 ON MNIST

                 We first experimented on the MNIST dataset with the LeNet-300-100 and LeNet-5 networks (LeCun et al.,
                 1998). LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100
                 neurons each, which achieves a 1.6% error rate on MNIST. LeNet-5 is a convolutional network that
                 has two convolutional layers and two fully connected layers, which achieves a 0.8% error rate on
                 MNIST. Tables 2 and 3 show the statistics of the compression pipeline. The compression rate
                 includes the overhead of the codebook and sparse indexes. Most of the saving comes from pruning
                 and quantization (compressed 32×), while Huffman coding gives a marginal gain (compressed 40×).

                    ¹ Reference model is from the Caffe model zoo; accuracy is measured without data augmentation.

                 Table 1: The compression pipeline can save 35× to 49× parameter storage with no loss of accuracy.

                                                <<TABLE>>

                 Table 2: Compression statistics for LeNet-300-100. P: pruning, Q: quantization, H: Huffman coding.

                                                <<TABLE>>

                 Table 3: Compression statistics for LeNet-5. P: pruning, Q: quantization, H: Huffman coding.

                                                <<TABLE>>
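
                 To illustrate why the Huffman stage yields a further gain after quantization, the sketch below
                 builds a Huffman code over a stream of cluster indices and compares its size with fixed-width
                 coding. The index stream is hypothetical and deliberately skewed; the helper name is an assumption
                 and this is not the authors' offline tool.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from
    the empirical frequencies of `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical, non-uniform stream of 2-bit cluster indices.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
```

                 For this toy stream the variable-length code needs 28 bits instead of 32. The gain depends
                 entirely on how non-uniform the index distribution is.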

                  5.2 ALEXNET ON IMAGENET

                 We further examine the performance of Deep Compression on the ImageNet ILSVRC-2012 dataset,
                 which has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as
                 the reference model, which has 61 million parameters and achieves a top-1 accuracy of 57.2% and a
                 top-5 accuracy of 80.3%. Table 4 shows that AlexNet can be compressed to 2.88% of its original size
                 without impacting accuracy. There are 256 shared weights in each CONV layer, which are encoded
                 with 8 bits, and 32 shared weights in each FC layer, which are encoded with only 5 bits. The relative
                 sparse index is encoded with 4 bits. Huffman coding compresses an additional 22%, resulting in 35×
                 compression in total.
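
                 The bit widths and the overall ratio quoted above follow from simple arithmetic, sketched here as a
                 check. The helper name is an assumption; the sizes are the AlexNet figures quoted in the text.

```python
def index_bits(num_shared_weights):
    """Bits needed to address one entry of a codebook of the given size."""
    return max(1, (num_shared_weights - 1).bit_length())


conv_bits = index_bits(256)  # 256 shared weights per CONV layer
fc_bits = index_bits(32)     # 32 shared weights per FC layer

original_mb, compressed_mb = 240.0, 6.9  # AlexNet sizes quoted in the text
ratio = original_mb / compressed_mb
fraction_pct = 100.0 * compressed_mb / original_mb
```

                 The ratio comes out at roughly 35, and the compressed model is a little under 2.9% of the original
                 size, which matches the figure reported for Table 4 up to rounding.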

                  5.3 VGG-16 ON IMAGENET

                 With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16
                 (Simonyan & Zisserman, 2014), on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional
                 layers but still only three fully-connected layers. Following a similar methodology, we aggressively
                 compressed both convolutional and fully-connected layers to realize a significant reduction in the
                 number of effective weights, shown in Table 5.

                 The VGG-16 network as a whole has been compressed by 49×. Weights in the CONV layers are
                 represented with 8 bits, and FC layers use 5 bits, which does not impact the accuracy. The two largest
                 fully-connected layers can each be pruned to less than 1.6% of their original size. This reduction
                 is critical for real-time image processing, where there is little reuse of these layers across images
                 (unlike batch processing). It is also critical for fast object detection algorithms where one CONV
                 pass is used by many FC passes. The reduced layers fit in on-chip SRAM and have modest
                 bandwidth requirements. Without the reduction, the bandwidth requirements are prohibitive.

                    Table 4: Compression statistics for AlexNet. P: pruning, Q: quantization, H: Huffman coding.

                                                   <<TABLE>>

                    Table 5: Compression statistics for VGG-16. P: pruning, Q: quantization, H: Huffman coding.

                                                    <<TABLE>>

                  6 DISCUSSIONS

                  6.1 PRUNING AND QUANTIZATION WORKING TOGETHER

                 Figure 6 shows the accuracy at different compression rates for pruning and quantization, together
                 and individually. When applied individually, as shown by the purple and yellow lines, the accuracy
                 of the pruned network begins to drop significantly when it is compressed below 8% of its original
                 size; the accuracy of the quantized network also begins to drop significantly when compressed below
                 8% of its original size. But when combined, as shown by the red line, the network can be compressed
                 to 3% of its original size with no loss of accuracy. The far right side shows the result of SVD for
                 comparison, which is inexpensive but has a poor compression rate.

                 The three plots in Figure 7 show how accuracy drops with fewer bits per connection for CONV layers
                 (left), FC layers (middle) and all layers (right). Each plot reports both top-1 and top-5 accuracy.
                 Dashed lines apply quantization without pruning; solid lines apply both quantization and
                 pruning. There is very little difference between the two, which shows that pruning works well with
                 quantization.

                 Quantization works well on a pruned network because unpruned AlexNet has 60 million weights to
                 quantize, while pruned AlexNet has only 6.7 million. Given the same number of centroids, the
                 latter has less error.

                                              <<FIGURE>>

                 Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and
                 quantization work best when combined.

                                              <<FIGURE>>

                 Figure 7: Pruning doesn’t hurt quantization. Dashed: quantization on unpruned network. Solid:
                 quantization on pruned network. Accuracy begins to drop at the same number of quantization bits
                 whether or not the network has been pruned. Although pruning reduces the number of parameters,
                 quantization still works as well as on the unpruned network, or even better (the 3-bit case in the
                 left figure).

                                                <<FIGURE>>

                 Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy.
                 Linear initialization gives the best result.

                 The first two plots in Figure 7 show that CONV layers require more bits of precision than FC layers.
                 For CONV layers, accuracy drops significantly below 4 bits, while the FC layers are more robust:
                 accuracy does not drop significantly until 2 bits.


                  6.2 CENTROID INITIALIZATION

                 Figure 8 compares the accuracy of the three different initialization methods with respect to top-1
                 accuracy (left) and top-5 accuracy (right). The network is quantized to 2 to 8 bits, as shown on the
                 x-axis. Linear initialization outperforms density-based and random initialization in all
                 cases except at 3 bits.

                 The initial centroids of linear initialization are spread equally across the x-axis, from the min
                 value to the max value. This helps to maintain the large weights, which play a more important role
                 than smaller ones, as also shown in network pruning (Han et al., 2015). Neither random nor
                 density-based initialization retains large centroids. With these initialization methods, large weights
                 are clustered to the small centroids because there are few large weights. In contrast, linear
                 initialization gives large weights a better chance to form a large centroid.
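
                 The contrast between linear and density-based initialization can be seen on a toy weight list. In
                 the sketch below, linear initialization follows the description above; the density-based variant
                 is one simple reading (centroids at equal-population quantiles) and is an assumption, as are the
                 function names and the weight values.

```python
def linear_init(weights, bits):
    """Centroids equally spaced between the smallest and largest weight."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def density_init(weights, bits):
    """Centroids at the midpoints of k equal-population slices of the sorted
    weights (an assumed, simplified form of density-based initialization)."""
    k = 2 ** bits
    ordered = sorted(weights)
    n = len(ordered)
    return [ordered[min(n - 1, int((i + 0.5) * n / k))] for i in range(k)]


# Hypothetical weights: many small values and two large ones.
weights = [-0.02, -0.01, 0.0, 0.01, 0.01, 0.02, 0.03, -0.03, 0.9, -0.8]
lin = linear_init(weights, bits=2)
den = density_init(weights, bits=2)
```

                 On this input the linear centroids reach out to -0.8 and 0.9, while the density-based centroids all
                 stay within 0.03 of zero, so the two large weights would have to be assigned to small centroids.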

                                            <<FIGURE>>

                 Figure 9: Compared with the original network, the pruned network layer achieves a 3× speedup on CPU,
                 3.5× on GPU and 4.2× on mobile GPU on average. Batch size = 1, targeting real-time processing.
                 Performance numbers are normalized to CPU.

                                            <<FIGURE>>

                 Figure 10: Compared with the original network, the pruned network layer takes 7× less energy on CPU,
                 3.3× less on GPU and 4.2× less on mobile GPU on average. Batch size = 1, targeting real-time
                 processing. Energy numbers are normalized to CPU.

                  6.3 SPEEDUP AND ENERGY EFFICIENCY

                 Deep Compression targets extremely latency-focused applications running on mobile devices, which
                 require real-time inference, such as pedestrian detection on an embedded processor inside an
                 autonomous vehicle. Waiting for a batch to assemble adds significant latency. So when benchmarking
                 performance and energy efficiency, we consider the case when batch size = 1. The batched cases
                 are given in Appendix A.

                 Fully connected layers dominate the model size (more than 90%) and are compressed the most by
                 Deep Compression (96% of weights pruned in VGG-16). In state-of-the-art object detection algorithms
                 such as Fast R-CNN (Girshick, 2015), up to 38% of computation time is spent in FC layers on the
                 uncompressed model. It is therefore interesting to benchmark FC layers to see the effect of Deep
                 Compression on performance and energy. We thus set up our benchmark on the FC6, FC7 and FC8 layers
                 of AlexNet and VGG-16. In the non-batched case, the activation matrix is a vector with just one
                 column, so the computation boils down to dense / sparse matrix-vector multiplication for the
                 original / pruned model, respectively. Since the current BLAS libraries on CPU and GPU do not support
                 indirect look-up and relative indexing, we did not benchmark the quantized model.

                 We compare three different kinds of off-the-shelf hardware: the NVIDIA GeForce GTX Titan X and the
                 Intel Core i7 5930K as desktop processors (the same package as the NVIDIA Digits Dev Box) and the
                 NVIDIA Tegra K1 as a mobile processor. To run the benchmark on GPU, we used cuBLAS GEMV for the
                 original dense layer. For the pruned sparse layer, we stored the sparse matrix in CSR format and used
                 the cuSPARSE CSRMV kernel, which is optimized for sparse matrix-vector multiplication on GPU. To
                 run the benchmark on CPU, we used MKL CBLAS GEMV for the original dense model and MKL
                 SPBLAS CSRMV for the pruned sparse model.
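
                 For readers unfamiliar with the CSR layout, the following plain-Python sketch shows what a CSR
                 matrix-vector product computes and why it touches only the surviving weights. It is meant to
                 illustrate the data layout only; it is not the cuSPARSE or MKL kernel, and the function names and
                 the small matrix are assumptions.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x, visiting only the stored non-zero entries of A."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def dense_matvec(matrix, x):
    """Reference dense product, visiting every entry including zeros."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


# Hypothetical pruned layer: 3 of 12 weights survive.
pruned = [[0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, -1.0],
          [3.0, 0.0, 0.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]
values, col_idx, row_ptr = dense_to_csr(pruned)
y_sparse = csr_matvec(values, col_idx, row_ptr, x)
y_dense = dense_matvec(pruned, x)
```

                 Both products give the same result, but the CSR version performs three multiply-accumulates instead
                 of twelve and reads correspondingly less weight data.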

                 To compare power consumption between different systems, it is important to measure power in a
                 consistent manner (NVIDIA, b). For our analysis, we compare the pre-regulation power of the
                 entire application processor (AP) / SOC and DRAM combined. On CPU, the benchmark runs on a
                 single socket with a single Haswell-E class Core i7-5930K processor. CPU socket and DRAM power
                 are as reported by the pcm-power utility provided by Intel. For GPU, we used the nvidia-smi
                 utility to report the power of the Titan X. For mobile GPU, we used a Jetson TK1 development board
                 and measured the total power consumption with a power meter. We assume 15% AC to DC conversion
                 loss, 85% regulator efficiency and 15% power consumed by peripheral components (NVIDIA, a) to
                 report the AP+DRAM power for Tegra K1.
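
                 One plausible way to apply the three Tegra K1 assumptions is as a chain of multiplicative factors.
                 The text does not spell out the exact formula, so the sketch below is an interpretation; the
                 function name and the 10 W meter reading are hypothetical.

```python
def ap_dram_power(wall_power_w, ac_dc_loss=0.15, regulator_eff=0.85,
                  peripheral_share=0.15):
    """Estimate AP+DRAM power from a wall-meter reading, assuming the three
    factors apply one after another (an interpretation, not a stated formula)."""
    dc_power = wall_power_w * (1.0 - ac_dc_loss)     # after AC-to-DC conversion
    regulated = dc_power * regulator_eff             # after voltage regulation
    return regulated * (1.0 - peripheral_share)      # minus peripheral components


estimate_w = ap_dram_power(10.0)  # hypothetical 10 W reading from the power meter
```

                 Under this reading, about 61% of the measured board power is attributed to the AP and DRAM.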

                 Table 6: Accuracy of AlexNet with different aggressiveness of weight sharing and quantization. 8/5
                 bit quantization has no loss of accuracy; 8/4 bit quantization, which is more hardware friendly, has
                 a negligible accuracy loss of 0.01%; the most aggressive setting, 4/2 bit quantization, results in
                 1.99% and 2.60% loss of accuracy.

                                                <<TABLE>>

                 The ratio of memory access to computation differs with and without batching. When the input
                 activations are batched into a matrix, the computation becomes matrix-matrix multiplication,
                 where locality can be improved by blocking. The matrix can be blocked to fit in caches and
                 reused efficiently. In this case, the amount of memory access is O(n²) and that of computation is
                 O(n³), so the ratio between memory access and computation is on the order of 1/n.

                 In real-time processing, when batching is not allowed, the input activation is a single vector and
                 the computation is matrix-vector multiplication. In this case, the amount of memory access is O(n²)
                 and the computation is O(n²), so memory access and computation are of the same magnitude (as
                 opposed to 1/n). This indicates that MV is more memory-bound than MM, so reducing the memory
                 footprint is critical for the non-batching case.
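
                 The two ratios can be checked with a small counting model. The sketch assumes an n × n weight
                 matrix, counts one memory access per weight (ideal blocking, so each weight is fetched once) and one
                 multiply-accumulate per weight per input vector; the function name is an assumption.

```python
def access_to_compute_ratio(n, batch):
    """Weight memory accesses divided by multiply-accumulates for an n x n
    weight matrix applied to `batch` input vectors, under ideal reuse."""
    memory_accesses = n * n       # each weight is read once
    macs = n * n * batch          # each weight is used once per input vector
    return memory_accesses / macs


mv_ratio = access_to_compute_ratio(1024, batch=1)     # matrix-vector
mm_ratio = access_to_compute_ratio(1024, batch=1024)  # matrix-matrix
```

                 With a single input vector the ratio is 1, and with n input vectors it falls to 1/n, which is why
                 the unbatched case is bound by memory traffic.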

                 Figure 9 illustrates the speedup of pruning on different hardware. There are 6 columns for each
                 benchmark, showing the computation time of CPU / GPU / TK1 on the dense / pruned network. Time is
                 normalized to CPU. When batch size = 1, the pruned network layer obtains a 3× to 4× speedup over the
                 dense network on average because it has a smaller memory footprint and alleviates the data transfer
                 overhead, especially for large matrices that cannot fit into the caches. For example, VGG-16’s
                 FC6 layer, the largest layer in our experiment, contains 400MB of data, which far exceeds the
                 capacity of the L3 cache.
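
                 The 400MB figure can be reproduced with a short calculation, assuming the standard VGG-16 FC6 shape
                 of 25088 inputs by 4096 outputs and 32-bit floating point weights. The shape is not stated in this
                 section and is supplied here as an assumption.

```python
fc6_inputs = 25088      # assumed: 7 x 7 x 512 feature map flattened
fc6_outputs = 4096      # assumed FC6 width
bytes_per_weight = 4    # 32-bit floating point

fc6_bytes = fc6_inputs * fc6_outputs * bytes_per_weight
fc6_mib = fc6_bytes / 2 ** 20   # mebibytes
fc6_mb = fc6_bytes / 10 ** 6    # decimal megabytes
```

                 This gives 392 MiB, or about 411 MB, consistent with the rounded 400MB quoted above.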

                 In latency-tolerant applications, batching improves memory locality, since weights can be
                 blocked and reused in matrix-matrix multiplication. In this scenario, the pruned network no longer
                 shows its advantage. We give detailed timing results in Appendix A.

                 Figure 10 illustrates the energy efficiency of pruning on different hardware. We multiply power
                 consumption by computation time to get energy consumption, then normalize to CPU to get
                 energy efficiency. When batch size = 1, the pruned network layer consumes 3× to 7× less energy than
                 the dense network on average. As reported by nvidia-smi, GPU utilization is 99% for both dense
                 and sparse cases.
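
                 The energy calculation described above is a product followed by a normalization. The sketch below
                 shows the steps; the power and time values are made up for illustration and are not measurements
                 from the paper, and the function name is an assumption.

```python
def normalized_energy(measurements, baseline):
    """measurements maps a configuration name to (power in watts, time in
    seconds). Returns energy per configuration relative to the baseline."""
    energy = {name: power * time for name, (power, time) in measurements.items()}
    base = energy[baseline]
    return {name: e / base for name, e in energy.items()}


# Hypothetical numbers, for illustration only.
runs = {
    "cpu_dense": (100.0, 4.0),
    "cpu_pruned": (50.0, 2.0),
    "gpu_dense": (200.0, 0.5),
}
relative = normalized_energy(runs, baseline="cpu_dense")
```

                 A relative energy of 0.25 corresponds to a configuration that is four times more energy efficient
                 than the baseline.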

                  6.4 RATIO OF WEIGHTS, INDEX AND CODEBOOK

                 Pruning makes the weight matrix sparse, so extra space is needed to store the indexes of non-zero
                 elements. Quantization adds storage for a codebook. The results in the experiment section already
                 include these two factors. Figure 11 shows the breakdown of the three components when quantizing
                 four networks. Since on average both the weights and the sparse indexes are encoded with 5 bits,
                 their storage is roughly half and half. The overhead of the codebook is very small and often
                 negligible.
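
                 The half-and-half split follows directly from the bit counts. The sketch below computes the three
                 shares for a hypothetical layer; the layer size, the 32-bit codebook entries and the function name
                 are assumptions chosen for illustration.

```python
def storage_breakdown(nnz, weight_bits, index_bits, codebook_entries,
                      codebook_entry_bits=32):
    """Fraction of total storage used by weight codes, sparse indexes and
    the codebook for a layer with `nnz` surviving weights."""
    weights = nnz * weight_bits
    index = nnz * index_bits
    codebook = codebook_entries * codebook_entry_bits
    total = weights + index + codebook
    return {
        "weights": weights / total,
        "index": index / total,
        "codebook": codebook / total,
    }


# Hypothetical layer: one million surviving weights, 5-bit codes and indexes.
shares = storage_breakdown(nnz=1_000_000, weight_bits=5, index_bits=5,
                           codebook_entries=32)
```

                 With these inputs the weights and indexes each take just under 50% of the storage and the codebook
                 takes about 0.01%.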

                                                <<FIGURE>>

                                Figure 11: Storage ratio of weight, index and codebook.

                 Table 7: Comparison with other compression methods on AlexNet. Collins & Kohli (2014) reduced
                 the parameters by 4× with inferior accuracy. Deep Fried Convnets (Yang et al., 2014) worked
                 on fully connected layers and reduced the parameters by less than 4×. SVD saves parameters but
                 suffers from a large accuracy loss, as much as 2%. Network pruning (Han et al., 2015) reduced the
                 parameters by 9×, not including index overhead. On other networks similar to AlexNet, Denton
                 et al. (2014) exploited the linear structure of convnets and compressed the network by 2.4× to 13.4×
                 layer-wise, with 0.9% accuracy loss when compressing a single layer. Gong et al. (2014) experimented
                 with vector quantization and compressed the network by 16× to 24×, incurring 1% accuracy loss.

                                                     <<TABLE>>

                  7 RELATED WORK

                 Neural networks are typically over-parametrized, and there is significant redundancy in deep learning
                 models (Denil et al., 2013). This results in a waste of both computation and memory. There
                 have been various proposals to remove the redundancy: Vanhoucke et al. (2011) explored a fixed-
                 point implementation with 8-bit integer (vs 32-bit floating point) activations. Hwang & Sung
                 (2014) proposed an optimization method for the fixed-point network with ternary weights and 3-bit
                 activations. Anwar et al. (2015) quantized the neural network using L2 error minimization and
                 achieved better accuracy on the MNIST and CIFAR-10 datasets. Denton et al. (2014) exploited the linear
                 structure of the neural network by finding an appropriate low-rank approximation of the parameters
                 and keeping the accuracy within 1% of the original model.
                 The empirical success in this paper is consistent with the theoretical study of random-like sparse
                 networks with +1/0/-1 weights (Arora et al., 2014), which have been proved to enjoy nice properties
                 (e.g. reversibility), and to allow a provably polynomial time algorithm for training.
                 Much work has focused on binning the network parameters into buckets, so that only the values in
                 the buckets need to be stored. HashedNets (Chen et al., 2015) reduce model sizes by using a hash
                 function to randomly group connection weights, so that all connections within the same hash bucket
                 share a single parameter value. In their method, the weight binning is pre-determined by the hash
                 function instead of being learned through training, so it does not capture the nature of images. Gong
                 et al. (2014) compressed deep convnets using vector quantization, which resulted in 1% accuracy
                 loss. Both methods studied only the fully connected layers, ignoring the convolutional layers.
                 There have been other attempts to reduce the number of parameters of neural networks by replacing
                 the fully connected layer with global average pooling. The Network in Network architecture (Lin et
                 al., 2013) and GoogLeNet (Szegedy et al., 2014) achieve state-of-the-art results on several benchmarks
                 by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet
                 dataset and applying them to new tasks by only fine-tuning the fully connected layers, is more
                 difficult with this approach. This problem is noted by Szegedy et al. (2014) and motivates them to add
                 a linear layer on top of their networks to enable transfer learning.
                 Network pruning has been used both to reduce network complexity and to reduce over-fitting. An
                 early approach to pruning was biased weight decay (Hanson & Pratt, 1989). Optimal Brain Damage
                 (LeCun et al., 1989) and Optimal Brain Surgeon (Hassibi et al., 1993) prune networks to reduce
                 the number of connections based on the Hessian of the loss function and suggest that such pruning
                 is more accurate than magnitude-based pruning such as weight decay. A recent work (Han et al.,
                 2015) successfully pruned several state-of-the-art large-scale networks and showed that the number of
                 parameters could be reduced by an order of magnitude. There are also attempts to reduce the number
                 of activations for both compression and acceleration (Van Nguyen et al., 2015).

                  8 FUTURE WORK

                 While the pruned network has been benchmarked on various hardware, the quantized network with
                 weight sharing has not, because the off-the-shelf cuSPARSE and MKL SPBLAS libraries do not support
                 indirect matrix entry lookup, nor do they support the relative index in CSC or CSR format. So the full
                 advantage of Deep Compression, fitting the model in cache, is not fully unveiled. A software solution
                 is to write customized GPU kernels that support this. A hardware solution is to build a custom ASIC
                 architecture specialized to traverse the sparse and quantized network structure, which also supports
                 customized quantization bit widths. We expect this architecture to have energy dominated by on-chip
                 SRAM access instead of off-chip DRAM access.
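
                 To make the two missing features concrete, the sketch below shows a sparse matrix-vector product in
                 which every stored entry is a codebook index (indirect lookup) and every column position is stored
                 as a gap from the previous non-zero in the same row (relative index). The per-row gap layout, the
                 function name and the data are simplifying assumptions for illustration; this is not a GPU kernel
                 and not necessarily the storage format used in the paper.

```python
def quantized_sparse_matvec(codebook, codes, gaps, row_ptr, x):
    """y = A @ x for a pruned, weight-shared matrix.

    codes[k]  : codebook index of the k-th stored non-zero
    gaps[k]   : column distance from the previous non-zero in the same row
                (the first gap in a row is counted from column -1)
    row_ptr   : CSR-style row boundaries into codes/gaps
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc, col = 0.0, -1
        for k in range(row_ptr[r], row_ptr[r + 1]):
            col += gaps[k]                       # relative -> absolute column
            acc += codebook[codes[k]] * x[col]   # indirect weight lookup
        y.append(acc)
    return y


codebook = [-1.0, 0.5, 2.0]
# Row 0 has non-zeros at columns 1 and 3; row 1 has one at column 0.
codes = [2, 1, 0]
gaps = [2, 2, 1]
row_ptr = [0, 2, 3]
x = [1.0, 2.0, 3.0, 4.0]
y = quantized_sparse_matvec(codebook, codes, gaps, row_ptr, x)
```

                 Each stored entry costs only a small code and a small gap rather than a 32-bit value and a full
                 column index, at the price of one extra table lookup per multiply.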

                  9 CONCLUSION

                 We have presented “Deep Compression”, which compresses neural networks without affecting accuracy.
                 Our method operates by pruning the unimportant connections, quantizing the network using weight
                 sharing, and then applying Huffman coding. We highlight our experiments on AlexNet, which
                 reduced the weight storage by 35× without loss of accuracy. We show similar results for VGG-16
                 and LeNet networks, compressed by 49× and 39× without loss of accuracy. This leads to a smaller
                 storage requirement for putting convnets into mobile apps. After Deep Compression, these
                 networks fit into on-chip SRAM cache (5pJ/access) rather than requiring off-chip DRAM memory
                 (640pJ/access). This potentially makes deep neural networks more energy efficient to run on mobile.
                 Our compression method also facilitates the use of complex neural networks in mobile applications
                 where application size and download bandwidth are constrained.

                  REFERENCES
                 Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convolutional
                   neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP),
                   2015 IEEE International Conference on, pp. 1131–1135. IEEE, 2015.
                 Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. Provable bounds for learning some
                   deep representations. In Proceedings of the 31st International Conference on Machine Learning,
                   ICML 2014, pp. 584–592, 2014.
                 BVLC. Caffe model zoo. URL http://caffe.berkeleyvision.org/model_zoo.
                 Chen, Wenlin, Wilson, James T., Tyree, Stephen, Weinberger, Kilian Q., and Chen, Yixin. Compressing
                   neural networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.
                 Collins, Maxwell D and Kohli, Pushmeet. Memory bounded deep convolutional networks. arXiv
                   preprint arXiv:1412.1442, 2014.
                 Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep
                   learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
                 Denton, Emily L, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear
                   structure within convolutional networks for efficient evaluation. In Advances in Neural Information
                   Processing Systems, pp. 1269–1277, 2014.
                 Girshick, Ross. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
                 Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional
                   networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
                 Han, Song, Pool, Jeff, Tran, John, and Dally, William J. Learning both weights and connections for
                   efficient neural networks. In Advances in Neural Information Processing Systems, 2015.
                 Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A, and Dally,
                   William J. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint
                   arXiv:1602.01528, 2016.
                 Hanson, Stephen José and Pratt, Lorien Y. Comparing biases for minimal network construction with
                   back-propagation. In Advances in Neural Information Processing Systems, pp. 177–185, 1989.
                 Hassibi, Babak, Stork, David G, et al. Second order derivatives for network pruning: Optimal brain
                   surgeon. Advances in Neural Information Processing Systems, pp. 164–164, 1993.
                 Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using
                   weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6.
                   IEEE, 2014.
                 Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross,
                   Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature
                   embedding. arXiv preprint arXiv:1408.5093, 2014.
                 Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep
                   convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
                 LeCun, Yann, Denker, John S, Solla, Sara A, Howard, Richard E, and Jackel, Lawrence D. Optimal
                   brain damage. In NIPS, volume 89, 1989.
                 LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied
                   to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
                 Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv:1312.4400, 2013.
                 NVIDIA. Technical brief: NVIDIA Jetson TK1 development kit bringing GPU-accelerated computing
                   to embedded systems, a. URL http://www.nvidia.com.
                 NVIDIA. Whitepaper: GPU-based deep learning inference: A performance and power analysis, b.
                   URL http://www.nvidia.com/object/white-papers.html.
                 Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image
                   recognition. arXiv preprint arXiv:1409.1556, 2014.
                 Ström, Nikko. Phoneme probability estimation with dynamic sparsely connected artificial neural
                   networks. The Free Speech Journal, 1(5):1–41, 1997.
                 Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir,
                   Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions.
                   arXiv preprint arXiv:1409.4842, 2014.
                 Van Leeuwen, Jan. On the construction of Huffman trees. In ICALP, pp. 382–410, 1976.
                 Van Nguyen, Hien, Zhou, Kevin, and Vemulapalli, Raviteja. Cross-domain synthesis of medical
                   images using efficient location-sensitive deep network. In Medical Image Computing and Computer-
                   Assisted Intervention–MICCAI 2015, pp. 677–684. Springer, 2015.
                 Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on
                   CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
                 Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and
                   Wang, Ziyu. Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.

                  A APPENDIX: DETAILED TIMING / POWER REPORTS OF DENSE & SPARSE
                     NETWORK LAYERS

                 Table 8: Average time on different layers. To reduce variance, we measured the time spent on each
                 layer for 4096 input samples and averaged the time over the input samples. For GPU, the time
                 consumed by cudaMalloc and cudaMemcpy is not counted. For batch size = 1, gemv is used;
                 for batch size = 64, gemm is used. For the sparse case, csrmv and csrmm are used, respectively.

                                              <<TABLE>>

                 Table 9: Power consumption of different layers. We measured the Titan X GPU power with
                 nvidia-smi, Core i7-5930K CPU power with pcm-power and Tegra K1 mobile GPU power with
                 an external power meter (scaled to AP+DRAM, see paper discussion). During power measurement,
                 we repeated each computation multiple times in order to get stable numbers. On CPU, dense matrix
                 multiplications consume 2× the energy of sparse ones because they are accelerated with
                 multi-threading.

                                              <<TABLE>>
<<END>> <<END>> <<END>>