21524 lines
2.0 MiB
21524 lines
2.0 MiB
<<START>> <<START>> <<START>>
|
||
|
||
Neural Ordinary Differential Equations
|
||
|
||
Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud
|
||
University of Toronto, Vector Institute
|
||
{rtqichen, rubanova, jessebett, duvenaud}@cs.toronto.edu
|
||
|
||
Abstract
|
||
We introduce a new family of deep neural network models. Instead of specifying a
|
||
discrete sequence of hidden layers, we parameterize the derivative of the hidden
|
||
state using a neural network. The output of the network is computed using a black-
|
||
box differential equation solver. These continuous-depth models have constant
|
||
memory cost, adapt their evaluation strategy to each input, and can explicitly trade
|
||
numerical precision for speed. We demonstrate these properties in continuous-depth
|
||
residual networks and continuous-time latent variable models. We also construct
|
||
continuous normalizing flows, a generative model that can train by maximum
|
||
likelihood, without partitioning or ordering the data dimensions. For training, we
|
||
show how to scalably backpropagate through any ODE solver, without access to its
|
||
internal operations. This allows end-to-end training of ODEs within larger models.
|
||
|
||
1 Introduction
|
||
|
||
Models such as residual networks, recurrent neural network decoders, and normalizing flows build
|
||
complicated transformations by composing a sequence of transformations to a hidden state:
|
||
|
||
<<FORMULA>> (1)
|
||
|
||
where t ∈ {0 . . . T } and ht ∈ R . These iterative updates can be seen as an Euler discretization of a
|
||
continuous transformation (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018).
|
||
What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous
|
||
dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:
|
||
Starting from the input layer h(0), we can define the output layer h(T ) to be the solution to this
|
||
|
||
<<FORMULA>> (2)
|
||
|
||
ODE initial value problem at some time T . This value can be computed by a black-box differential
|
||
equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the
|
||
solution with the desired accuracy. Figure 1 contrasts these two approaches.
|
||
Defining and evaluating models using ODE solvers has several benefits:
|
||
Memory efficiency In Section 2, we show how to compute gradients of a scalar-valued loss with
|
||
respect to all inputs of any ODE solver, without backpropagating through the operations of the solver.
|
||
Not storing any intermediate quantities of the forward pass allows us to train our models with constant
|
||
memory cost as a function of depth, a major bottleneck of training deep models.
|
||
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
|
||
Adaptive computation Euler’s method is perhaps the simplest method for solving ODEs. There
|
||
have since been more than 120 years of development of efficient and accurate ODE solvers (Runge,
|
||
1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth
|
||
of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to
|
||
achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with
|
||
problem complexity. After training, accuracy can be reduced for real-time or low-power applications.
|
||
Scalable and invertible normalizing flows An unexpected side-benefit of continuous transforma-
|
||
tions is that the change of variables formula becomes easier to compute. In Section 4, we derive
|
||
this result and use it to construct a new class of invertible density models that avoids the single-unit
|
||
bottleneck of normalizing flows, and can be trained directly by maximum likelihood.
|
||
Continuous time-series models Unlike recurrent neural networks, which require discretizing
|
||
observation and emission intervals, continuously-defined dynamics can naturally incorporate data
|
||
which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.
|
||
2 Reverse-mode automatic differentiation of ODE solutions
|
||
The main technical difficulty in training continuous-depth networks is performing reverse-mode
|
||
differentiation (also known as backpropagation) through the ODE solver. Differentiating through
|
||
the operations of the forward pass is straightforward, but incurs a high memory cost and introduces
|
||
additional numerical error.
|
||
We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity
|
||
method (Pontryagin et al., 1962). This approach computes gradients by solving a second, aug-
|
||
mented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly
|
||
with problem size, has low memory cost, and explicitly controls numerical error.
|
||
Consider optimizing a scalar-valued loss function L(), whose input is the result of an ODE solver:
|
||
|
||
<<FORMULA>> (3)
|
||
|
||
To optimize L, we require gradients with respect to θ. The first step is to determining how the gradient
|
||
of the loss depends on the hidden state z(t) at each instant. This quantity is called the adjoint a(t) = ∂L/∂z(t).
|
||
Its dynamics are given by another ODE, which can be thought of as the State instantaneous analog of the chain rule:
|
||
Adjoint State
|
||
|
||
<<FORMULA>> (4)
|
||
|
||
We can compute ∂L/∂z(t0 ) by another call to an ODE solver. This solver must run backwards, starting from the initial
|
||
value of ∂L/∂z(t1 ). One complication is that solving this ODE requires the knowing value of z(t) along its entire tra-
|
||
jectory. However, we can simply recompute z(t) backwards in time together with the adjoint, starting from its final
|
||
value z(t1 ).
|
||
|
||
If the loss depends directly on the state at multi- Computing the gradients with respect to the pa-
|
||
ple observation times, the adjoint state must be parameters θ requires evaluating a third integral,
|
||
updated in the direction of the partial derivative of which depends on both z(t) and a(t):
|
||
the loss with respect to each observation.
|
||
|
||
<<FORMULA>> (5)
|
||
|
||
The vector-Jacobian products <<FORMULA>> and <<FORMULA>> in (4) and (5) can be efficiently evaluated by
|
||
automatic differentiation, at a time cost similar to that of evaluating f . All integrals for solving z,
|
||
and <<FORMULA>> can be computed in a single call to an ODE solver, which concatenates the original state, the
|
||
adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the
|
||
necessary dynamics, and call an ODE solver to compute all gradients at once.
|
||
|
||
<<ALGORITHM>>
|
||
|
||
Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends
|
||
on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate
|
||
solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint
|
||
must be adjusted in the direction of the corresponding partial derivative ∂L/∂z(ti ).
|
||
The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of
|
||
Algorithm 1 including derivatives w.r.t. t0 and t1 can be found in Appendix C. Detailed derivations
|
||
are provided in Appendix B. Appendix D provides Python code which computes all derivatives for
|
||
scipy.integrate.odeint by extending the autograd automatic differentiation package. This
|
||
code also supports all higher-order derivatives. We have since released a PyTorch (Paszke et al.,
|
||
2017) implementation, including GPU-based implementations of several standard ODE solvers at
|
||
github.com/rtqichen/torchdiffeq.
|
||
|
||
Replacing residual networks with ODEs for supervised learning
|
||
|
||
In this section, we experimentally investigate the training of neural ODEs for supervised learning.
|
||
Software To solve ODE initial value problems numerically, we use the implicit Adams method
|
||
implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being
|
||
an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires
|
||
solving a nonlinear optimization problem at every step. This setup makes direct backpropagation
|
||
through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd
|
||
framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden
|
||
state dynamics and their derivatives on the GPU using Tensorflow, which were then called from the
|
||
Fortran ODE solvers, which were called from Python autograd code.
|
||
|
||
Model Architectures We experiment with a small residual network which downsamples the et al. (1998).
|
||
input twice then applies 6 standard residual blocks He et al. (2016b), which are replaced by an ODESolve
|
||
module in the ODE-Net variant. We also test a network with the same architecture but where gradients are
|
||
backpropagated directly through a Runge-Kutta integrator, re-ferred to as RK-Net. Table 1 shows test error,
|
||
number of parameters, and memory cost. L denotes the number of layers in the ResNet, and L̃ is the number
|
||
of function evaluations that the ODE solver
|
||
requests in a single forward pass, which can be interpreted as an implicit number of layers. We find
|
||
that ODE-Nets and RK-Nets can achieve around the same performance as the ResNet.
|
||
Error Control in ODE-Nets ODE solvers can approximately ensure that the output is within a
|
||
given tolerance of the true solution. Changing this tolerance changes the behavior of the network.
|
||
We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is
|
||
proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a
|
||
3
|
||
trade-off between accuracy and computational cost. One could train with high accuracy, but switch to
|
||
a lower accuracy at test time.
|
||
Figure 3: Statistics of a trained ODE-Net. (NFE = number of function evaluations.)
|
||
Figure 3c) shows a surprising result: the number of evaluations in the backward pass is roughly
|
||
half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory
|
||
efficient, but also more computationally efficient than directly backpropagating through the integrator,
|
||
because the latter approach will need to backprop through each function evaluation in the forward
|
||
pass.
|
||
Network Depth It’s not clear how to define the ‘depth‘ of an ODE solution. A related quantity is
|
||
the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver
|
||
and dependent on the initial state or input. Figure 3d shows that he number of function evaluations
|
||
increases throughout training, presumably adapting to increasing complexity of the model.
|
||
|
||
4 Continuous Normalizing Flows
|
||
|
||
The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and
|
||
the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to
|
||
compute exact changes in probability if samples are transformed through a bijective function f :
|
||
|
||
<<FORMULA>> (6)
|
||
|
||
An example is the planar normalizing flow (Rezende and Mohamed, 2015):
|
||
|
||
<<FORMULA>> (7)
|
||
|
||
Generally, the main bottleneck to using the change of variables formula is computing of the deter-
|
||
minant of the Jacobian ∂f/∂z, which has a cubic cost in either the dimension of z, or the number
|
||
of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow
|
||
layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).
|
||
Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the
|
||
computation of the change in normalizing constant:
|
||
Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable
|
||
with probability p(z(t)) dependent on time. Let dz dt = f (z(t), t) be a differential equation describing
|
||
a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z
|
||
and continuous in t, then the change in log probability also follows a differential equation,
|
||
|
||
<<FORMULA>> (8)
|
||
|
||
Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation.
|
||
Also unlike standard finite flows, the differential equation f does not need to be bijective, since if
|
||
uniqueness is satisfied, then the entire transformation is automatically bijective.
|
||
As an example application of the instantaneous change of variables, we can examine the continuous
|
||
analog of the planar flow, and its change in normalization constant:
|
||
|
||
<<FORMULA>> (9)
|
||
|
||
Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving
|
||
this combined ODE.
|
||
Using multiple hiddenP units with P linear cost While det is not a linear function, the trace function
|
||
is, which implies tr( n Jn ) = n tr(Jn ). Thus if our dynamics is given by a sum of functions then
|
||
the differential equation for the log density is also a sum:
|
||
|
||
|
||
<<FORMULA>> (10)
|
||
|
||
This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in
|
||
the number of hidden units M . Evaluating such ‘wide’ flow layers using standard normalizing flows
|
||
costs O(M 3 ), meaning that standard NF architectures use many layers of only a single hidden unit.
|
||
Time-dependent dynamics We can specify the parameters of a flow as a function of t, making the
|
||
differential equation f (z(t), t) change with t. This is parameterization is a kind of hypernetwork
|
||
(Ha et al., 2016). We also introduce a gating mechanism for each hidden unit,
|
||
|
||
<<FORMULA>>
|
||
|
||
where σn (t) ∈ (0, 1) is a neural network that learns when the dynamic fn (z) should be applied. We
|
||
call these models continuous normalizing flows (CNF).
|
||
|
||
4.1 Experiments with Continuous Normalizing Flows
|
||
|
||
We first compare continuous and discrete planar flows at learning to sample from a known distribution.
|
||
We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with
|
||
K = M layers, and sometimes much more expressive.
|
||
Density matching We configure the CNF as described above, and train for 10,000 iterations
|
||
using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using
|
||
RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we
|
||
minimize KL (q(x)kp(x)) as the loss function where q is the flow model and the target density p(·)
|
||
can be evaluated. Figure 4 shows that CNF generally achieves lower loss.
|
||
Maximum Likelihood Training A useful property of continuous-time normalizing flows is that
|
||
we can compute the reverse transformation for about the same cost as the forward pass, which cannot
|
||
be said for normalizing flows. This lets us train the flow on a density estimation task by performing
|
||
maximum likelihood estimation, which maximizes Ep(x) [log q(x)] where q(·) is computed using
|
||
the appropriate change of variables theorem, then afterwards reverse the CNF to generate random
|
||
samples from q(x).
|
||
For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 5
|
||
shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the
|
||
transformed distribution after a small amount of time which shows the locations of the initial planar
|
||
flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that
|
||
the particles can be evenly spread into circles. While the CNF transformations are smooth and
|
||
interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting
|
||
the two moons dataset in Figure 5b.
|
||
|
||
5 A generative latent function time-series model
|
||
|
||
Applying neural networks to irregularly-sampled data such as medical records, network traffic, or
|
||
neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the
|
||
latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-
|
||
defined latent variables. Missing data can be addressed using generative time-series models (Álvarez
|
||
and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data
|
||
imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of
|
||
an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).
|
||
We present a continuous-time, generative approach to modeling time series. Our model represents
|
||
each time series by a latent trajectory. Each trajectory is determined from a local initial state, zt0 , and
|
||
a global set of latent dynamics shared across all time series. Given observation times t0 , t1 , . . . , tN
|
||
and an initial state zt0 , an ODE solver produces zt1 , . . . , ztN , which describe the latent state at each
|
||
observation.We define this generative model formally through a sampling procedure:
|
||
<<FORMULA>> (11)
|
||
<<FORMULA>> (12)
|
||
<<FORMULA>> (13)
|
||
Function f is a time-invariant function that takes the value z at the current time step and outputs the
|
||
gradient: ∂z(t)/∂t = f (z(t), θf ). We parametrize this function using a neural net. Because f is time-
|
||
|
||
invariant, given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating
|
||
this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.
|
||
Training and Prediction We can train this latent-variable model as a variational autoen-
|
||
coder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our
|
||
recognition net is an RNN, which consumes the data sequentially backwards in time, and out-
|
||
puts qφ (z0 |x1 , x2 , . . . , xN ). A detailed algorithm can be found in Appendix E. Using ODEs as a
|
||
generative model allows us to make predictions for arbitrary time points t1 ...tM on a continuous
|
||
timeline.
|
||
Poisson Process likelihoods The fact that an observation oc-
|
||
curred often tells us something about the latent state. For ex-
|
||
ample, a patient may be more likely to take a medical test if
|
||
they are sick. The rate of events can be parameterized by a
|
||
function of the latent state: p(event at time t| z(t)) = λ(z(t)).
|
||
Given this rate function, the likelihood of a set of indepen-
|
||
dent observation times in the interval [tstart , tend ] is given by an t
|
||
inhomogeneous Poisson process (Palm, 1943):
|
||
|
||
We can parameterize λ(·) using another neural network. Con-
|
||
veniently, we can evaluate both the latent trajectory and the
|
||
Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate
|
||
learned by such a model on a toy dataset.
|
||
A Poisson process likelihood on observation
|
||
times can be combined with a data likelihood to
|
||
jointly model all observations and the times at
|
||
which they were made.
|
||
|
||
5.1 Time-series Latent ODE Experiments
|
||
|
||
We investigate the ability of the latent ODE
|
||
model to fit and extrapolate time series. The
|
||
recognition network is an RNN with 25 hidden
|
||
units. We use a 4-dimensional latent space. We
|
||
parameterize the dynamics function f with a
|
||
one-hidden-layer network with 20 hidden units.
|
||
The decoder computing p(xti |zti ) is another
|
||
neural network with one hidden layer with 20
|
||
hidden units. Our baseline was a recurrent neu-
|
||
ral net with 25 hidden units trained to minimize
|
||
negative Gaussian log-likelihood. We trained a
|
||
second version of this RNN whose inputs were
|
||
concatenated with the time difference to the next
|
||
observation to aid RNN with irregular observations.
|
||
Bi-directional spiral dataset We generated neural network. (b): Reconstructions and extrapo-
|
||
a dataset of 1000 2-dimensional spirals, each lations by a latent neural ODE. Blue curve shows
|
||
starting at a different point, sampled at 100 model prediction. Red shows extrapolation. (c) A
|
||
equally-spaced timesteps. The dataset contains projection of inferred 4-dimensional latent ODE
|
||
two types of spirals: half are clockwise while trajectories onto their first two dimensions. Color
|
||
the other half counter-clockwise. To make the indicates the direction of the corresponding trajec-
|
||
task more realistic, we add gaussian noise to the tory. The model has learned latent dynamics which
|
||
observations.
|
||
|
||
progression through time, starting at purple and ending at red. Note that the trajectories on the left
|
||
are counter-clockwise, while the trajectories on the right are clockwise.
|
||
Time series with irregular time points To generate irregular timestamps, we randomly sample
|
||
points from each trajectory without replacement (n = {30, 50, 100}). We report predictive root-
|
||
mean-squared error (RMSE) on 100 time points extending beyond those that were used for training.
|
||
Table 2 shows that the latent ODE has substantially lower predictive RMSE.
|
||
|
||
We observed that reconstructions and extrapolations are consistent with the ground truth
|
||
regardless of number of observed points and despite the noise.
|
||
Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimen-
|
||
sions of the latent space. The trajectories form two separate clusters of trajectories, one decoding to
|
||
clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change
|
||
smoothly as a function of the initial point z(t0 ), switching from a clockwise to a counter-clockwise
|
||
spiral.
|
||
|
||
6 Scope and Limitations
|
||
|
||
Minibatching The use of mini-batches is less straightforward than for standard neural networks.
|
||
One can still batch together evaluations through the ODE solver by concatenating the states of each
|
||
batch element together, creating a combined ODE with dimension D × K. In some cases, controlling
|
||
error on all batch elements together might require evaluating the combined system K times more
|
||
often than if each system was solved individually. However, in practice the number of evaluations did
|
||
not increase substantially when using minibatches.
|
||
Uniqueness When do continuous dynamics have a unique solution? Picard’s existence theo-
|
||
rem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is
|
||
unique if the differential equation is uniformly Lipschitz continuous in z and continuous in t. This
|
||
theorem holds for our model if the neural network has finite weights and uses Lipshitz nonlinearities,
|
||
such as tanh or relu.
|
||
Setting tolerances Our framework allows the user to trade off speed for precision, but requires
|
||
the user to choose an error tolerance on both the forward and reverse passes during training. For
|
||
sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation
|
||
experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading
|
||
performance.
|
||
Reconstructing forward trajectories Reconstructing the state trajectory by running the dynamics
|
||
backwards can introduce extra numerical error if the reconstructed trajectory diverges from the
|
||
original. This problem can be addressed by checkpointing: storing intermediate values of z on the
|
||
forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We
|
||
did not find this to be a practical problem, and we informally checked that reversing many layers of
|
||
continuous normalizing flows with default tolerances recovered the initial states.
|
||
8
|
||
7 Related Work
|
||
|
||
The use of the adjoint method for training continuous-time neural networks was previously pro-
|
||
posed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The
|
||
interpretation of residual networks He et al. (2016a) as approximate ODE solvers spurred research
|
||
into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al.,
|
||
2017). We demonstrate these same properties in more generality by directly using an ODE solver.
|
||
Adaptive computation One can adapt computation time by training secondary neural networks
|
||
to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al.,
|
||
2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training
|
||
and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied,
|
||
computationally cheap, and generalizable rules for adapting the amount of computation.
|
||
Constant memory backprop through reversibility Recent work developed reversible versions
|
||
of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives
|
||
the same constant memory advantage as our approach. However, these methods require restricted
|
||
architectures, which partition the hidden units. Our approach does not have these restrictions.
|
||
Learning differential equations Much recent work has proposed learning differential equations
|
||
from data. One can train feed-forward or recurrent neural networks to approximate a differential
|
||
equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applica-
|
||
tions such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting
|
||
Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit
|
||
differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and
|
||
interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic
|
||
variational inference to recover the solution of a given stochastic differential equation.
|
||
Differentiating through ODE solvers The dolfin library (Farrell et al., 2013) implements adjoint
|
||
computation for general ODE and PDE solutions, but only by backpropagating through the individual
|
||
operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient
|
||
estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity
|
||
analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is
|
||
linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint
|
||
method to train bespoke latent dynamic models.
|
||
In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained
|
||
end-to-end with any other differentiable model components. While use of vector-Jacobian products
|
||
for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson
|
||
et al., In Press, 2018), we highlight the potential of a general integration of black-box ODE solvers
|
||
into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling.
|
||
8 Conclusion
|
||
We investigated the use of black-box ODE solvers as a model component, developing new models
|
||
for time-series modeling, supervised learning, and density estimation. These models are evaluated
|
||
adaptively, and allow explicit control of the tradeoff between computation speed and accuracy.
|
||
Finally, we derived an instantaneous version of the change of variables formula, and developed
|
||
continuous-time normalizing flows, which can scale to large layer sizes.
|
||
9 Acknowledgements
|
||
We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya,
|
||
Hossein Soleimani, Eldad Haber, Ken Caluwaerts, Daniel Flam-Shepherd, and Harry Braviner for
|
||
feedback. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful
|
||
discussions. We also thank Yuval Frommer for pointing out an unsupported claim about parameter
|
||
efficiency.
|
||
9
|
||
References
|
||
Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output
|
||
Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.
|
||
Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks.
|
||
In International Conference on Machine Learning, pages 136–145, 2017.
|
||
Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, 2013.
|
||
Joel A E Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi – A
|
||
software framework for nonlinear optimization and optimal control. Mathematical Programming
|
||
Computation, In Press, 2018.
|
||
Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind.
|
||
Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18
|
||
(153):1–153, 2018.
|
||
Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
|
||
normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
|
||
Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betan-
|
||
court. The Stan math library: Reverse-mode automatic differentiation in c++. arXiv preprint
|
||
arXiv:1509.07164, 2015.
|
||
Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible
|
||
architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.
|
||
Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks
|
||
from dynamical systems view. In International Conference on Learning Representations, 2018.
|
||
URL https://openreview.net/forum?id=SyJS-OgR-.
|
||
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural
|
||
networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.
|
||
URL https://doi.org/10.1038/s41598-018-24271-9.
|
||
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun.
|
||
Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st
|
||
Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning
|
||
Research, pages 301–318. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/
|
||
v56/Choi16.html.
|
||
Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw-
|
||
Hill Education, 1955.
|
||
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
|
||
estimation. arXiv preprint arXiv:1410.8516, 2014.
|
||
Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song.
|
||
Recurrent marked temporal point processes: Embedding event history to vector. In International
|
||
Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.
|
||
Patrick Farrell, David Ham, Simon Funke, and Marie Rognes. Automated derivation of the adjoint of
|
||
high-level transient finite element programs. SIAM Journal on Scientific Computing, 2013.
|
||
Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and
|
||
Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint,
|
||
2017.
|
||
J. Futoma, S. Hariharan, and K. Heller. Learning to Detect Sepsis with a Multitask Gaussian Process
|
||
RNN Classifier. ArXiv e-prints, 2017.
|
||
Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network:
|
||
Backpropagation without storing activations. In Advances in Neural Information Processing
|
||
Systems, pages 2211–2221, 2017.
|
||
10
|
||
Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint
|
||
arXiv:1603.08983, 2016.
|
||
David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
|
||
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34
|
||
(1):014004, 2017.
|
||
E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems.
|
||
Springer, 1987.
|
||
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
|
||
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
|
||
pages 770–778, 2016a.
|
||
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
|
||
networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
|
||
Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture
|
||
6a overview of mini-batch gradient descent, 2012.
|
||
Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in
|
||
recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.
|
||
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
|
||
arXiv:1412.6980, 2014.
|
||
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference
|
||
on Learning Representations, 2014.
|
||
Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.
|
||
Improved variational inference with inverse autoregressive flow. In Advances in Neural Information
|
||
Processing Systems, pages 4743–4751, 2016.
|
||
W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
|
||
Mathematik und Physik, 46:435–453, 1901.
|
||
Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation.
|
||
In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21–28. CMU,
|
||
Pittsburgh, Pa: Morgan Kaufmann, 1988.
|
||
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
|
||
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
|
||
Yang Li. Time-dependent representation for neural event sequence prediction. arXiv preprint
|
||
arXiv:1708.00065, 2017.
|
||
Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with
|
||
RNNs: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning
|
||
for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253–
|
||
270. PMLR, 18–19 Aug 2016. URL http://proceedings.mlr.press/v56/Lipton16.html.
|
||
Z. Long, Y. Lu, X. Ma, and B. Dong. PDE-Net: Learning PDEs from Data. ArXiv e-prints, 2017.
|
||
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks:
|
||
Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121,
|
||
2017.
|
||
Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Reverse-mode differentiation of
|
||
native Python. In ICML workshop on Automatic Machine Learning, 2015.
|
||
Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating
|
||
multivariate point process. In Advances in Neural Information Processing Systems, pages 6757–
|
||
6767, 2017.
|
||
11
|
||
Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for
|
||
ODE based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.
|
||
Conny Palm. Intensitätsschwankungen im fernsprechverker. Ericsson Technics, 1943.
|
||
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
|
||
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
|
||
pytorch. 2017.
|
||
Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE
|
||
Transactions on Neural networks, 6(5):1212–1228, 1995.
|
||
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathemat-
|
||
ical theory of optimal processes. 1962.
|
||
M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial
|
||
differential equations. Journal of Computational Physics, pages 125–141, 2018.
|
||
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data-
|
||
driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236, 2018a.
|
||
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for
|
||
time-dependent and nonlinear partial differential equations. SIAM Journal on Scientific Computing,
|
||
40(1):A172–A198, 2018b.
|
||
Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate
|
||
inference in deep generative models. In Proceedings of the 31st International Conference on
|
||
Machine Learning, pages 1278–1286, 2014.
|
||
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv
|
||
preprint arXiv:1505.05770, 2015.
|
||
C. Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
|
||
167–178, 1895.
|
||
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations.
|
||
arXiv preprint arXiv:1804.04272, 2018.
|
||
T. Ryder, A. Golightly, A. S. McGough, and D. Prangle. Black-box Variational Inference for
|
||
Stochastic Differential Equations. ArXiv e-prints, 2018.
|
||
Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta
|
||
means. In Advances in Neural Information Processing Systems 25, 2014.
|
||
Peter Schulam and Suchi Saria. What-if reasoning with counterfactual Gaussian processes. arXiv
|
||
preprint arXiv:1703.10651, 2017.
|
||
Hossein Soleimani, James Hensman, and Suchi Saria. Scalable joint models for reliable uncertainty-
|
||
aware event prediction. IEEE transactions on pattern analysis and machine intelligence, 2017a.
|
||
Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatment-response models for coun-
|
||
terfactual reasoning with continuous-time, continuous-valued interventions. arXiv preprint
|
||
arXiv:1704.02038, 2017b.
|
||
Jos Stam. Stable fluids. In Proceedings of the 26th annual conference on Computer graphics and
|
||
interactive techniques, pages 121–128. ACM Press/Addison-Wesley Publishing Co., 1999.
|
||
Paul Stapor, Fabian Froehlich, and Jan Hasenauer. Optimization and uncertainty analysis of ODE
|
||
models using second order adjoint sensitivity analysis. bioRxiv, page 272005, 2018.
|
||
Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow.
|
||
arXiv preprint arXiv:1611.09630, 2016.
|
||
Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latent-space physics: Towards learning the
|
||
temporal evolution of fluid flow. arXiv preprint arXiv:1802.10123, 2018.
|
||
Hong Zhang and Adrian Sandu. Fatode: a library for forward, adjoint, and tangent linear integration
|
||
of ODEs. SIAM Journal on Scientific Computing, 36(5):C504–C523, 2014.
|
||
|
||
|
||
Appendix A Proof of the Instantaneous Change of Variables Theorem
|
||
|
||
Theorem (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability
|
||
p(z(t)) dependent on time. Let dz/dt = f (z(t), t) be a differential equation describing a continuous-in-time
|
||
transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the
|
||
change in log probability also follows a differential equation:
|
||
|
||
<<FORMULA>>
|
||
|
||
Proof. To prove this theorem, we take the infinitesimal limit of finite changes of log p(z(t)) through time. First
|
||
we denote the transformation of z over an ε change in time as
|
||
<<FORMULA>> (14)
|
||
We assume that f is Lipschitz continuous in z(t) and continuous in t, so every initial value problem has a unique
|
||
solution by Picard’s existence theorem. We also assume z(t) is bounded. These conditions imply that f , Tε , and
|
||
∂T are all bounded. In the following, we use these conditions to exchange limits and products.
|
||
|
||
<<FORMULA>>
|
||
|
||
We can write the differential equation <<FORMULA>> using the discrete change of variables formula, and the
|
||
definition of the derivative:
|
||
|
||
<<FORMULA>> (15)
|
||
|
||
<<FORMULA>> (16)
|
||
|
||
<<FORMULA>> (by L’Hôpital’s rule) (17)
|
||
|
||
<<FORMULA>> (18)
|
||
|
||
<<FORMULA>> (19)
|
||
|
||
<<FORMULA>> (20)
|
||
|
||
The derivative of the determinant can be expressed using Jacobi’s formula, which gives
|
||
|
||
<<FORMULA>> (21)
|
||
|
||
<<FORMULA>> (22)
|
||
|
||
<<FORMULA>> (23)
|
||
|
||
|
||
Substituting Tε with its Taylor series expansion and taking the limit, we complete the proof.
|
||
|
||
<<FORMULA>> (24)
|
||
|
||
<<FORMULA>> (25)
|
||
|
||
<<FORMULA>> (26)
|
||
|
||
<<FORMULA>> (27)
|
||
|
||
|
||
A.1 Special Cases
|
||
|
||
Planar CNF. Let f (z) = uh(wz + b), then ∂z = u ∂h ∂z. Since the trace of an outer product is the inner
|
||
product, we have
|
||
|
||
<<FORMULA>> (28)
|
||
|
||
This is the parameterization we use in all of our experiments.
|
||
Hamiltonian CNF. The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits
|
||
|
||
the data into two equal partitions and is a volume-preserving transformation, implying that ∂t = 0. We
|
||
can verify this. Let
|
||
|
||
<<FORMULA>> (29)
|
||
|
||
Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volume-preserving flow.
|
||
A.2 Connection to Fokker-Planck and Liouville PDEs
|
||
The Fokker-Planck equation is a well-known partial differential equation (PDE) that describes the probability
|
||
density function of a stochastic differential equation as it changes with time. We relate the instantaneous change
|
||
of variables to the special case of Fokker-Planck with zero diffusion, the Liouville equation.
|
||
As with the instantaneous change of variables, let z(t) ∈ RD evolve through time following dz(t)/dt = f (z(t), t).
|
||
Then Liouville equation describes the change in density of z–a fixed point in space–as a PDE,
|
||
|
||
<<FORMULA>> (30)
|
||
|
||
However, (30) cannot be easily used as it requires the partial derivatives of p(z,t)/∂z, which is typically approximated
|
||
using finite difference. This type of PDE has its own literature on efficient and accurate simulation (Stam, 1999).
|
||
Instead of evaluating p(·, t) at a fixed point, if we follow the trajectory of a particle z(t), we obtain
|
||
|
||
<<FORMULA>>
|
||
|
||
partial derivative from first argument, z(t) partial derivative from second argument, t
|
||
|
||
<<FORMULA>> (31)
|
||
|
||
We arrive at the instantaneous change of variables by taking the log,
|
||
|
||
<<FORMULA>> (32)
|
||
|
||
While still a PDE, (32) can be combined with z(t) to form an ODE of size D + 1,
|
||
|
||
<<FORMULA>> (33)
|
||
|
||
Compared to the Fokker-Planck and Liouville equations, the instantaneous change of variables is of more
|
||
practical impact as it can be numerically solved much more easily, requiring an extra state of D for following
|
||
the trajectory of z(t). Whereas an approach based on finite difference approximation of the Liouville equation
|
||
would require a grid size that is exponential in D.
|
||
Appendix B A Modern Proof of the Adjoint Method
|
||
We present an alternative proof to the adjoint method (Pontryagin et al., 1962) that is short and easy to follow.
|
||
14
|
||
B.1 Continuous Backpropagation
|
||
|
||
Let z(t) follow the differential equation dt = f (z(t), t, θ), where θ are the parameters. We will prove that if
|
||
we define an adjoint state
|
||
|
||
<<FORMULA>> (34)
|
||
|
||
then it follows the differential equation
|
||
|
||
<<FORMULA>> (35)
|
||
|
||
For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors.
|
||
The adjoint state is the gradient with respect to the hidden state at a specified time t. In standard neural networks,
|
||
the gradient of a hidden layer ht depends on the gradient from the next layer ht+1 by chain rule
|
||
|
||
<<FORMULA>> (36)
|
||
|
||
With a continuous hidden state, we can write the transformation after an ε change in time as
|
||
|
||
<<FORMULA>> (37)
|
||
|
||
<<FORMULA>> (38)
|
||
|
||
The proof of (35) follows from the definition of derivative:
|
||
|
||
<<FORMULA>> (39)
|
||
|
||
<<FORMULA>> (by Eq 38) (40)
|
||
|
||
<<FORMULA>> (Taylor series around z(T)) (41)
|
||
|
||
<<FORMULA>> (42)
|
||
|
||
<<FORMULA>> (43)
|
||
|
||
<<FORMULA>> (44)
|
||
|
||
<<FORMULA>> (45)
|
||
|
||
We pointed out the similarity between adjoint method and backpropagation (eq. 38). Similarly to backpropaga-
|
||
tion, ODE for the adjoint state needs to be solved backwards in time. We specify the constraint on the last time
|
||
point, which is simply the gradient of the loss wrt the last time point, and can obtain the gradients with respect to
|
||
the hidden state at any time, including the initial value.
|
||
|
||
<<FORMULA>> (46)
|
||
|
||
Here we assumed that loss function L depends only on the last time point tN . If function L depends also on
|
||
intermediate time points t1 , t2 , . . . , tN −1 , etc., we can repeat the adjoint step for each of the intervals [tN −1 , tN ],
|
||
[tN −2 , tN −1 ] in the backward order and sum up the obtained gradients.
|
||
B.2 Gradients wrt. θ and t
|
||
We can generalize (35) to obtain gradients with respect to θ–a constant wrt. t–and and the initial and end times,
|
||
t0 and tN . We view θ and t as states with constant differential equations and write
|
||
|
||
<<FORMULA>> (47)
|
||
|
||
We can then combine these with z to form an augmented state1 with corresponding differential equation and
|
||
adjoint state,
|
||
|
||
<<FORMULA>> (48)
|
||
|
||
Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the
|
||
previous section still hold as this is a special case of a time-variant ODE. The Jacobian of f has the form
|
||
|
||
<<FORMULA>> (49)
|
||
|
||
where each 0 is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain
|
||
|
||
<<FORMULA>> (50)
|
||
|
||
The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain
|
||
the total gradient with respect to the parameters, by integrating over the full interval and setting aθ (tN ) = 0.
|
||
|
||
<<FORMULA>> (51)
|
||
|
||
Finally, we also get gradients with respect to t0 and tN , the start and end of the integration interval.
|
||
|
||
<<FORMULA>> (52)
|
||
|
||
Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver.
|
||
|
||
Appendix C Full Adjoint sensitivities algorithm
|
||
|
||
This more detailed version of Algorithm 1 includes gradients with respect to the start and end times of integration.
|
||
Algorithm 2 Complete reverse-mode derivative of an ODE initial value problem
|
||
|
||
Input: dynamics parameters θ, start time t0 , stop time t1 , final state z(t1 ), loss gradient ∂L/∂z(t1 )
|
||
|
||
<<ALGORITHM>>
|
||
|
||
Note that we’ve overloaded t to be both a part of the state and the (dummy) independent variable. The
|
||
distinction is clear given context, so we keep t as the independent variable for consistency with the rest of the
|
||
text.
|
||
|
||
Appendix D Autograd Implementation
|
||
|
||
<<ALGORITHM>>
|
||
|
||
Appendix E Algorithm for training the latent ODE model
|
||
|
||
To obtain the latent representation zt0 , we traverse the sequence using RNN and obtain parameters of distribution
|
||
q(zt0 |{xti , ti }i , θenc ). The algorithm follows a standard VAE algorithm with an RNN variational posterior and
|
||
an ODESolve model:
|
||
<<ALGORITHM>>
|
||
<<FORMULA>> (53)
|
||
<<ALGORITHM>>
|
||
|
||
Appendix F Extra Figures
|
||
|
||
<<FIGURE>>
|
||
|
||
<<END>> <<END>> <<END>>
|
||
|
||
|
||
<<START>> <<START>> <<START>>
|
||
|
||
Learning differential equations that are easy to solve
|
||
|
||
Jacob Kelly∗ Jesse Bettencourt∗
|
||
University of Toronto, Vector Institute University of Toronto, Vector Institute
|
||
jkelly@cs.toronto.edu jessebett@cs.toronto.edu
|
||
Matthew James Johnson David Duvenaud
|
||
Google Brain University of Toronto, Vector Institute
|
||
|
||
mattjj@google.com duvenaud@cs.toronto.edu
|
||
|
||
Abstract
|
||
|
||
|
||
Differential equations parameterized by neural networks become expensive to solve
|
||
numerically as training progresses. We propose a remedy that encourages learned
|
||
dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate
|
||
for the time cost of standard numerical solvers, using higher-order derivatives
|
||
of solution trajectories. These derivatives are efficient to compute with Taylor-
|
||
mode automatic differentiation. Optimizing this additional objective trades model
|
||
performance against the time cost of solving the learned dynamics. We demonstrate
|
||
our approach by training substantially faster, while nearly as accurate, models in
|
||
supervised classification, density estimation, and time-series modelling tasks.
|
||
|
||
1 Introduction
|
||
|
||
Differential equations describe a system’s behavior by specifying its instantaneous dynamics.
|
||
Historically, differential equations have been derived from theory, such as Newtonian mechanics,
|
||
Maxwell’s equations, or epidemiological models of infectious disease, with parameters inferred
|
||
from observations. Solutions to these equations usually cannot be expressed in closed-form,
|
||
requiring numerical approximation. Recently, ordinary differential equations parameterized by
|
||
millions of learned parameters, called neural ODEs, have been fit for latent time series models,
|
||
density models, or as a replacement for very deep neural networks (Rubanova et al., 2019; Grath-
|
||
wohl et al., 2019; Chen et al., 2018). These models are not constrained to match a theoretical
|
||
model,and sometimes substantially different dynamics can give nearly indistinguishable predictions.
|
||
This raises the possibility that we can find nearly equivalent models that are substantially easier
|
||
and faster to solve. Yet standard training methods have no way to penalize the complexity of the
|
||
dynamics being learned.
|
||
|
||
<<FIGURE>>
|
||
|
||
Equal Contribution. Code available at: github.com/jacobjinkelly/easy-neural-ode
|
||
|
||
How can we learn dynamics that are faster to solve numerically without substantially changing their
|
||
predictions? Much of the computational advantages of a continuous-time formulation come from
|
||
using adaptive solvers, and most of the time cost of these solvers comes from repeatedly evaluating
|
||
the dynamics function, which in our settings is a moderately-sized neural network. So, we’d like to
|
||
reduce the number of function evaluations (NFE) required for these solvers to reach a given error
|
||
tolerance. Ideally, we would add a term penalizing the NFE to the training objective, and let a
|
||
gradient-based optimizer trade off between solver cost and predictive performance. But because NFE
|
||
is integer-valued, we need to find a differentiable surrogate.
|
||
The NFE taken by an adaptive solver depends on how far it can extrapolate the trajectory forward
|
||
without introducing too much error. For example, for a standard adaptive-step Runge-Kutta solver
|
||
with order m, the step size is approximately inversely proportional to the norm of the local mth total
|
||
derivative of the solution trajectory with respect to time. That is, a larger mth derivative leads to a
|
||
smaller step size and thus more function evaluations. Thus, we propose to minimize the norm of this
|
||
total derivative during training, as a way to control the time required to solve the learned dynamics.
|
||
In this paper, we investigate the effect of this speed regularization in various models and solvers.
|
||
We examine the relationship between the solver order and the regularization order, and characterize
|
||
the tradeoff between speed and performance. In most instances, we find that solver speed can be
|
||
approximately doubled without a substantial increase in training loss. We also provide an extension
|
||
to the JAX program transformation framework that provides Taylor-mode automatic differentiation,
|
||
which is asymptotically more efficient for computing the required total derivatives than standard
|
||
nested gradients.
|
||
Our work compares against and generalizes that of Finlay et al. (2020), who proposed regularizing
|
||
dynamics in the FFJORD density estimation model, and showed that it stabilized dynamics enough
|
||
in that setting to allow the use of fixed-step solvers during training.
|
||
2 Background
|
||
An ordinary differential equation (ODE) specifies the instantaneous change of a vector-valued state
|
||
<<FORMULA>>, computing the state at a later time:
|
||
|
||
<<FORMULA>>
|
||
|
||
is called an initial value problem (IVP). For example, f could describe the equations of motion for a
|
||
particle, or the transmission and recovery rates for a virus across a population. Usually, the required
|
||
integral has no analytic solution, and must be approximated numerically.
|
||
Adaptive-step Runge-Kutta ODE Solvers Runge-Kutta methods (Runge, 1895; Kutta, 1901)
|
||
approximate the solution trajectories of ODEs through a series of small steps, starting at time t0 .
|
||
At each step, they choose a step size h, and fit a local approximation to the solution, ẑ(t), using
|
||
several evaluations of f . When h is sufficiently small, the numerical error of a mth-order method
|
||
is bounded by kẑ(t + h) − z(t + h)k ≤ chm+1 for some constant c (Hairer et al., 1993). So, for a
|
||
mth-order method, the local error grows approximately in proportion to the size of the mth coefficient
|
||
in the Taylor expansion of the true solution. All else being equal, controlling this coefficient for all
|
||
dimensions of z(t) will allow larger steps to be taken without surpassing the error tolerance.
|
||
Neural Ordinary Differential Equations The dynamics function f can be a moderately-sized
|
||
neural network, and its parameters θ trained by gradient descent. Solving the resulting IVP is
|
||
analogous to evaluating a very deep residual network in which the number of layers corresponds
|
||
to the number of function evaluations of the solver (Chang et al., 2017; Ruthotto & Haber, 2018;
|
||
Chen et al., 2018). Solving such continuous-depth models using adaptive numerical solvers has
|
||
several computational advantages over standard discrete-depth network architectures. However, this
|
||
approach is often slower than using a fixed-depth network, due to an inability to control the number
|
||
of steps required by an adaptive-step solver.
|
||
|
||
3 Regularizing Higher-Order Derivatives for Speed
|
||
|
||
The ability of Runge-Kutta methods to take large and accurate steps is limited by the Kth-order
|
||
Taylor coefficients of the solution trajectory. We would like these coefficients to be small. Specifically,
|
||
we propose to regularize the squared norm of the Kth-order total derivatives of the state with respect
|
||
to time, integrated along the entire solution trajectory:
|
||
|
||
<<FORMULA>> (1)
|
||
|
||
where k·k2 is the squared `2 norm, and the dependence on the dynamics parameters θ is implicit
|
||
through the solution z(t) integrating dz(t)
|
||
|
||
<<dt = f (z(t), t, θ)>>.
|
||
|
||
During training, we weigh this regularization term by a hyperparameter λ and add it to our original loss
|
||
to get our regularized objective:
|
||
|
||
<<FORMULA>> (2)
|
||
|
||
What kind of solutions are allowed when RK = 0? For K = 0,
|
||
|
||
<<FORMULA>>
|
||
|
||
we have kz(t)k2 = 0, so the only possible solution is z(t) = 0.
|
||
|
||
For K = 1, we have kf (z(t), t)k2 = 0, so all solutions are
|
||
constant, flat trajectories. For K = 2 solutions are straight-line
|
||
trajectories. Higher values of K shrink higher derivatives, but
|
||
don’t penalize lower-order dynamics. For instance, a quadratic
|
||
trajectory will have R3 = 0. Setting the Kth order dynamics to
|
||
exactly zero everywhere automatically makes all higher orders
|
||
zero as well. Figure 1 shows that regularizing R3 on a toy 1D
|
||
neural ODE reduces NFE.
|
||
|
||
<<FIGURE>>
|
||
|
||
Which orders should we regularize? We propose matching the
|
||
order of the regularizer to that of the solver being used. We
|
||
conjecture that regularizing dynamics of lower orders than that
|
||
of the solver restricts the model unnecessarily, and that let-
|
||
ting the lower orders remain unregularized should not increase
|
||
NFE very much. Figure 2 shows empirically which orders
|
||
of Runge-Kutta solvers can efficiently solve which orders of
|
||
toy polynomial trajectories. We test these conjectures on real
|
||
models and datasets in section 6.2.
|
||
|
||
The solution trajectory and our regularization term can be computed in a single call to an ODE solver
|
||
by augmenting the system with the integrand in eq. (1).
|
||
|
||
4 Efficient Higher Order Differentiation with Taylor Mode
|
||
|
||
The number of terms in higher-order forward derivatives grows exponentially in K, becoming
|
||
prohibitively expensive for K = 5, and causing substantial slowdowns even for K = 2 and K = 3.
|
||
Luckily, there exists a generalization of forward-mode automatic differentiation (AD), known as
|
||
Taylor mode, which can compute the total derivative exactly for a cost of only O(K 2 ). We found
|
||
that this asymptotic improvement reduced wall-clock time by an order of magnitude, even for K as
|
||
low as 3.
|
||
First-order forward-mode AD Standard forward-mode AD computes, for a function f (x) and
|
||
an input perturbation vector v, the product ∂f ∂x v. This Jacobian-vector product, or JVP, can be
|
||
computed efficiently without explicitly instantiating the Jacobian. This implicit computation of JVPs
|
||
is straightforward whenever f is a composition of operations for which which implicit JVP rules are
|
||
known.
|
||
Higher-order Jacobian-vector products Forward-mode AD can be generalized to higher orders
|
||
K
|
||
to compute Kth-order Jacobians contracted K times against the perturbation vector: ∂∂xKf v ⊗K .
|
||
Similarly, this can also be computed without representing any Jacobian matrices explicitly.
|
||
|
||
A naïve approach to higher-order forward mode is to recursively apply first-order forward mode.
|
||
K
|
||
Specifically, nesting JVPs K times gives the right answer: <<FORMULA>> but
|
||
causes an unnecessary exponential slowdown, costing O(exp(K)). This is because expressions that
|
||
appear in lower derivatives also appear in higher derivatives, but the work to compute is not shared
|
||
across orders.
|
||
Taylor Mode Taylor-mode AD generalizes Function Taylor propagation rule
|
||
first-order forward mode to compute the first <<y = z + cw>> <<y[k] = z[k] + cw[k]>>
|
||
K derivatives exactly with a time cost of only <<Pk>>
|
||
O(K 2 ) or O(K log K), depending on the op- <<y =z∗w>> << y[k] = h j=0 z[j] w[k−j] i>>
|
||
<<Pk−1>>
|
||
erations involved. Instead of providing rules <<y = z/w>> <<y[k] = w10 zk − j=0 z[j] w[k−j]>>
|
||
for propagating perturbation vectors, one pro- <<Pk>>
|
||
<<y = exp(z)>> <<ỹ[k] = j=1 y[k−j] z̃[j]>>
|
||
vides rules for propagating truncated Taylor <<Pk>>
|
||
series. Some example rules are shown in ta- <<s = sin(z)>> <<s̃[k] = j=1 z̃[j] c[k−j]>>
|
||
<<Pk>>
|
||
ble 1. For more details see the Appendix and <<c = cos(z)>> <<c̃[k] = j=1 −z̃[j] s[k−j]>>
|
||
Griewank & Walther (2008, Chapter 12). We
|
||
provide an open source implementation of Table 1: Rules for propagating Taylor polynomial
|
||
Taylor mode AD in the JAX Python library coefficients through standard functions. These rules
|
||
(Bradbury et al., 2018). generalize standard first-order derivatives. Notation
|
||
<<z[i] = i!1 zi>> and <<ỹ[i] = i!i zi>>.
|
||
|
||
5 Experiments
|
||
|
||
We consider three different tasks in which continuous-
|
||
|
||
depth or continuous time models might have computa-
|
||
|
||
|
||
tional advantages over standard discrete-depth models:
|
||
supervised learning, continuous generative modeling of
|
||
time-series (Rubanova et al., 2019), and density estima-
|
||
tion using continuous normalizing flows (Grathwohl et al.,
|
||
2019). Unless specified otherwise, we use the standard
|
||
|
||
dopri5 Runge-Kutta 4(5) solver (Dormand & Prince,
|
||
1980; Shampine, 1986). <<FIGURE>>
|
||
|
||
5.1 Supervised Learning Figure 3: Number of function evalua-
|
||
tions (NFE) and training error during
|
||
We construct a model for MNIST classification: it takes in training. Speed regularization (solid)
|
||
as input a flattened MNIST image and integrates it through decreases the NFE throughout training
|
||
dynamics given by a simple MLP, then applies a linear without substantially changing the train-
|
||
classification layer. In fig. 3 we compare the NFE and ing error.
|
||
training error of a model with and without regularizing
|
||
R3 .
|
||
|
||
5.2 Continuous Generative Time Series Models
|
||
|
||
As in Rubanova et al. (2019), we use the Latent ODE
|
||
architecture for modelling trajectories of ICU patients
|
||
using the PhysioNet Challenge 2012 dataset (Silva
|
||
et al., 2012). This variational autoencoder architec-
|
||
ture uses an RNN recognition network, and models
|
||
the state dynamics using an ODE in a latent space.
|
||
In the supervised learning setting described in the
|
||
previous section only the final state affects model pre- Figure 4: Regularizing dynamics in a la-
|
||
dictions. In contrast, time-series models’ predictions tent ODE modeling PhysioNet clinical data.
|
||
also depend on the value of the trajectory at all inter- Shown are a representative 2-dimensional
|
||
mediate times when observations were made. So, we slice of 20 dimensional dynamics. We re-
|
||
might expect speed regularization to be ineffective duce average NFE from 281 to 90 while only
|
||
due to these extra constraints on the dynamics. How- incurring an 8% increase in loss.
|
||
ever, fig. 4 shows that, without changing their overall
|
||
shape the latent dynamics can be adjusted to reduce their NFE by a factor of 3.
|
||
|
||
5.3 Density Estimation with Continuous Normalizing Flows
|
||
|
||
Our third task is unsupervised density estimation, using a scalable variant of continuous normalizing
|
||
flows called FFJORD (Grathwohl et al., 2019). We fit the MINIBOONE tabular dataset from
|
||
Papamakarios et al. (2017) and the MNIST image dataset (LeCun et al., 2010). We use the respective
|
||
singe-flow architectures from Grathwohl et al. (2019).
|
||
Grathwohl et al. (2019) noted that the NFE required to numerically integrate their dynamics could
|
||
become prohibitively expensive throughout training. Table 2 shows that we can reduce NFE by 38%
|
||
for only a 0.6% increase in log-likelihood measured in bits/dim.
|
||
How to train your Neural ODE We compare against the approach of Finlay et al. (2020), who
|
||
design two regularization terms specifically for stabilizing the dynamics of FFJORD models:
|
||
|
||
<<FORMULA>>
|
||
|
||
The first term is designed to encourage straight-line paths, and the second, stochastic, term is designed
|
||
to reduce overfitting. Finlay et al. (2020) used fixed-step solvers during training for some datasets.
|
||
We compare these two regularization on training with each of adaptive and fixed-step solvers, and
|
||
evaluated using an adaptive solver, in section 6.3.
|
||
6 Analysis and Discussion
|
||
6.1 Trading off function evaluations for loss
|
||
What does the trade off between accuracy and speed look like? Ideally, we could reduce the solver
|
||
time a lot without substantially reducing model performance. Indeed, this is demonstrated in all three
|
||
settings we explored. Figure 5 shows that generally, model performance starts getting substantially
|
||
worse only after a 50% reduction in solver speed when controlling R2 .
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 5: Tuning the regularization of R2 trades off between training loss and solver speed in three
|
||
different applications of neural ODEs. Horizontal axes show average number of function evaluations,
|
||
and vertical axes show unregularized training loss, both at the end of training.
|
||
|
||
6.2 Order of regularization vs. order of solver
|
||
|
||
Which order of total derivatives should we regularize for a particular solver? As mentioned earlier,
|
||
we conjecture that the best choice would be to match the order of the solver being used. Regularizing
|
||
too low an order might needlessly constrain the dynamics and make it harder to fit the data, while
|
||
regularizing too high an order might leave the dynamics difficult to solve for a lower-order solver.
|
||
However, we also expect that optimizing higher-order derivatives might be challenging, since these
|
||
higher derivatives can change quickly even for small changes to the dynamics parameters.
|
||
Figures 6 and 7 investigate this question on the task of MNIST classification. Figure 6 compares the
|
||
effectiveness of regularizing different orders when using a solver of a particular order. For a 2nd
|
||
order solver, regularizing K = 2 produces a strictly better trade-off between performance and speed,
|
||
as expected. For higher-order solvers, including ones with adaptive order, we found that regularizing
|
||
orders above K = 3 gave little benefit.
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 7 investigates the relationship between RK and the quantity it is meant to be a surrogate
|
||
for: NFE. We observe a clear monotonic relationship between the two, for all orders of solver and
|
||
regularization.
|
||
|
||
6.3 Do we reduce training time?
|
||
|
||
Our approach produces models that are fastest to evaluate at test time. However, when we train
|
||
with adaptive solvers we do not improve overall training time, due to the additional expense of
|
||
computing our regularizer. Training with a fixed-grid solver is faster, but can be unstable if dynamics
|
||
are unregularized. Finlay et al. (2020)’s regularization and ours allow us to use fixed grid solvers and
|
||
reduce training time. However, ours is 2.4× slower than Finlay et al. (2020) for FFJORD because
|
||
their regularization re-uses terms already computed in the FFJORD training objective. For objectives
|
||
where these cannot be re-used, like MNIST classification, our method is 1.7× slower, but achieves
|
||
better test-time NFE.
|
||
|
||
6.4 Are we making the solver overconfident?
|
||
|
||
Because we optimize dynamics in a way specifically designed to make the solver take longer steps,
|
||
we might fear that we are “adversarially attacking” our solver, making it overconfident in its ability
|
||
to extrapolate. Figure 8c shows that this is not the case for MNIST classification.
|
||
|
||
6.5 Does speed regularization overfit?
|
||
|
||
Finlay et al. (2020) motivated one of their regularization terms by the possibility of overfitting: having
|
||
faster dynamics only for the examples in the training set, but still low on the test set. However, they
|
||
did not check whether overfitting was occurring. In fig. 8b we confirm that our regularized dynamics
|
||
have nearly identical average solve time on a held-out test set, on MNIST classification.
|
||
|
||
7 Related Work
|
||
|
||
Although the field of numerical ODE solvers is extremely mature, as far as we know, there has
|
||
been almost no work specifically on tuning differential equations to be faster to solve. The closest
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 8: Figure 8c We observe that the actual solver error is about equally well-calibrated for
|
||
regularized dynamics as random dynamics, indicating that regularization does not make the solver
|
||
overconfident. Figure 8b: There is negligible overfitting of solver speed. ??: Speed regularization
|
||
does not usefully improve generalization. For large λ, our method reduces overfitting, but increases
|
||
overall test error due to under-fitting.
|
||
|
||
related work is Grathwohl et al. (2019) who mention attempting to use weight decay and spectral
|
||
normalization to reduce NFE, and of course Finlay et al. (2020), who, among other contributions,
|
||
introduced the use of fixed-step solvers for stable training.
|
||
Stabilizing dynamics Simard et al. (1991) regularized the dynamics of discrete-time recurrent
|
||
neural networks to improve their stability, by constraining the norm of the Jacobian of the dynamics
|
||
function in the direction of its largest eigenvalue. However, this approach has an O(D3 ) time cost.
|
||
De Brouwer et al. (2019) introduced a parameterization of neural ODEs analogous to instantaneous
|
||
Gated Recurrent Unit (GRU) recurrent neural network architectures in order to stabilize training
|
||
dynamics. Dupont et al. (2019) provided theoretical arguments that adding extra dimensions to the
|
||
state of a neural ODE should make training easier, and showed that this helped reduce NFE during
|
||
training.
|
||
Gradually increasing depth Chang et al. (2017) noted the connection between residual networks
|
||
and ODEs, and took advantage of this connection to gradually make resnets deeper during training,
|
||
in order to save time. One can view the increase in NFE while neural ODEs as an automatic, but
|
||
uncontrolled, version of their method. Their results suggest we might benefit from introducing a
|
||
speed regularization schedule that gradually tapers off during training.
|
||
Gradient Regularization Novak et al. (2018); Drucker & LeCun (1992) regularized the gradients
|
||
of neural networks to improve generalization.
|
||
Table 2: Density Estimation on MNIST using FFJORD. For adaptive solvers, indicated by ∞ Steps,
|
||
our approach is slowest to train, but requires the fewest NFE once trained. For fixed-step solvers our
|
||
approach achieves lower bits/dim and NFE when comparing across fixed-grid solvers using the same
|
||
number of steps. Fixed step solvers that diverged due to instability are indicated by NaN bits/dim.
|
||
|
||
8 Scope
|
||
|
||
The initial speedups obtained in this paper are not yet enough to make neural ODEs competitive with
|
||
standard fixed-depth architectures in terms of speed for standard supervised learning. However, there
|
||
are many applications where continuous-depth architectures provide a unique advantage. Besides
|
||
density models such as FFJORD and time series models, continuous-depth architectures have been
|
||
applied in solving mean-field games (Ruthotto et al., 2019), image segmentation (Pinckaers & Litjens,
|
||
2019), image super-resolution (Scao, 2020), and molecular simulations (Wang et al., 2020). These
|
||
applications, which already use continuous-time models, could benefit from the speed regularization
|
||
proposed in this paper.
|
||
While we investigated only ODEs in this paper, this approach could presumably be extended straight-
|
||
forwardly to neural stochastic differential equations fit by adaptive solvers (Li et al., 2020) and other
|
||
flavors of parametric differential equations fit by gradient descent (Rackauckas et al., 2019).
|
||
|
||
9 Limitations
|
||
|
||
Hyperparameters The hyperparameter λ needs to be chosen to balance speed and training loss.
|
||
One the other hand, neural ODEs don’t require choosing the outer number of layers, which needs to
|
||
be chosen separately for each stack of layers in standard architectures.
|
||
One also needs to choose solver order and tolerances, and these can substantially affect solver speed.
|
||
We did not investigate loosening tolerances, or modifying other parameters of the solver. The default
|
||
tolerance of 1.4e-8 for both atol and rtol behaved well in all our experiments.
|
||
One also needs to choose K. Higher K seems to generally work better, but is slower per step at
|
||
training time. In principle, if one can express their utility explicitly in terms of training loss and NFE,
|
||
it may be possible to tune λ automatically during training using the predictable relationship between
|
||
RK and NFE shown in fig. 7.
|
||
Slower overall training Although speed regularization reduces the overall NFE during training, it
|
||
makes each step more expensive. In our density estimation experiments (table 2), the overall effect
|
||
was about about 70% slower training, compared to no regularization, when using adaptive solvers.
|
||
However, test-time evaluation is much faster, since there is no slowdown per step.
|
||
10 Conclusions
|
||
This paper is an initial attempt at controlling the integration time of differential equations by regular-
|
||
izing their dynamics. This is an almost unexplored problem, and there are almost certainly better
|
||
quantities to optimize than the ones examined in this paper.
|
||
Based on these initial experiments, we propose three practical takeaways:
|
||
1. Across all tasks, tuning the regularization usually gave at least a 2x speedup without
|
||
substantially hurting model performance.
|
||
2. Overall training time with speed regularization is in general about 30% to 50% slower with
|
||
adaptive solvers.
|
||
3. For standard solvers, regularizing orders higher than R2 or R3 provided little additional
|
||
benefit.
|
||
Future work It may be possible to adapt solver architectures to take advantage of flexibility in
|
||
choosing the dynamics. Standard solver design has focused on robustly and accurately solving a
|
||
given set of differential equations. However, in a learning setting, we could consider simply rejecting
|
||
some kinds of dynamics as being too difficult to solve, analogous to other kinds of constraints we put
|
||
on models to encourage statistical regularization.
|
||
|
||
Acknowledgements
|
||
|
||
We thank Barak Perlmutter, Ken Jackson, Ricky T.Q. Chen, Will Grathwohl, Chris Finlay, and
|
||
Chris Rackauckas for feedback and helpful discussions. Resources used in preparing this research
|
||
were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and
|
||
companies sponsoring the Vector Institute.
|
||
|
||
References
|
||
|
||
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-
|
||
Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL http:
|
||
//github.com/google/jax.
|
||
Chang, B., Meng, L., Haber, E., Tung, F., and Begert, D. Multi-level residual networks from
|
||
dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
|
||
Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential
|
||
equations. In Advances in neural information processing systems, pp. 6571–6583, 2018.
|
||
De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. GRU-ODE-Bayes: Continuous modeling of
|
||
sporadically-observed time series. In Advances in Neural Information Processing Systems, pp.
|
||
7377–7388, 2019.
|
||
Dormand, J. R. and Prince, P. J. A family of embedded Runge-Kutta formulae. Journal of computa-
|
||
tional and applied mathematics, 6(1):19–26, 1980.
|
||
Drucker, H. and LeCun, Y. Improving generalization performance using double backpropagation.
|
||
IEEE Trans. Neural Networks, 3(6):991–997, 1992. doi: 10.1109/72.165600. URL https:
|
||
//doi.org/10.1109/72.165600.
|
||
Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. In Advances in Neural Information
|
||
Processing Systems, pp. 3134–3144, 2019.
|
||
Finlay, C., Jacobsen, J.-H., Nurbekyan, L., and Oberman, A. M. How to train your neural ODE.
|
||
arXiv preprint arXiv:2002.02798, 2020.
|
||
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form
|
||
continuous dynamics for scalable reversible generative models. International Conference on
|
||
Learning Representations, 2019.
|
||
Griewank, A. and Walther, A. Evaluating derivatives. 2008.
|
||
Hairer, E., Norsett, S., and Wanner, G. Solving Ordinary Differential Equations I: Nonstiff Problems,
|
||
volume 8. 01 1993. doi: 10.1007/978-3-540-78862-1.
|
||
Kutta, W. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für
|
||
Mathematik und Physik, 46:435–453, 1901.
|
||
LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online].
|
||
Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
|
||
Li, X., Chen, R. T. Q., Wong, T.-K. L., and Duvenaud, D. Scalable gradients for stochastic differential
|
||
equations. In Artificial Intelligence and Statistics, 2020.
|
||
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and
|
||
generalization in neural networks: an empirical study. In 6th International Conference on Learning
|
||
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
|
||
Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
|
||
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation.
|
||
Advances in Neural Information Processing Systems, 2017.
|
||
Pinckaers, H. and Litjens, G. Neural ordinary differential equations for semantic segmentation of
|
||
individual colon glands. arXiv preprint arXiv:1910.10470, 2019.
|
||
9
|
||
Rackauckas, C., Innes, M., Ma, Y., Bettencourt, J., White, L., and Dixit, V. Diffeqflux.jl-a Julia
|
||
library for neural differential equations. arXiv preprint arXiv:1902.02376, 2019.
|
||
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent ordinary differential equations for irregularly-
|
||
sampled time series. In Advances in Neural Information Processing Systems, pp. 5321–5331,
|
||
2019.
|
||
Runge, C. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:
|
||
167–178, 1895.
|
||
Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations. Journal
|
||
of Mathematical Imaging and Vision, pp. 1–13, 2018.
|
||
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and Fung, S. W. A machine learning framework for
|
||
solving high-dimensional mean field game and mean field control problems. CoRR, abs/1912.01825,
|
||
2019. URL http://arxiv.org/abs/1912.01825.
|
||
Scao, T. L. Neural differential equations for single image super-resolution. arXiv preprint
|
||
arXiv:2005.00865, 2020.
|
||
Shampine, L. F. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):
|
||
135–150, 1986. ISSN 00255718, 10886842. URL http://www.jstor.org/stable/2008219.
|
||
Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. Predicting in-hospital mortality of
|
||
ICU patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in
|
||
Cardiology, pp. 245–248, 2012.
|
||
Simard, P., Raysz, J. P., and Victorri, B. Shaping the state space landscape in recurrent networks. In
|
||
Advances in neural information processing systems, pp. 105–112, 1991.
|
||
Wang, W., Axelrod, S., and Gómez-Bombarelli, R. Differentiable molecular simulations for control
|
||
and learning. arXiv preprint arXiv:2003.00868, 2020.
|
||
|
||
Appendix A Taylor-mode Automatic Differentiation
|
||
|
||
A.1 Taylor Polynomials
|
||
|
||
To clarify the relationship between the presentation in Chapter 13 of Griewank & Walther (2008) and
|
||
our results we give the distinction between the Taylor coefficients and derivative coefficients, also
|
||
known, unhelpfully, as Tensor coefficients.
|
||
|
||
For a sufficiently smooth vector valued function f : Rn → Rm and the polynomial
|
||
|
||
<< x(t) = x[0] + x[1] t + x[2] t2 + x[3] t3 + · · · + x[d] td ∈ Rn>> (5)
|
||
|
||
we are interested in the d-truncated Taylor expansion
|
||
|
||
<<y(t) = f (x(t)) + O(td+1 )>> (6)
|
||
|
||
<<≡ y[0] + y[1] t + y[2] t + y[3] t + · · · + y[d] t ∈ R >> (7)
|
||
|
||
with the notation that <<FORMULA>> is the Taylor coefficient, which is the normalized derivative coefficient.
|
||
|
||
The Taylor coefficients of the expansion, y[j] , are smooth functions of the i ≤ j coefficients x[i],
|
||
|
||
<<FORMULA>> (8)
|
||
|
||
<<FORMULA>> (9)
|
||
|
||
<<FORMULA>> (10)
|
||
|
||
<<FORMULA>> (11)
|
||
|
||
These, as given in Griewank & Walther (2008), are written in terms of the normalized, Taylor
|
||
coefficients. This obscures their direct relationship with the derivatives, which we make explicit.
|
||
Consider the polynomial eq. (5) with Taylor coefficients expanded so their normalization is clear.
|
||
Further, let’s use suggestive notation that these coefficients correspond to the higher derivatives of
|
||
x with respect to t, making x(t) a Taylor polynomial. That is <<FORMULA>>.
|
||
<<FORMULA>> (12)
|
||
|
||
<<FORMULA>> (13)
|
||
|
||
<<FORMULA>> (14)
|
||
|
||
Again, we are interested in the polynomial eq. (7), but with the normalization terms explicit
|
||
|
||
<<FORMULA>> (15)
|
||
|
||
Now we can expand the expressions for the Taylor coefficients y[i] to expressions for derivative
|
||
coefficients yi = i!y[i].
|
||
|
||
The coefficients of the Taylor expansion, yj , are smooth functions of the i ≤ j coefficients xi,
|
||
|
||
<<FORMULA>> (16)
|
||
|
||
<<FORMULA>> (17)
|
||
|
||
<<FORMULA>> (18)
|
||
|
||
<<FORMULA>> (19)
|
||
|
||
<<FORMULA>> (20)
|
||
|
||
<<FORMULA>> (21)
|
||
|
||
Therefore, eqs. (16), (17), (19) and (21) show that the derivative coefficient yi are exactly the ith
|
||
order higher derivatives of the composition f (x(t)) with respect to t. The key insight to this exercise
|
||
is that by writing the derivative coefficients explicitly we reveal that the expressions for the terms,
|
||
eqs. (16) to (18) and (20), involve terms previously computed for lower order terms.
|
||
In general, it will be useful to consider that the yk derivative coefficients is a function of all lower
|
||
order input derivatives
|
||
|
||
<<yk = yk (x0 , . . . , xk )>>. (22)
|
||
|
||
We provide the API to compute this in JAX by indexing the k-output of jet
|
||
|
||
<<yk = jet(f, x0 , (x1 , . . . , xk ))[k]>>.
|
||
|
||
A.2 Relationship with Differential Equations
|
||
|
||
A.2.1 Autonomous Form
|
||
|
||
We can transform the initial value problem
|
||
|
||
<<FORMULA>> (23)
|
||
|
||
into an autonomous dynamical system by augmenting the system to include the independent variable
|
||
with trivial dynamics Hairer et al. (1993):
|
||
|
||
<<FORMULA>> (24)
|
||
|
||
We do this for notational convenience, as well it disambiguates that derivatives with respect to t are
|
||
meant in the “total" sense. This is aleviates the potential ambiguity of ∂t f (x(t), t) which could mean
|
||
both the derivative with respect to the second argument and the derivative through x(t) by the chain
|
||
rule <<FORMULA>>.
|
||
|
||
A.2.2 Taylor Coefficients for ODE Solution with jet
|
||
|
||
Recall that jet gives us the coefficients for yi as a function of f and the coefficients xj≤i . We
|
||
can use jet and the relationship xk+1 = yk to recursively compute the coefficients of the solution
|
||
polynomial.
|
||
|
||
Algorithm 1 Taylor Coefficients for ODE Solution by Recursive Jet
|
||
|
||
<<ALGORITHM>>
|
||
|
||
A.3 Regularizing Taylor Terms
|
||
|
||
Computing the Taylor coefficients for the ODE solution as in algorithm 1 will give a local approx-
|
||
imation to the ODE solution. If infinitely many Taylor coefficients could be computed this would
|
||
give the exact solution. The order of the final Taylor coefficient, determining the truncation of the
|
||
polynomial, gives the order of the approximation.
|
||
If the higher order Taylor coefficients of the solution are large, then truncation will result in a local
|
||
approximation that quickly diverts from the solution. However, if the higher Taylor coefficients are
|
||
small then the local approximation will remain close to the solution. This motivates our regularization
|
||
method. The effect of our regularizer on the Taylor expansion of a solution to a neural ODE can be
|
||
seen in fig. 9.
|
||
|
||
Appendix B Experimental Details
|
||
|
||
Experiments were conducted using GPU-based ODE solvers. Training gradients were computed
|
||
using the adjoint method, in which the trajectory is reconstructed backwards in time to save memory,
|
||
for backpropagation. As in Finlay et al. (2020), we normalize our regularization term in eq. (1) by
|
||
the dimension of the vector-valued trajectory z(t) so that we may choose λ free of scaling by the
|
||
dimension of the problem.
|
||
|
||
B.1 Efficient computation of the gradient of regularization term
|
||
|
||
To optimize our regularized objective, we must compute its gradient. We use the adjoint method
|
||
as described in Chen et al. (2018) to differentiate through the solution to the ODE. In particular, to
|
||
optimize our model we only need to compute the gradient of the regularization term. The adjoint
|
||
method gives the gradient of the ODE solution as a solution to an augmented ODE.
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 9: Left: The dynamics and a trajectory of a neural ODE trained on a toy supervised learning
|
||
problem. The dynamics are poorly approximated by a 6th-order local Taylor series, and requires 92
|
||
NFE by a solve by a 5th-order Runge-Kutta solver. Right: Regularizing the 6th-order derivatives of
|
||
trajectories gives dynamics that are easier to solve numerically, requiring only 68 NFE.
|
||
|
||
B.2 Supervised Learning
|
||
|
||
The dynamics function f : Rd × R → Rd is given by an MLP as follows
|
||
|
||
<<z1 = σ(x)>>
|
||
<<h1 = W1 [z1 ; t] + b1>>
|
||
<<z2 = σ(h1 )>>
|
||
<<y = W2 [z2 ; t] + b2>>
|
||
|
||
Where <<[·; ·]>> denotes concatenation of a scalar onto a column vector. The parameters are <<W1 ∈
|
||
R^h×d>>, <<b1 ∈ R^h>> and <<W2 ∈ R^d×h>> , <<b2 ∈ R^d>> . Here we use 100 hidden units, i.e.<< h = 100>>. We have
|
||
<<d = 784>>, the dimension of an MNIST image.
|
||
We train with a batch size of 100 for 160 epochs. We use the standard training set of 60,000 images,
|
||
and the standard test set of 10,000 images as a validation/test set. We optimize our model using SGD
|
||
with momentum with β = 0.9. Our learning rate schedule is 1e-1 for the first 60 epochs, 1e-2 until
|
||
epoch 100, 1e-3 until epoch 140, and 1e-4 for the final 20 epochs.
|
||
B.3 Continuous Generative Modelling of Time-Series
|
||
The PhysioNet dataset consists of observations of 41 distinct traits over a time period of 48 hours.
|
||
We remove the parameters “Age”, “Gender”, “Height”, and “ICUType” as these attributes do not vary
|
||
in time. We also quantize the measurements for each attribute by the hour by averaging multiple
|
||
measurements within the same hour. This leaves 49 unique time stamps (the extra time stamp for
|
||
observations at exactly the endpoint of the 48 hour observation period). We report all our losses on
|
||
this quantized data. We performed this rather coarse quantization for computational reasons having
|
||
to do with our particular implementation of this model. The validation split was obtained by taking
|
||
a random split of 20% of the trajectories from the full dataset. In total there are 8000 trajectories.
|
||
Code is included for processing the dataset, and links to downloading the data may be found in the
|
||
code for Rubanova et al. (2019). All other experimental details may be found in the main body and
|
||
appendices of Rubanova et al. (2019).
|
||
|
||
B.4 Continuous Normalizing Flows
|
||
|
||
For the model trained on the MINIBOONE tabular dataset from Papamakarios et al. (2017), we used
|
||
the same architecture as in Table 4 in the appendix of Grathwohl et al. (2019). We chose the number
|
||
of epochs and a learning rate schedule based on manual tuning on the validation set, in contrast
|
||
to Grathwohl et al. (2019) who tuned these automatically using early stopping and an automatic
|
||
heuristic for the learning rate decay using evaluation on a validation set. In particular, we trained for
|
||
500 epochs with a learning rate of 1e-3 for the first 300 epochs, 1e-4 until epoch 425, and 1e-5
|
||
for the remaining 75 epochs. The number of epochs and learning rate schedule was determined by
|
||
evaluating the model on the validation set every 10 epochs, and decaying the learning rate by a factor
|
||
of 10 once the loss on the validation set stopped improving for several evaluations, with the goal of
|
||
matching or improving upon the log-likelihood reported in Grathwohl et al. (2019). The data was
|
||
obtained as made available from Papamakarios et al. (2017), which was already processed and split
|
||
into train/validation/test. In particular, the training set has 29556 examples, the validation set has
|
||
3284 examples, and the test set has 3648 examples, which consist of 43 features.
|
||
It is important to note that we implemented a single-flow model for the MNIST dataset, while the
|
||
original comparison in Finlay et al. (2020) was on a multi-flow model. This accounts for discrepancy
|
||
in bits/dim and NFE reported in Finlay et al. (2020).
|
||
All other experimental details are as in Grathwohl et al. (2019).
|
||
|
||
B.5 Hardware
|
||
|
||
MNIST Supervised learning, Physionet Time-series, and MNIST FFJORD experiments were trained
|
||
and evaluated on NVIDIA Tesla P100 GPU. Tabular data FFJORD experiments were evaluated on
|
||
NVIDIA Tesla P100 GPU but trained on NVIDIA Tesla T4 GPU. All experiments except for MNIST
|
||
FFJORD were trained with double precision for purposes of reproducibility.
|
||
|
||
Appendix C Additional Results
|
||
|
||
C.1 Overfitting of NFE
|
||
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 10: The difference in NFE is tracked by the variance of NFE.
|
||
|
||
In fig. 10 we note that there is a striking correspondence in the variance of NFE across individual
|
||
examples (in both the train set (dark red) and test set (light red)) and the absolute difference in NFE
|
||
between examples in the training set and test set. This suggests that any difference in the average
|
||
NFE between training examples and test examples is explained by noise in the estimate of the true
|
||
average NFE. It is also interesting that speed regularization does not have a monotonic relationship
|
||
with the variance of NFE, and we speculate as to how this might interact between the correspondence
|
||
of NFE for a particular example and the difficulty in the model correctly classifying it.
|
||
|
||
C.2 Trading off function evaluations with a surrogate loss
|
||
|
||
In fig. 11 and fig. 12 we confirm that our method poses a suitable tradeoff not only on the loss being
|
||
optimized, but also on the potentially non-differentiable loss which we truly care about. On MNIST,
|
||
we get a similar pareto curve when plotting classification error as opposed to cross-entropy loss, and
|
||
similarly on the time-series modelling task we see that we get a similar pareto curve on MSE loss as
|
||
compared to IWAE loss. The pareto curves are plotted for R3 , R2 respectively.
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 11: MNIST Classification
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 12: Physionet Time-Series
|
||
|
||
C.3 Wall-clock Time
|
||
|
||
We include additional tables with wall-clock time and training with fixed grid solvers in table 3 and
|
||
table 4.
|
||
|
||
|
||
Appendix D Comparison to How to Train Your Neural ODE
|
||
|
||
The terms from Finlay et al. (2020) are
|
||
|
||
<<FORMULA>>
|
||
|
||
and an estimate of
|
||
<<FORMULA>>
|
||
|
||
Table 3: Classification on MNIST
|
||
|
||
<<TABLE>>
|
||
|
||
These are combined with a weighted average and integrated along the solution trajectory.
|
||
These terms are motivated by the expansion
|
||
|
||
<<FORMULA>>
|
||
|
||
Namely, eq. (3) regularizes the first total derivative of the solution, f (z(t), t), along the trajectory, and
|
||
eq. (4) regularizes a stochastic estimate of the Frobenius norm of the spatial derivative, ∇z f (z(t), t),
|
||
along the solution trajectory.
|
||
In contrast, R2 regularizes the norm of the second total derivative directly. In particular, this takes
|
||
into account the ∂f ∂t term. In other words, this accounts for the explicit dependence of f on time,
|
||
while eq. (3) and eq. (4) capture only the implicit dependence on time through z(t).
|
||
Even in the case of an autonomous system, that is, where ∂f ∂t is identically 0 and the dynamics f only
|
||
depend implicitly on time, these terms still differ. Namely, R2 integrates the following along the
|
||
solution trajectory:
|
||
|
||
<<FORMULA>>
|
||
|
||
while Finlay et al. (2020) penalizes the respective norms of the matrix ∇z f (z(t), t) and vector
|
||
f (z(t), t) separately.
|
||
|
||
Table 4: Density Estimation on Tabular Data (MINIBOONE)
|
||
|
||
<<TABLE>>
|
||
|
||
<<END>> <<END>> <<END>>
|
||
|
||
|
||
<<START> <<START>> <<START>>
|
||
|
||
|
||
How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
|
||
|
||
Chris Finlay 1 Jörn-Henrik Jacobsen 2 Levon Nurbekyan 3 Adam M Oberman 1
|
||
|
||
Abstract
|
||
|
||
Training neural ODEs on large datasets has not
|
||
been tractable due to the necessity of allowing
|
||
the adaptive numerical ODE solver to refine its
|
||
step size to very small values. In practice this
|
||
leads to dynamics equivalent to many hundreds
|
||
or even thousands of layers. In this paper, we
|
||
overcome this apparent difficulty by introducing
|
||
a theoretically-grounded combination of both op-
|
||
timal transport and stability regularizations which
|
||
encourage neural ODEs to prefer simpler dynam-
|
||
ics out of all the dynamics that solve a problem
|
||
well. Simpler dynamics lead to faster conver-
|
||
gence and to fewer discretizations of the solver,
|
||
considerably decreasing wall-clock time without
|
||
loss in performance. Our approach allows us to
|
||
train neural ODE-based generative models to the
|
||
same performance as the unregularized dynamics,
|
||
with significant reductions in training time. This
|
||
brings neural ODEs closer to practical relevance
|
||
in large-scale applications.
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 1. Optimal transport map and a generic normalizing flow.
|
||
|
||
Indeed, it was observed that there is a striking similarity
|
||
1. Introduction between ResNets and the numerical solution of ordinary
|
||
differential equations (E, 2017; Haber & Ruthotto, 2017;
|
||
Recent research has bridged dynamical systems, a Ruthotto & Haber, 2018; Chen et al., 2018; 2019). In these
|
||
workhorse of mathematical modeling, with neural networks, works, deep networks are interepreted as discretizations of
|
||
the defacto function approximator for high dimensional data. an underlying dynamical system, where time indexes the
|
||
The great promise of this pairing is that the vast mathemat- “depth” of the network and the parameters of the discretized
|
||
ical machinery stemming from dynamical systems can be dynamics are learned. An alternate viewpoint was taken by
|
||
leveraged for modelling high dimensional problems in a neural ODEs (Chen et al., 2018), where the dynamics of
|
||
dimension-independent fashion. the neural network are approximated by an adaptive ODE
|
||
Connections between neural networks and ordinary differ- solver on the fly. This latter approach is quite compelling
|
||
ential equations (ODEs) were almost immediately noted as it does not require specifying the number of layers of the
|
||
after residual networks (He et al., 2016) were first proposed. network beforehand. Furthermore, it allows the learning of
|
||
homeomorphisms without any structural constraints on the
|
||
function computed by the residual block.
|
||
Neural ODEs have shown great promise in the physical sciences
|
||
(Köhler et al., 2019), in modeling irregular time series
|
||
(Rubanova et al., 2019), mean field games (Ruthotto et al.,
|
||
2019), continuous-time modeling (Yildiz et al., 2019; Kanaa
|
||
et al., 2019), and for generative modeling through normaliz-
|
||
ing flows with free-form Jacobians (Grathwohl et al., 2019).
|
||
|
||
|
||
Recent work has even adapted neural ODEs to the stochas- based on (ODE) which abstain from a priori fixing step-size.
|
||
tic setting (Li et al., 2020). Despite these successes, some Chen et al.’s method is a continuous-time generalization of
|
||
hurdles still remain. In particular, although neural ODEs are residual networks, where the dynamics are generated by an
|
||
memory efficient, they can take a prohibitively long time to adaptive ODE solver that chooses step-size on-the-fly.
|
||
train, which is arguably one of the main stumbling blocks
|
||
Because of their adaptive nature, neural ODEs can be more
|
||
towards their widespread adoption.
|
||
flexible than ResNets in certain scenarios, such as when
|
||
In this work we reduce the training time of neural ODEs trading between model speed and accuracy. Moreover given
|
||
by regularizing the learned dynamics, complementing other a fixed network depth, the memory footprint of neural ODEs
|
||
recent approaches to this end such as augmented neural is orders of magnitude smaller than a standard ResNet dur-
|
||
ODEs (Dupont et al., 2019). Without further constraints on ing training. They therefore show great potential on a host
|
||
their dynamics, high dimensional neural ODEs may learn of applications, including generative modeling and density
|
||
dynamics which minimize an objective function, but which estimation. An apparent drawback of neural ODEs is their
|
||
generate irregular solution trajectories. See for example long training time: although a learned function f (· ; θ) may
|
||
Figure 1b, where an unregularized flow exhibits undesirable generate a map that solves a problem particularly well, the
|
||
properties due to unnecessarily fluctuating dynamics. As computational cost of numerically integrating (ODE) may
|
||
a solution, we propose two theoretically motivated regular- be so prohibitive that it is not tractable in practice. In this
|
||
ization terms arising from an optimal transport viewpoint paper we demonstrate this need not be so: with proper reg-
|
||
of the learned map, which encourage well-behaved dynam- ularization, it is possible to learn f (· ; θ) so that (ODE) is
|
||
ics (see 1a left). We empirically demonstrate that proper easily and quickly solved.
|
||
regularization leads to significant speed-up in training time
|
||
without loss in performance, thus bringing neural ODEs 2.1. FFJORD
|
||
closer to deployment on large-scale datasets. Our methods
|
||
are validated on the problem of generative modelling and In density estimation and generative modeling, we wish
|
||
density estimation, as an example of where neural ODEs to estimate an unknown data distribution p(x) from which
|
||
have shown impressive results, but could easily be applied we have drawn N samples. Maximum likelihood seeks to
|
||
elsewhere. approximate p(x) with a parameterized distribution pθ (x)
|
||
by minimizing the Kullback-Leibler divergence between the
|
||
In summary, our proposed regularized neural ODE (RN- two, or equivalently minimizing
|
||
ODE) achieves the same performance as the baseline, while
|
||
reducing the wall-clock training time by many hours or even
|
||
days. <<FORMULA>> (1)
|
||
|
||
2. Neural ODEs & Continuous normalizing Continuous normalizing flows (Grathwohl et al., 2019; Chen
|
||
flows et al., 2018) parameterize pθ (x) using a vector field f :
|
||
Rd × R 7→ Rd as follows. Let z(x, T ) be the solution map
|
||
Neural ODEs simplify the design of deep neural networks given by running the dynamics (ODE) for fixed time T .
|
||
by formulating the forward pass of a deep network as the Suppose we are given a known distribution q at final time T ,
|
||
solution of a ordinary differential equation. Initial work such as the normal distribution. Change of variables tells us
|
||
along these lines was motivated by the similarity of the eval- that the distribution pθ (x) may be evaluated through
|
||
uation of one layer of a ResNet and the Euler discretization
|
||
of an ODE. Suppose the block in the t-th layer of a ResNet <<log pθ (x) = log q (z(x, T )) + log det | ∇ z(x, T )|>> (2)
|
||
is given by the function f (x, t; θ), where θ are the block’s
|
||
parameters. Then the evaluation of this layer of the ResNet Evaluating the log determinant of the Jacobian is difficult.
|
||
is simply xt+1 = xt + f (xt , t; θ). Now, instead consider Grathwohl et al. (2019) exploit the following identity from
|
||
the following ODE fluid mechanics (Villani, 2003, p 114)
|
||
|
||
<<FORMULA>> (ODE) <<log det | ∇ z(x, t)| = div (f ) (z(x, t), t))>> (3)
|
||
|
||
The Euler discretization of this ODE with step-size <<τ>> is where <<div(·)>> is the divergence operator, <<div(f ) (x) =
|
||
<<zt+1 = zt + τ f (zt , t; θ)>>, which is nearly identical to the i ∂xi fi (x)>>. By the fundamental theorem of calculus, we
|
||
forward evaluation of the ResNet’s layer (setting step-size 1
|
||
In the normalizing flow literature divergence is typically writ-
|
||
<<τ = 1>> gives equality). Armed with this insight, Chen et al. ten explicitly as the trace of the Jacobian, however we use div (·)
|
||
(2018) suggested a method for training neural networks which is more common elsewhere.
|
||
|
||
<<FIGURE>>
|
||
Figure 2. Log-likelihood (measured in bits/dim) on the validation set as a function of wall-clock time. Rolling average of three hours, with
|
||
90% confidence intervals.
|
||
|
||
may then rewrite (2) in integral form From this simple motivating example, the need for regular-
|
||
ity of the vector field is apparent. Without placing demands
|
||
on the vector field f , it is entirely possible that the learned
|
||
<<log pθ (x) = log q (z(x, T )) + div (f ) (z(x, s), s) ds>>
|
||
dynamics will be poorly conditioned. This is not just a theo-
|
||
(4) retical exercise: because the dynamics must be solved with
|
||
Remark 2.1 (Divergence trace estimate). In (Grathwohl a numerical integrator, poorly conditioned dynamics will
|
||
et al., 2019), the divergence is estimated using an unbiased lead to difficulties during numerical integration of (ODE).
|
||
Monte-Carlo trace estimate (Hutchinson, 1990; Avron & Indeed, later we present results demonstrating a clear corre-
|
||
Toledo, 2011), lation between the number of time steps an adaptive solver
|
||
takes to solve (ODE), and the regularity of f .
|
||
<<FORMULA>> (5) How can the regularity of the vector field be measured? One
|
||
motivating approach is to measure the force experienced by
|
||
a particle z(t) under the dynamics generated by the vector
|
||
By using the substitution (4), the task of maximizing log- field f , which is given by the total derivative of f with
|
||
likelihood shifts from choosing pθ to minimize (1), to learn- respect to time
|
||
ing the flow generated by a vector field f . This results in a
|
||
normalizing flow with a free-form Jacobian and reversible
|
||
dynamics, and was named FFJORD by Grathwohl et al.. <<FORMULA>> (6)
|
||
|
||
2.2. The need for regularity <<FORMULA>> (7)
|
||
|
||
The vector field learned through FFJORD that maximizes Well conditioned flows will place constant, or nearly con-
|
||
the log-likelihood is not unique, and raises troubling prob- stant, force on particles as they travel. Thus, in this work we
|
||
lems related to the regularity of the flow. For a simple propose regularizing the dynamics with two penalty terms,
|
||
example, refer to Figure 1, where we plot two normaliz- one term regularizing f and the other ∇ f . The first penalty,
|
||
ing flows, both mapping a toy one-dimensional distribution presented in Section 3, is a measure of the distance travelled
|
||
to the unit Gaussian, and where both maximize the log- under the flow f , and can alternately be interpreted as the
|
||
likelihood of exactly the same sample of particles. Figure kinetic energy of the flow. This penalty term is based off
|
||
1a presents a “regular” flow, where particles travel in straight of numerical methods in optimal transport, and encourages
|
||
lines that travel with constant speed. In contrast, Figure 1b particles to travel in straight lines with constant speed. The
|
||
shows a flow that still maximizes the log-likelihood, but second penalty term, discussed in Section 4, performs regu-
|
||
that has undesirable properties, such as rapidly varying local larization on the Jacobian of the vector field. Taken together
|
||
trajectories and non-constant speed. the two terms ensure that the force experienced by a particle
|
||
|
||
under the flow is constant or nearly so. 3.1. Linking normalizing flows to optimal transport
|
||
|
||
These two regularizers will promote dynamics that follow Now suppose we wish to minimize (18a), with q(z) a unit
|
||
numerically easy-to-integrate paths, thus greatly speeding normal distribution, and p(x) a data distribution, unknown
|
||
up training time. to us, but from which we have drawn N samples, and which
|
||
we model as a discrete distribution of Dirac masses. Enforc-
|
||
3. Optimal transport maps & ing the initial condition is trivial because we have sampled
|
||
from p directly. The continuity equation (18b) need not be
|
||
Benamou-Brenier
|
||
enforced because we are tracking a finite number of sam-
|
||
There is a remarkable similarity between density estimation pled particles. However the final time condition ρT = q
|
||
using continuous time normalizing flows, and the calcula- cannot be implemented directly, since we do not have di-
|
||
tion of the optimal transport map between two densities rect control on the form ρT (z) takes. Instead, introduce
|
||
using the Benamou-Brenier formulation (Benamou & Bre- a Kullback-Leibler term to (18a) penalizing discrepancy
|
||
nier, 2000; Santambrogio, 2015). While a review of optimal between ρT and q. This penalty term has an elegant simpli-
|
||
transport theory is far outside the scope of this paper, here fication when p(x) is modeled as a distribution of a finite
|
||
we provide an informal summary of key ideas relevant to number of masses, as is done in generative modeling. Set-
|
||
continuous normalizing flows. The quadratic-cost optimal ting ρ0 = pθ a brief derivation yields
|
||
transport map between two densities p(x) and q(x) is a map
|
||
z : Rd 7→ Rd minimizing the transport cost
|
||
<<FORMULA>> (10)
|
||
|
||
<<FORMULA>> (8)
|
||
With this simplification (18a) becomes
|
||
|
||
subject to the constraint that A q(z) dz = z−1 (A) p(x) dx,
|
||
in other words that the measure of any set A is preserved
|
||
under the map z. In a seminal work, Benamou & Brenier <<FORMULA>> (11)
|
||
(2000) showed that rather than solving for minimizers of (8)
|
||
directly, an indirect (but computationally efficient) method
|
||
is available by writing z(x, T ) as the solution map of a
|
||
flow under a vector field f (as in (ODE)) for time T , by For further details on this derivation consult the supplemen-
|
||
minimizing tary materials.
|
||
The connection between the Benamou-Brenier formulation
|
||
<<FORMULA>> (9a) of the optimal transport problem on a discrete set of points
|
||
and continuous normalizing flows is apparent: the optimal
|
||
transport problem (11) is a regularized form of the continu-
|
||
<<FORMULA>> (9b) ous normalizing flow optimization problem (1). We there-
|
||
<<ρ0 (x) = p>>, (9c) fore expect that adding a kinetic energy regularization term
|
||
<<ρT (z) = q>>. (9d) to FFJORD will encourage solution trajectories to prefer
|
||
straight lines with constant speed.
|
||
The objective function (18a) is a measure of the kinetic
|
||
energy of the flow. The constraint (18b) ensures probability
|
||
mass is conserved. The latter two constraints guarantee the 4. Unbiased Frobenius norm regularization of
|
||
learned distribution agrees with the source p and target q. the Jacobian
|
||
Note that the kinetic energy (18a) is an upper bound on the
|
||
Refering to equation (7), one can see that even if f is regu-
|
||
transport cost, with equality only at optimality.
|
||
larized to be small, via a kinetic energy penalty term, if the
|
||
The optimal flow f minimizing (18) has several particularly Jacobian is large then the force experienced by a particle
|
||
appealing properties. First, particles induced by the opti- may also still be large. As a result, the error of the numerical
|
||
mal flow f travel in straight lines. Second, particles travel integrator can be large, which may lead an adaptive solver
|
||
with constant speed. Moreover, under suitable conditions to make many function evaluations. This relationship is
|
||
on the source and target distributions, the optimal solution apparent in Figure 3, where we empirically demonstrate the
|
||
map is unique (Villani, 2008). Therefore the solution map correlation between the number of function evaluations of
|
||
z(x, t) is entirely characterized by the initial and final posi- f taken by the adaptive solver, and the size of the Jacobian
|
||
tions: z(x, t) = (1 − Tt )z(x, 0) + Tt z(x, T ). Consequently, norm of f . The correlation is remarkably strong: dynamics
|
||
given an optimal f it is extraordinarily easy to solve (ODE) governed by a poorly conditioned Jacobian matrix require
|
||
numerically with minimal computational effort. the adaptive solver to take many small time steps.
|
||
|
||
|
||
Algorithm 1 RNODE: regularized neural ODE training of
|
||
FFJORD
|
||
<<ALGORITHM>>
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 3. Number of function evaluations vs Jacobian Frobenius
|
||
norm of flows on CIFAR10 during training with vanilla FFJORD,
|
||
using an adaptive ODE solver.
|
||
\
|
||
Avron & Toledo, 2011). For real matrix B, an unbiased
|
||
<<FORMULA>> estimate of the trace is given by
|
||
|
||
<<FORMULA>> (14)
|
||
|
||
where <<FORMULA>> is drawn from a unit normal distribution.
|
||
Thus the squared Frobenius norm can be easily estimated by
|
||
setting B = AAT.
|
||
Moreover, in particle-based methods, the kinetic energy Turning to the Jacobian <<FORMULA>> of a vector valued func-
|
||
term forces dynamics to travel in straight lines only on tion f : Rd 7→ Rd , recall that the vector-Jacobian product
|
||
data seen during training, and so the regularity of the map <<FORMULA>> may be quickly computed through reverse-mode
|
||
is only guaranteed on trajectories taken by training data. automatic differentiation. Therefore an unbiased Monte-
|
||
The issue here is one of generalization: the map may be Carlo estimate of the Frobenius norm of the Jacobian is
|
||
irregular on off-distribution or perturbed images, and cannot readily available
|
||
be remedied by the kinetic energy term during training alone.
|
||
In the context of generalization, Jacobian regularization is <<FORMULA>> (15)
|
||
analagous to gradient regularization, which has been shown
|
||
to improve generalization (Drucker & LeCun, 1992; Novak <<FORMULA>> (16)
|
||
et al., 2018).
|
||
|
||
For these reasons, we also propose regularizing the Jacobian Conveniently, in the FFJORD framework the quantity
|
||
through its Frobenius norm. The Frobenius norm k · kF of a <<FORMULA>> must be computed during the estimate of the prob-
|
||
real matrix A can be thought of as the `2 norm of the matrix ability distribution under the flow, in the Monte-Carlo esti-
|
||
A vectorized mate of the divergence term (5). Thus Jacobian Frobenius
|
||
<<FORMULA>> (12) norm regularization is available with essentially no extra
|
||
computational cost.
|
||
Equivalently it may be computed as
|
||
5. Algorithm description
|
||
|
||
<<kAkF = tr(AAT)>> (13) All together, we propose modifying the objective function
|
||
of the FFJORD continuous normalizing flow (Grathwohl
|
||
and is the Euclidean norm of the singular values of a matrix. et al., 2019) with the two regularization penalties of Sec-
|
||
In trace form, the Frobenius norm lends itself to estimation tions 3 & 4. The proposed method is called RNODE, short
|
||
using a Monte-Carlo trace estimator (Hutchinson, 1990; for regularized neural ODE. Pseudo-code of the method is
|
||
|
||
<<TABLE>>
|
||
|
||
Table 1. Log-likelihood (in bits/dim) and training time (in hours) on validation images with uniform dequantization. Results on clean
|
||
images are found in the supplemental materials. For comparison we report both the results of the original FFJORD paper (Grathwohl
|
||
et al., 2019) and our own independent run of FFJORD (“vanilla”) on CIFAR10 and MNIST. Vanilla FFJORD did not train on ImageNet64
|
||
(denoted by “x”). Also reported are results for other flow-based generative modeling papers. Our method (FFJORD with RNODE) has
|
||
comparable log-likelihood as FFJORD but is significantly faster.
|
||
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 4. Quality of generated samples samples on 5bit CelebA-HQ64 with RNODE. Here temperature annealing (Kingma & Dhariwal,
|
||
2018) with T = 0.7 was used to generate visually appealing images. For full sized CelebA-HQ256 samples, consult the supplementary
|
||
materials.
|
||
|
||
presented in Algorithm 1. The optimization problem to be Here E, l, and n are respectively the kinetic energy, the
|
||
solved is log determinant of the Jacobian, and the integral of the
|
||
Frobenius norm of the Jacobian.
|
||
Both the divergence term and the Jacobian Frobenius norm
|
||
are approximated with Monte-Carlo trace estimates. In our
|
||
<<FORMULA>> implementation, the Jacobian Frobenius estamate reuses
|
||
the computatian T ∇ f from the divergence estimate for
|
||
efficiency. We remark that the kinetic energy term only
|
||
<<FORMULA>> requires the computation of a dot product. Thus just as
|
||
in FFJORD, our implementation scales linearly with the
|
||
<<FORMULA>> (17) number of time steps taken by the ODE solver.
|
||
|
||
Gradients of the objective function with respect to the net-
|
||
where z(x, t) is determined by numerically solving (ODE). work parameters are computed using the adjoint sensitivity
|
||
Note that we take the mean over number of samples and method (Pontryagin et al., 1962; Chen et al., 2018).
|
||
input dimension. This is to ensure that the choice of regu-
|
||
larization strength λK and λJ is independent of dimension
|
||
size and sample size. 6. Experimental design
|
||
To compute the three integrals and the log-probability under Here we demonstrate the benefits of regularizing neural
|
||
q of z(x, T ) at final time T , we augment the dynamics of ODEs on generative models, an application where neu-
|
||
the ODE with three extra terms, so that the entire system ral ODEs have shown strong empirical performance. We
|
||
solved by the numerical integrator is use four datasets: CIFAR10 (Krizhevsky & Hinton, 2009),
|
||
MNIST (LeCun & Cortes, 1998), downsampled ImageNet
|
||
(64x64) (van den Oord et al., 2016), and 5bit CelebA-HQ
|
||
(256x256) (Karras et al., 2017). We use an identical neural
|
||
<<FORMULA>> (RNODE) architecture to that of Grathwohl et al. (2019). The dynamics
|
||
(Kingma & Dhariwal, 2018) trained with 40 GPUs for a week;
|
||
in contrast we train with four GPUs in just under a week.
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 5. Ablation study of the effect of the two regularizers, comparing two measures of flow regularity during training with a fixed
|
||
step-size ODE solver. Figure 5a: mean Jacobian Frobenius norm as a function of training epoch. Figure 5b: mean kinetic energy of the
|
||
flow as a function of training epoch. Figure 5c: number of function evaluations.
|
||
|
||
are defined by a neural network <<f (z, t; θ(t)) : Rd × R+ 7→ step size by a factor of two until the discrete dynamics were
|
||
Rd>> where <<θ(t)>> is piecewise constant in time. On MNIST we stable and achieved good performance. The Runge-Kutta
|
||
use 10 pieces; CIFAR10 uses 14; downsampled ImageNet 4(5) adaptive solver was used on the two larger datasets. We
|
||
uses 18; and CelebA-HQ uses 26 pieces. Each piece is a have also observed that RNODE improves the training time
|
||
4-layer deep convolutional network comprised of 3x3 ker- of the adaptive solvers as well, requiring many fewer func-
|
||
nels and softplus activation functions. Intermediary layers tion evaluations; however in Python we have found that the
|
||
have 64 hidden dimensions, and time t is concatenated to fixed grid solver is typically quicker at a specified number
|
||
the spatial input z. The integration time of each piece is of function evaluations. At test time RNODE uses the same
|
||
[0, 1]. Weight matrices are chosen to imitate the multi-scale adaptive solver as FFJORD.
|
||
architecture of Real NVP (Dinh et al., 2017), in that im-
|
||
We always initialize RNODE so that <<f(z, t) = 0>>; thus train-
|
||
ages are ‘squeezed’ via a permutation to halve image height
|
||
ing begins with an initial identity map. This is done by zero-
|
||
and width but quadruple the number of channels. Diver-
|
||
ing the parameters of the last layer in each piece (block),
|
||
gence of f is estimated using the Gaussian Monte-Carlo
|
||
following Goyal et al. (2017). The identity map is an ap-
|
||
trace estimator with one sample of fixed noise per solver
|
||
propriate choice because it has zero transport cost and zero
|
||
time-step.
|
||
Frobenius norm. Moreover the identity map is trivially
|
||
On MNIST and CIFAR10 we train with a batch size of solveable for any numerical solver, thus training begins
|
||
200 and train for 100 epochs on a single GPU3 , using the without any effort required on the solver’s behalf.
|
||
Adam optimizer (Kingma & Ba, 2015) with a learning rate
|
||
On all datasets we set both the kinetic energy regularization
|
||
of 1e−3. On the two larger datasets, we train with four
|
||
coefficient λK and the Jacobian norm coefficient λJ to 0.01.
|
||
GPUs, using a per-GPU batch size of respectively 3 and 50
|
||
for CelebA-HQ and ImageNet. Data is preprocessed by per-
|
||
turbing with uniform noise followed by the logit transform. 7. Results
|
||
The reference implementation of FFJORD solves the dy- A comparison of RNODE against FFJORD and other flow-
|
||
namics using a Runge-Kutta 4(5) adaptive solver (Dormand based generative models is presented in Table 1. We report
|
||
& Prince, 1980) with error tolerances 1e−5 and initial step both our running of “vanilla” FFJORD and the results as
|
||
size 1e−2. We have found that using less accurate solvers originally reported in (Grathwohl et al., 2019). We highlight
|
||
on the reference implementation of FFJORD results in nu- that RNODE runs roughly 2.8x faster than FFJORD on both
|
||
merically unstable training dynamics. In contrast, a simple datasets, while achieving or surpassing the performance of
|
||
fixed-grid four stage Runge-Kutta solver suffices for RN- FFJORD. This can further be seen in Figure 2 where we plot
|
||
ODE during training on MNIST and CIFAR10, using a bits per dimension ( − d1 log2 p(x), a normalized measure
|
||
step size of 0.25. The step size was determined based on of log-likelihood) on the validation set as a function of
|
||
a simple heuristic of starting with 0.5 and decreasing the training epoch, for both datasets. Visual inspection of the
|
||
sample quality reveals no qualitative difference between
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 6. Quality of generated samples samples with and without regularization on MNIST, left, and CIFAR10, right.
|
||
|
||
regularized and unregularized approaches; refer to Figure 6. encourages flows to travel a minimal distance. In addition,
|
||
Generated images for downsampled ImageNet and CelebA- we see that the Jacobian norm alone also has a beneficial
|
||
HQ are deferred to the supplementary materials; we provide effect on the distance particles travel. Overall, the results
|
||
smaller generated images for networks trained on CelebA- support our theoretical reasoning empirically.
|
||
HQ 64x64 in Figure 4.
|
||
Surprisingly, our run of “vanilla” FFJORD achieved slightly 8. Previous generative flows inspired by
|
||
better performance than the results reported in (Grathwohl optimal transport
|
||
et al., 2019). We suspect the discrepancy in performance
|
||
and run times between our implementation of FFJORD and Zhang et al. (2018) define a neural ODE flow where the
|
||
that of the original paper is due to batch size: Grathwohl dynamics are given as the gradient of a scalar potential func-
|
||
et al. use a batch size of 900 and train on six GPUs, whereas tion. This interpretation has deep connections to optimal
|
||
on MNIST and CIFAR10 we use a batch size of 200 and transport: the optimal transport map is the gradient of a
|
||
train on a single GPU. convex potential function. Yang & Karniadakis (2019) con-
|
||
tinue along these lines, and define an optimal transport again
|
||
We were not able to train vanilla FFJORD on ImageNet64, as a scalar potential gradient. Yang & Karniadakis (2019)
|
||
due to numerical underflow in the adaptive solver’s time step. enforce that the learned map is in fact an optimal trans-
|
||
This issue cannot be remedied by increasing the solver’s port map by penalizing their objective function with a term
|
||
error tolerance, for this would bias the log-likelihood esti- measuring violations of the continuity equation. Ruthotto
|
||
mates on validation. et al. (2019) place generative flows within a broader context
|
||
of mean field games, and as an example consider a neural
|
||
7.1. Ablation study on MNIST ODE gradient potential flow solving the optimal transport
|
||
problem in up to 100 dimensions. We also note the recent
|
||
In Figure 5, we compare the effect of each regularizer by
|
||
work of Twomey et al. (2019), who proposed regularizing
|
||
itself on the training dynamics with the fixed grid ODE
|
||
neural ODEs with an Euler-step discretization of the kinetic
|
||
solver on the MNIST dataset. Without any regularization at
|
||
energy term to enforce ‘straightness’, although connections
|
||
all, training dynamics are numerically unstable and fail after
|
||
to optimal transport were not discussed.
|
||
just under 50 epochs. This is precisely when the Jacobian
|
||
norm grows large; refer to Figure 5a. Figure 5a demonstrates When a flow is the gradient of a scalar potential, the change
|
||
that each regularizer by itself is able to control the Jacobian of variables formula (4) simplifies so that the divergence
|
||
norm. The Jacobian regularizer is better suited to this task, term is replaced by the Laplacian of the scalar potential.
|
||
although it is interesting that the kinetic energy regularizer Although mathematically parsimonious and theoretically
|
||
also improves the Jacobian norm. Unsurprisingly Figure 5b well-motivated, we chose not to implement our flow as the
|
||
demonstrates the addition of the kinetic energy regularizer gradient of a scalar potential function due to computational
|
||
How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
|
||
constraints: such an implementation would require ‘triple through CIFAR, and companies sponsoring the Vector Insti-
|
||
backprop’ (twice to compute or approximate the Laplacian, tute (www.vectorinstitute.ai/#partners).
|
||
and once more for the parameter gradient). Ruthotto et al.
|
||
(2019) circumvented this problem by utilizing special struc- References
|
||
tural properties of residual networks to efficiently compute
|
||
the Laplacian. Avron, H. and Toledo, S. Randomized algorithms for esti-
|
||
mating the trace of an implicit symmetric positive semi-
|
||
definite matrix. J. ACM, 58(2):8:1–8:34, 2011. doi:
|
||
9. Discussion
|
||
10.1145/1944345.1944349. URL https://doi.org/
|
||
In practice, RNODE is simple to implement, and only re- 10.1145/1944345.1944349.
|
||
quires augmenting the dynamics (ODE) with two extra
|
||
scalar equations (one for the kinetic energy term, and an- Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duve-
|
||
other for the Jacobian penalty). In the setting of FFJORD, naud, D., and Jacobsen, J. Invertible residual networks.
|
||
because we may recycle intermediary terms used in the In Chaudhuri, K. and Salakhutdinov, R. (eds.), Pro-
|
||
divergence estimate, the computational cost of evaluating ceedings of the 36th International Conference on Ma-
|
||
these two extra equations is minimal. RNODE introduces chine Learning, ICML 2019, 9-15 June 2019, Long
|
||
two extra hyperparameters related to the strength of the reg- Beach, California, USA, volume 97 of Proceedings
|
||
ularizers; we have found these required almost no tuning. of Machine Learning Research, pp. 573–582. PMLR,
|
||
2019. URL http://proceedings.mlr.press/
|
||
Although the problem of classification was not considered v97/behrmann19a.html.
|
||
in this work, we believe RNODE may offer similar im-
|
||
provements both in training time and the regularity of the Benamou, J.-D. and Brenier, Y. A computational fluid me-
|
||
classifier learned. In the classification setting we expect the chanics solution to the Monge-Kantorovich mass transfer
|
||
computional overhead of calculating the two extra terms problem. Numerische Mathematik, 84(3):375–393, 2000.
|
||
should be marginal relative to gains made in training time.
|
||
Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duve-
|
||
naud, D. Neural Ordinary Differential Equations. In
|
||
10. Conclusion Advances in Neural Information Processing Systems 31:
|
||
We have presented RNODE, a regularized method for neu- Annual Conference on Neural Information Processing
|
||
ral ODEs. This regularization approach is theoretically Systems 2018, NeurIPS 2018, 3-8 December 2018,
|
||
well-motivated, and encourages neural ODEs to learn well- Montréal, Canada, pp. 6572–6583, 2018. URL http:
|
||
behaved dynamics. As a consequence, numerical integration //papers.nips.cc/paper/7892-neural-
|
||
of the learned dynamics is straight forward and relatively ordinary-differential-equations.
|
||
easy, which means fewer discretizations are needed to solve Chen, T. Q., Behrmann, J., Duvenaud, D., and Jacobsen,
|
||
the dynamics. In many circumstances, this allows for the re- J. Residual flows for invertible generative modeling.
|
||
placement of adaptive solvers with fixed grid solvers, which In Wallach, H. M., Larochelle, H., Beygelzimer,
|
||
can be more efficient during training. This leads to a sub- A., d’Alché-Buc, F., Fox, E. B., and Garnett, R.
|
||
stantial speed up in training time, while still maintaining (eds.), Advances in Neural Information Processing
|
||
the same empirical performance, opening the use of neural Systems 32: Annual Conference on Neural Information
|
||
ODEs to large-scale applications. Processing Systems 2019, NeurIPS 2019, 8-14 Decem-
|
||
ber 2019, Vancouver, BC, Canada, pp. 9913–9923,
|
||
Acknowledgements 2019. URL http://papers.nips.cc/paper/
|
||
9183-residual-flows-for-invertible-
|
||
C. F. and A. O. were supported by a grant from the Innova- generative-modeling.
|
||
tive Ideas Program of the Healthy Brains and Healthy Lives
|
||
initiative (HBHL) through McGill University. Dinh, L., Sohl-Dickstein, J., and Bengio, S. Den-
|
||
L. N. was supported by AFOSR MURI FA9550-18-1-0502, sity estimation using real NVP. In 5th International
|
||
AFOSR Grant No. FA9550-18-1-0167, and ONR Grant No. Conference on Learning Representations, ICLR 2017,
|
||
N00014-18-1-2527. Toulon, France, April 24-26, 2017, Conference Track Pro-
|
||
ceedings, 2017. URL https://openreview.net/
|
||
A. O. was supported by the Air Force Office of Scientific forum?id=HkpbnH9lx.
|
||
Research under award number FA9550-18-1-0167
|
||
Dormand, J. R. and Prince, P. J. A family of embedded
|
||
Resources used in preparing this research were provided, in Runge-Kutta formulae. Journal of computational and
|
||
part, by the Province of Ontario, the Government of Canada applied mathematics, 6(1):19–26, 1980.
|
||
How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
|
||
Drucker, H. and LeCun, Y. Improving generalization per- Hutchinson, M. F. A stochastic estimator of the trace of the
|
||
formance using double backpropagation. IEEE Trans. influence matrix for Laplacian smoothing splines. Com-
|
||
Neural Networks, 3(6):991–997, 1992. doi: 10.1109/ munications in Statistics-Simulation and Computation,
|
||
72.165600. URL https://doi.org/10.1109/ 19(2):433–450, 1990.
|
||
72.165600.
|
||
Kanaa, D., Voleti, V., Kahou, S., and Pal, C. Simple video
|
||
Dupont, E., Doucet, A., and Teh, Y. W. Augmented generation using neural ODEs. Workshop on Learning
|
||
neural ODEs. In Wallach, H. M., Larochelle, H., with Rich Experience, Advances in Neural Information
|
||
Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Gar- Processing Systems 32: Annual Conference on Neural
|
||
nett, R. (eds.), Advances in Neural Information Pro- Information Processing Systems 2019, NeurIPS 2019,
|
||
cessing Systems 32: Annual Conference on Neural 8-14 December 2019, Vancouver, BC, Canada, 2019.
|
||
Information Processing Systems 2019, NeurIPS 2019,
|
||
8-14 December 2019, Vancouver, BC, Canada, pp. Karras, T., Aila, T., Laine, S., and Lehtinen, J. Pro-
|
||
3134–3144, 2019. URL http://papers.nips.cc/ gressive growing of gans for improved quality, stabil-
|
||
paper/8577-augmented-neural-odes. ity, and variation. CoRR, abs/1710.10196, 2017. URL
|
||
http://arxiv.org/abs/1710.10196.
|
||
E, W. A Proposal on Machine Learning via Dynam-
|
||
ical Systems. Communications in Mathematics and Kingma, D. P. and Ba, J. Adam: A method for stochastic op-
|
||
Statistics, 5(1):1–11, March 2017. ISSN 2194-671X. timization. In 3rd International Conference on Learning
|
||
doi: 10.1007/s40304-017-0103-z. URL https:// Representations, ICLR 2015, San Diego, CA, USA, May
|
||
doi.org/10.1007/s40304-017-0103-z. 7-9, 2015, Conference Track Proceedings, 2015. URL
|
||
http://arxiv.org/abs/1412.6980.
|
||
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P.,
|
||
Kingma, D. P. and Dhariwal, P. Glow: Generative flow
|
||
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
|
||
with invertible 1x1 convolutions. In Bengio, S., Wallach,
|
||
He, K. Accurate, large minibatch SGD: training ima-
|
||
H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N.,
|
||
genet in 1 hour. CoRR, abs/1706.02677, 2017. URL
|
||
and Garnett, R. (eds.), Advances in Neural Information
|
||
http://arxiv.org/abs/1706.02677.
|
||
Processing Systems 31: Annual Conference on Neural
|
||
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, Information Processing Systems 2018, NeurIPS 2018,
|
||
I., and Duvenaud, D. FFJORD: free-form continu- 3-8 December 2018, Montréal, Canada, pp. 10236–
|
||
ous dynamics for scalable reversible generative mod- 10245, 2018. URL http://papers.nips.cc/
|
||
els. In 7th International Conference on Learning Rep- paper/8224-glow-generative-flow-with-
|
||
resentations, ICLR 2019, New Orleans, LA, USA, May invertible-1x1-convolutions.
|
||
6-9, 2019, 2019. URL https://openreview.net/
|
||
Köhler, J., Klein, L., and Noé, F. Equivariant flows: sam-
|
||
forum?id=rJxgknCcK7.
|
||
pling configurations for multi-body systems with sym-
|
||
Haber, E. and Ruthotto, L. Stable architectures for deep metric energies. arXiv preprint arXiv:1910.00753, 2019.
|
||
neural networks. Inverse Problems, 34(1):014004, 2017. Krizhevsky, A. and Hinton, G. Learning multiple
|
||
He, K., Zhang, X., Ren, S., and Sun, J. Deep resid- layers of features from tiny images. Technical re-
|
||
ual learning for image recognition. In 2016 IEEE port, University of Toronto, 2009. URL http://
|
||
Conference on Computer Vision and Pattern Recogni- www.cs.toronto.edu/ ̃kriz/cifar.html.
|
||
tion, CVPR 2016, Las Vegas, NV, USA, June 27-30, LeCun, Y. and Cortes, C. The MNIST database of handwrit-
|
||
2016, pp. 770–778. IEEE Computer Society, 2016. doi: ten digits. 1998. URL http://yann.lecun.com/
|
||
10.1109/CVPR.2016.90. URL https://doi.org/ exdb/mnist/.
|
||
10.1109/CVPR.2016.90.
|
||
Li, X., Wong, T. L., Chen, R. T. Q., and Duvenaud, D. Scal-
|
||
Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. able gradients for stochastic differential equations. CoRR,
|
||
Flow++: Improving flow-based generative models with abs/2001.01328, 2020. URL http://arxiv.org/
|
||
variational dequantization and architecture design. In abs/2001.01328.
|
||
Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings
|
||
of the 36th International Conference on Machine Learn- Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and
|
||
ing, ICML 2019, 9-15 June 2019, Long Beach, California, Sohl-Dickstein, J. Sensitivity and generalization in neural
|
||
USA, volume 97 of Proceedings of Machine Learning networks: an empirical study. In 6th International Con-
|
||
Research, pp. 2722–2730. PMLR, 2019. URL http: ference on Learning Representations, ICLR 2018, Van-
|
||
//proceedings.mlr.press/v97/ho19a.html. couver, BC, Canada, April 30 - May 3, 2018, Conference
|
||
How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization
|
||
Track Proceedings. OpenReview.net, 2018. URL https: Processing Systems 2019, NeurIPS 2019, 8-14 December
|
||
//openreview.net/forum?id=HJC2SzZCW. 2019, Vancouver, BC, Canada, pp. 13412–13421, 2019.
|
||
URL http://papers.nips.cc/paper/9497-
|
||
Pontryagin, L. S., Mishchenko, E., Boltyanskii, V., and ode2vae-deep-generative-second-order-
|
||
Gamkrelidze, R. The mathematical theory of optimal odes-with-bayesian-neural-networks.
|
||
processes. 1962.
|
||
Zhang, L., E, W., and Wang, L. Monge-Ampère flow for
|
||
Rubanova, Y., Chen, T. Q., and Duvenaud, D. K. Latent or- generative modeling. CoRR, abs/1809.10188, 2018. URL
|
||
dinary differential equations for irregularly-sampled time http://arxiv.org/abs/1809.10188.
|
||
series. In Advances in Neural Information Processing
|
||
Systems, pp. 5321–5331, 2019.
|
||
Ruthotto, L. and Haber, E. Deep neural networks motivated
|
||
by partial differential equations. Journal of Mathematical
|
||
Imaging and Vision, pp. 1–13, 2018.
|
||
Ruthotto, L., Osher, S. J., Li, W., Nurbekyan, L., and
|
||
Fung, S. W. A machine learning framework for solv-
|
||
ing high-dimensional mean field game and mean field
|
||
control problems. CoRR, abs/1912.01825, 2019. URL
|
||
http://arxiv.org/abs/1912.01825.
|
||
Santambrogio, F. Benamou-Brenier and other continu-
|
||
ous numerical methods, pp. 219–248. Springer Interna-
|
||
tional Publishing, Cham, 2015. ISBN 978-3-319-20828-
|
||
2. doi: 10.1007/978-3-319-20828-2 6. URL https:
|
||
//doi.org/10.1007/978-3-319-20828-2 6.
|
||
Twomey, N., Kozlowski, M., and Santos-Rodrı́guez, R. Neu-
|
||
ral ODEs with stochastic vector field mixtures. CoRR,
|
||
abs/1905.09905, 2019. URL http://arxiv.org/
|
||
abs/1905.09905.
|
||
van den Oord, A., Kalchbrenner, N., and Kavukcuoglu,
|
||
K. Pixel recurrent neural networks. CoRR,
|
||
abs/1601.06759, 2016. URL http://arxiv.org/
|
||
abs/1601.06759.
|
||
Villani, C. Topics in Optimal Transportation. Graduate
|
||
studies in mathematics. American Mathematical Society,
|
||
2003. ISBN 9780821833124.
|
||
Villani, C. Optimal Transport: Old and New. Grundlehren
|
||
der mathematischen Wissenschaften. Springer Berlin Hei-
|
||
delberg, 2008. ISBN 9783540710509. URL https://
|
||
books.google.ca/books?id=hV8o5R7 5tkC.
|
||
Yang, L. and Karniadakis, G. E. Potential flow gener-
|
||
ator with L2 Optimal Transport regularity for gener-
|
||
ative models. CoRR, abs/1908.11462, 2019. URL
|
||
http://arxiv.org/abs/1908.11462.
|
||
Yildiz, C., Heinonen, M., and Lähdesmäki, H. ODE2VAE:
|
||
deep generative second order ODEs with Bayesian neural
|
||
networks. In Wallach, H. M., Larochelle, H., Beygelz-
|
||
imer, A., d’Alché-Buc, F., Fox, E. B., and Garnett,
|
||
R. (eds.), Advances in Neural Information Processing
|
||
Systems 32: Annual Conference on Neural Information
|
||
|
||
A. Details of Section 3.1: Benamou-Brenier Hence, multiplying the objective function in (20) by λ and
|
||
formulation in Lagrangian coordinates ignoring the f -independent term Ex∼p log p(x) we obtain
|
||
an equivalent objective function
|
||
The Benamou-Brenier formulation of the optimal transporta-
|
||
tion (OT) problem in Eulerian coordinates is
|
||
<<FORMULA>> (21)
|
||
|
||
<<FORMULA>> (18a)
|
||
|
||
Finally, if we assume that {xi }N i=1 are iid sampled from p,
|
||
<<FORMULA>> (18b) we obtain the empirical objective function
|
||
|
||
<<ρ0 (x) = p>>, (18c)
|
||
|
||
<<ρT (z) = q>>. (18d) <<FORMULA>> (22)
|
||
|
||
The connection between continuous normalizing flows
|
||
(CNF) and OT becomes transparent once we rewrite (18) in
|
||
Lagrangian coordinates. Indeed, for regular enough velocity
|
||
B. Additional results
|
||
fields f one has that the solution of the continuity equation Here we present additional generated samples on the two
|
||
(18b), (18c) is given by ρt = z(·, t)]p where z is the flow larger datasets considered, CelebA-HQ and ImageNet64. In
|
||
addition bits/dim on clean images are reported in Table 2.
|
||
<<FORMULA>>
|
||
|
||
The relation ρt = z(·, t)]p means that for arbitrary test
|
||
function φ we have that
|
||
|
||
<<φ(x)ρt (x, t)dx = φ(z(x, t))p(x)dx>>
|
||
|
||
Therefore (18) can be rewritten as
|
||
|
||
<<min kf (z(x, t), t)k2 p(x) dxdt>> (19a)
|
||
|
||
<<subject to ż(x, t) = f (z(x, t), t)>>, (19b)
|
||
|
||
<<z(x, 0) = x>>, (19c)
|
||
|
||
<<z(·, T )]p = q>>. (19d)
|
||
|
||
Note that ρt is eliminated in this formulation. The terminal
|
||
condition (18d) is trivial to implement in Eulerian coordi-
|
||
nates (grid-based methods) but not so simple in Lagrangian
|
||
ones (19d) (grid-free methods). To enforce (19d) we intro-
|
||
duce a penalty term in the objective function that measures
|
||
the deviation of z(·, T )]p from q. Thus, the penalized ob-
|
||
jective function is
|
||
<<FORMULA>> (20)
|
||
where λ > 0 is the penalization strength. Next, we observe
|
||
that this objective function can be written as an expectation
|
||
with respect to x ∼ p. Indeed, the Kullback-Leibler di-
|
||
vergence is invariant under coordinate transformations, and
|
||
therefore
|
||
|
||
<<FORMULA>>
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 7. Quality of FFJORD RNODE generated images on ImageNet-64.
|
||
|
||
<<FIGURE>>
|
||
|
||
Figure 8. Quality of FFJORD RNODE generated images on CelebA-HQ. We use temperature annealing, as described in (Kingma &
|
||
Dhariwal, 2018), to generate visually appealing images, with T = 0.5, . . . , 1.
|
||
|
||
Table 2. Additional results and model statistics of FFJORD RNODE. Here we report validation bits/dim on both validation images, and on
|
||
validation images with uniform variational dequantization (ie perturbed by uniform noise). We also report number of trainable model
|
||
parameters.
|
||
<<TABLE>>
|
||
|
||
<<END>> <<END>> <<END>>
|
||
|
||
|
||
<<START>> <<START>> <<START>>
|
||
|
||
A guide to convolution arithmetic for deep
|
||
learning
|
||
|
||
The authors of this guide would like to thank David Warde-Farley,
|
||
Guillaume Alain and Caglar Gulcehre for their valuable feedback. We
|
||
are likewise grateful to all those who helped improve this tutorial with
|
||
helpful comments, constructive criticisms and code contributions. Keep
|
||
them coming!
|
||
Special thanks to Ethan Schoonover, creator of the Solarized color
|
||
scheme, 1 whose colors were used for the figures.
|
||
|
||
Feedback
|
||
Your feedback is welcomed! We did our best to be as precise, infor-
|
||
mative and up to the point as possible, but should there be any thing you
|
||
feel might be an error or could be rephrased to be more precise or com-
|
||
prehensible, please don’t refrain from contacting us. Likewise, drop us a
|
||
line if you think there is something that might fit this technical report
|
||
and you would like us to discuss – we will make our best effort to update
|
||
this document.
|
||
|
||
Source code and animations
|
||
The code used to generate this guide along with its figures is available
|
||
on GitHub. 2 There the reader can also find an animated version of the
|
||
figures.
|
||
|
||
|
||
1 Introduction 5
|
||
1.1 Discrete convolutions . . . . . . . . . . . . . . . . . . . . . . . . .6
|
||
1.2 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10
|
||
|
||
2 Convolution arithmetic 12
|
||
2.1 No zero padding, unit strides . . . . . . . . . . . . . . . . . . . .12
|
||
2.2 Zero padding, unit strides . . . . . . . . . . . . . . . . . . . . . .13
|
||
2.2.1 Half (same) padding . . . . . . . . . . . . . . . . . . . . .13
|
||
2.2.2 Full padding . . . . . . . . . . . . . . . . . . . . . . . . .13
|
||
2.3 No zero padding, non-unit strides . . . . . . . . . . . . . . . . . .15
|
||
2.4 Zero padding, non-unit strides . . . . . . . . . . . . . . . . . . . .15
|
||
|
||
3 Pooling arithmetic 18
|
||
|
||
4 Transposed convolution arithmetic 19
|
||
4.1 Convolution as a matrix operation . . . . . . . . . . . . . . . . .20
|
||
4.2 Transposed convolution . . . . . . . . . . . . . . . . . . . . . . .20
|
||
4.3 No zero padding, unit strides, transposed . . . . . . . . . . . . .21
|
||
4.4 Zero padding, unit strides, transposed . . . . . . . . . . . . . . .22
|
||
4.4.1 Half (same) padding, transposed . . . . . . . . . . . . . .22
|
||
4.4.2 Full padding, transposed . . . . . . . . . . . . . . . . . . .22
|
||
4.5 No zero padding, non-unit strides, transposed . . . . . . . . . . .24
|
||
4.6 Zero padding, non-unit strides, transposed . . . . . . . . . . . . .24
|
||
|
||
5 Miscellaneous convolutions 28
|
||
5.1 Dilated convolutions . . . . . . . . . . . . . . . . . . . . . . . . .28
|
||
|
||
|
||
Chapter 1
|
||
|
||
|
||
Introduction
|
||
|
||
|
||
Deep convolutional neural networks (CNNs) have been at the heart of spectac-
|
||
ular advances in deep learning. Although CNNs have been used as early as the
|
||
nineties to solve character recognition tasks (Le Cunet al., 1997), their current
|
||
widespread application is due to much more recent work, when a deep CNN
|
||
was used to beat state-of-the-art in the ImageNet image classification challenge
|
||
(Krizhevskyet al., 2012).
|
||
Convolutional neural networks therefor e constitute a very useful tool for ma-
|
||
chine learning practitioners. However, learning to use CNNs for the first time
|
||
is generally an intimidating experience. A convolutional layer’s output shape
|
||
is affected by the shape of its input as well as the choice of kernel shape, zero
|
||
padding and strides, and the relationship between these properties is not triv-
|
||
ial to infer. This contrasts with fully-connected layers, whose output size is
|
||
independent of the input size. Additionally, CNNs also usually feature apool-
|
||
ingstage, adding yet another level of complexity with respect to fully-connected
|
||
networks. Finally, so-called transposed convolutional layers (also known as frac-
|
||
tionally strided convolutional layers) have been employed in more and more work
|
||
as of late (Zeileret al., 2011; Zeiler and Fergus, 2014; Longet al., 2015; Rad-
|
||
for det al., 2015; Visinet al., 2015; Imet al., 2016), and their relationship with
|
||
convolutional layers has been explained with various degrees of clarity.
|
||
This guide’s objective is twofold:
|
||
|
||
1.Explain the relationship between convolutional layers and transposed con-
|
||
volutional layers.
|
||
2.Provide an intuitive underst and ing of the relationship between input shape,
|
||
kernel shape, zero padding, strides and output shape in convolutional,
|
||
pooling and transposed convolutional layers.
|
||
|
||
In order to remain broadly applicable, the results shown in this guide are
|
||
independent of implementation details and apply to all commonly used machine
|
||
learning frameworks, such as Theano (Bergstraet al., 2010; Bastienet al., 2012),
|
||
|
||
|
||
|
||
Torch (Collobertet al., 2011), Tensorflow (Abadiet al., 2015) and Caffe (Jia et al., 2014).
|
||
|
||
This chapter briefly reviews the main building blocks of CNNs, namely dis-
|
||
crete convolutions and pooling. for an in-depth treatment of the subject, see
|
||
Chapter 9 of the Deep Learning textbook (Goodfellowet al., 2016).
|
||
|
||
|
||
1.1 Discrete convolutions
|
||
|
||
The bread and butter of neural networks is affine transformations: a vector
|
||
is received as input and is multiplied with a matrix to produce an output (to
|
||
which a bias vector is usually added before passing the result through a non-
|
||
linearity). This is applicable to any type of input, be it an image, a sound
|
||
clip or an unordered collection of features: whatever their dimensionality, their
|
||
representation can always be flattened into a vector before the transfomation.
|
||
Images, sound clips and many other similar kinds of data have an intrinsic
|
||
structure. More formally, they share these important properties:
|
||
|
||
|