diff --git a/Corpus/CORPUS.txt b/Corpus/CORPUS.txt
index 851f739..70d7dd3 100644
--- a/Corpus/CORPUS.txt
+++ b/Corpus/CORPUS.txt
@@ -21521,4 +21521,7539 @@ In this section, we show the plots of feature importance for all the tasks.
<
>
+<> <> <>
+
+
+<> <> <>
+Scalable Gradients for Stochastic Differential Equations
+
+Xuechen Li, Ting-Kam Leonard Wong
+
+Google Research, University of Toronto
+
+Abstract
+
+The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset.
+
+1 Introduction
+
+Deterministic dynamical systems can often be modeled by ordinary differential equations (ODEs). The adjoint sensitivity method can efficiently compute gradients of ODE solutions with constant memory cost. This method was well-known in the physics, numerical analysis, and control communities for decades [3, 4, 60, 65]. Recently, it was combined with modern reverse-mode automatic differentiation packages, enabling ODEs with millions of parameters to be fit to data [12] and allowing more flexible density estimation and time-series models [23, 32, 72].
+Stochastic differential equations (SDEs) generalize ODEs, adding instantaneous noise to their dynamics [55, 77, 78]. They are a natural model for phenomena governed by many small and unobserved interactions, such as the motion of molecules in a liquid [8], allele frequencies in a gene pool [15], or prices in a market [79]. Previous attempts at fitting SDEs mostly relied on methods with poor scaling properties. The pathwise approach [22, 89], a form of forward-mode automatic differentiation, scales poorly in time with the number of parameters and states in the model. On the other hand, simply differentiating through the operations of an SDE solver [19] scales poorly in memory.
+In this work, we generalize the adjoint method to stochastic dynamics defined by SDEs. We give a simple and practical algorithm for fitting SDEs with tens of thousands of parameters, while allowing the use of high-order adaptive time-stepping SDE solvers. We call this approach the stochastic adjoint sensitivity method.
+
+<
>
+
+Table 1: Asymptotic complexity comparison. L is the number of steps used in a fixed-step solve, and D is the number of states and parameters. Both memory and time are expressed in units of the cost of evaluating the drift and diffusion functions once each.
+There are two main difficulties in generalizing the adjoint formulation for ODEs to SDEs. The first is mathematical: SDEs are defined using nonstandard integrals that usually rely on Itô calculus. The adjoint method requires solving the dynamics backwards in time from the end state. However, it is not clear exactly what running the SDE backwards means in the context of stochastic calculus, and when doing so correctly reconstructs the forward trajectory. We address this problem in Section 3, deriving a backward Stratonovich SDE whose dynamics compute the necessary gradient.
+The second difficulty is computational: To retrace the steps, one needs to reconstruct the noise sampled on the forward pass, ideally without storing it. In Section 4, we give an algorithm that allows querying a Brownian motion sample at any time point arbitrarily precisely, while only storing a single random seed.
+
+We combine our adjoint approach with a gradient-based stochastic variational inference scheme for efficiently marginalizing over latent SDE models with arbitrary differentiable likelihoods. This model family generalizes several existing families such as latent ODEs [12, 72], Gaussian state-space models [36, 81], and deep Kalman filters [40], and can naturally handle irregularly-sampled time series and missing observations. We train latent SDEs on toy and real datasets, demonstrating competitive performance compared to existing approaches for dynamics modeling.
+
+2 Background: Stochastic Flows
+
+2.1 Adjoint Sensitivity Method
+The adjoint sensitivity method is an efficient approach for solving control problems, relying on the adjoint (co-state) system [65]. Chen et al. [12] used this method to compute gradients with respect to the parameters of a neural ODE, which is a particular model among many others inspired by the theory of dynamical systems [10, 11, 26, 44, 46, 74, 86]. The method, shown in Algorithm 1, is scalable, since the most costly computation is a vector-Jacobian product defining its backward dynamics. In addition, since the gradient is obtained by solving another ODE, no intermediate computation is stored as in the case of regular backpropagation [73].
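+As a concrete illustration (ours, not from the original paper), the constant-memory ODE adjoint is exposed by the torchdiffeq package; ODEFunc and its layer sizes below are hypothetical:
+
+import torch
+import torch.nn as nn
+from torchdiffeq import odeint_adjoint as odeint   # O(1)-memory gradients
+
+class ODEFunc(nn.Module):
+    # Hypothetical drift network; any module mapping (t, y) -> dy/dt works.
+    def __init__(self, dim=4):
+        super().__init__()
+        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
+
+    def forward(self, t, y):
+        return self.net(y)
+
+func = ODEFunc()
+y0 = torch.randn(16, 4)             # batch of initial states
+ts = torch.linspace(0.0, 1.0, 10)
+ys = odeint(func, y0, ts)           # forward solve; shape (10, 16, 4)
+loss = ys[-1].pow(2).mean()
+loss.backward()                     # backward pass solves the adjoint ODE instead
+                                    # of replaying stored intermediate activations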
+
+2.2 Stochastic Differential Equations
+We briefly define SDEs: Consider a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in \mathcal{T}}, P)$ on which an $m$-dimensional adapted Wiener process (aka Brownian motion) $\{W_t\}_{t \in \mathcal{T}}$ is defined. For a fixed terminal time $T > 0$, we denote by $\mathcal{T} = [0, T]$ the time horizon. We denote the $i$th component of $W_t$ by $W_t^{(i)}$. A stochastic process $\{Z_t\}_{t \in \mathcal{T}}$ can be defined by an Itô SDE
+
+$$Z_t = z_0 + \int_0^t b(Z_s, s)\, ds + \int_0^t \sigma(Z_s, s)\, dW_s, \quad (1)$$
+
+where $z_0 \in \mathbb{R}^d$ is the starting state, and $b: \mathbb{R}^d \times \mathcal{T} \to \mathbb{R}^d$ and $\sigma: \mathbb{R}^d \times \mathcal{T} \to \mathbb{R}^{d \times m}$ are the drift and diffusion functions, respectively. For ease of presentation, we let $m = 1$ in the following unless otherwise stated. Our contributions can be easily generalized to cases where $m > 1$. Here, the second integral on the right hand side of (1) is the Itô stochastic integral [55]. When the coefficients are globally Lipschitz in both the state and time, there exists a unique strong solution to the SDE [55].
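+For intuition, (1) can be simulated with the Euler-Maruyama scheme; a minimal sketch (ours, not from the paper), using independent noise per coordinate for simplicity:
+
+import torch
+
+def euler_maruyama(b, sigma, z0, t0, t1, n_steps):
+    # Fixed-step Euler-Maruyama simulation of dZ = b(Z, t) dt + sigma(Z, t) dW.
+    # For simplicity each coordinate gets its own Brownian motion (diagonal noise).
+    dt = (t1 - t0) / n_steps
+    z, t, path = z0.clone(), t0, [z0.clone()]
+    for _ in range(n_steps):
+        dW = dt ** 0.5 * torch.randn_like(z)   # Wiener increment ~ N(0, dt)
+        z = z + b(z, t) * dt + sigma(z, t) * dW
+        t += dt
+        path.append(z.clone())
+    return torch.stack(path)                   # shape (n_steps + 1, *z0.shape)
+
+# Hypothetical example: geometric Brownian motion dZ = 0.5 Z dt + 0.2 Z dW.
+zs = euler_maruyama(lambda z, t: 0.5 * z, lambda z, t: 0.2 * z,
+                    torch.ones(3), 0.0, 1.0, 100)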
+2.3 Neural Stochastic Differential Equations
+Similar to neural ODEs, one can consider drift and diffusion functions defined by neural networks, a model known as the neural SDE [32, 45, 82, 83].
+Amongst work on neural SDEs, none has enabled an efficient training framework. In particular, Tzen and Raginsky [82] and Liu et al. [45] considered computing the gradient by simulating the forward dynamics of an explicit Jacobian matrix. This Jacobian has size either the square of the number of parameters, or the number of parameters times the number of states, building on the pathwise approach [22, 89]. In contrast, our approach only requires a small number of cheap vector-Jacobian products, independent of the dimension of the parameter and state vectors. These vector-Jacobian products have the same asymptotic time cost as evaluating the drift and diffusion functions, and can be easily computed by modern automatic differentiation libraries [1, 16, 49, 59].
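+To make the cost claim concrete, a vector-Jacobian product against a hypothetical drift network costs one reverse-mode sweep (a sketch, not from the paper):
+
+import torch
+
+d = 10
+drift = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Tanh(), torch.nn.Linear(d, d))
+
+z = torch.randn(d, requires_grad=True)
+a = torch.randn(d)                   # the "vector" in the vector-Jacobian product
+f = drift(z)
+
+# a^T (df/dz) in one reverse-mode sweep, without materializing the d x d Jacobian.
+vjp, = torch.autograd.grad(f, z, grad_outputs=a)
+print(vjp.shape)                     # torch.Size([10])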
+2.4 Backward Stratonovich Integral
+Our stochastic adjoint sensitivity method involves stochastic processes running both forward and backward in time. The Stratonovich stochastic integral, due to its symmetry, gives nice expressions for the backward dynamics and is more convenient for our purpose. Our results can be straightforwardly applied to Itô SDEs as well, using a simple conversion (see e.g. [64, Sec. 2]).
+Following the treatment of Kunita [41], we introduce the forward and backward Stratonovich integrals. Let $\{\mathcal{F}_{s,t}\}_{s \le t;\, s,t \in \mathcal{T}}$ be a two-sided filtration, where $\mathcal{F}_{s,t}$ is the $\sigma$-algebra generated by $\{W_v - W_u\}$ for $s \le u \le v \le t$ such that $s, t \in \mathcal{T}$. For a continuous semimartingale $\{Y_t\}_{t \in \mathcal{T}}$ adapted to the forward filtration $\{\mathcal{F}_{0,t}\}_{t \in \mathcal{T}}$, the Stratonovich stochastic integral is
+
+$$\int_0^T Y_t \circ dW_t = \lim_{|\Pi| \to 0} \sum_{k=0}^{N-1} \frac{Y_{t_{k+1}} + Y_{t_k}}{2}\,\big(W_{t_{k+1}} - W_{t_k}\big),$$
+
+where $\Pi = \{0 = t_0 < \dots < t_N = T\}$ is a partition of the interval $[0, T]$, $|\Pi| = \max_k (t_{k+1} - t_k)$ denotes the size of the largest segment of the partition, and the limit is to be interpreted in the $L^2$ sense. The Itô integral uses instead the left endpoint $Y_{t_k}$ rather than the average. In general, the Itô and Stratonovich integrals differ by a term of finite variation.
+
+To define the backward Stratonovich integral, we consider the backward Wiener process $\{\check{W}_t\}_{t \in \mathcal{T}}$ defined as $\check{W}_t = W_t - W_T$ for all $t \in \mathcal{T}$, which is adapted to the backward filtration $\{\mathcal{F}_{t,T}\}_{t \in \mathcal{T}}$. For a continuous semimartingale $\{\check{Y}_t\}_{t \in \mathcal{T}}$ adapted to the backward filtration,
+
+Algorithm 1 ODE Adjoint Sensitivity
+
+<>
+
+Algorithm 2 SDE Adjoint Sensitivity (Ours)
+
+<>
+
+Figure 1: Pseudocode of the (ODE) adjoint sensitivity method (left), and our generalization to Stratonovich SDEs (right). Differences are highlighted in blue. Square brackets denote vector concatenation.
+the backward Stratonovich integral is
+
+$$\int_s^T Y_t \circ d\check{W}_t = \lim_{|\Pi| \to 0} \sum_{k=0}^{N-1} \frac{Y_{t_k} + Y_{t_{k+1}}}{2}\,\big(\check{W}_{t_k} - \check{W}_{t_{k+1}}\big),$$
+
+where $\Pi = \{s = t_N < \dots < t_0 = T\}$ is the partition of the interval $[s, T]$.
+
+2.5 Stochastic Flow of Diffeomorphisms
+
+It is well known that an ODE defines a flow of diffeomorphisms [6]. Here we consider the stochastic analog for the Stratonovich SDE
+
+$$Z_t = z_0 + \int_0^t b(Z_s, s)\, ds + \int_0^t \sigma(Z_s, s) \circ dW_s. \quad (2)$$
+
+Throughout the paper, we assume that both $b$ and $\sigma$ have infinitely many bounded derivatives w.r.t. the state, and bounded first derivatives w.r.t. time, i.e. $b, \sigma \in C_b^{\infty,1}$, so that the SDE has a unique strong solution. Let $\Phi_{s,t}(z) := Z_t^{s,z}$ denote the solution at time $t$ when the process is started at $z$ at time $s$. Given a realization of the Wiener process, this defines a collection of continuous maps $\mathcal{S} = \{\Phi_{s,t}\}_{s \le t}$ from $\mathbb{R}^d$ to itself.
+The following theorem shows that these maps are diffeomorphisms (after choosing a suitable modification) and that they satisfy backward SDEs.
+
+Theorem 2.1 ([41, Theorem 3.7.1]). (a) With probability 1, the collection $\mathcal{S}$ satisfies the flow property
+
+$$\Phi_{s,t}(z) = \Phi_{u,t}(\Phi_{s,u}(z)), \quad s \le u \le t, \; z \in \mathbb{R}^d.$$
+
+Moreover, each $\Phi_{s,t}$ is a smooth diffeomorphism from $\mathbb{R}^d$ to itself. We thus call $\mathcal{S}$ the stochastic flow of diffeomorphisms generated by the SDE (2).
+(b) The backward flow $\Psi_{s,t} := \Phi_{s,t}^{-1}$ satisfies the backward SDE
+
+$$\Psi_{s,t}(z) = z - \int_s^t b(\Psi_{u,t}(z), u)\, du - \int_s^t \sigma(\Psi_{u,t}(z), u) \circ d\check{W}_u, \quad (3)$$
+
+for all $s \le t$ and $z \in \mathbb{R}^d$.
+
+The coefficients in (2) and (3) differ by only a negative sign. This symmetry is due to our use of the Stratonovich integral (see Figure 2).
+
+<>
+
+3 Sensitivity via Stochastic Adjoint
+
+We present our main contribution: a stochastic analog of the adjoint sensitivity method for SDEs. We use (3) to derive another backward Stratonovich SDE, which we call the stochastic adjoint process. The direct implication is a gradient computation algorithm that works by solving a set of dynamics in reverse time, and relies on cheap vector-Jacobian products without storing any intermediate quantities.
+3.1 Stochastic Adjoint Process
+The goal is to derive a stochastic adjoint process $\{\partial L(Z_T)/\partial Z_t\}_{t \in \mathcal{T}}$ that can be simulated by evaluating only vector-Jacobian products, where $L(Z_T)$ is a scalar loss of the terminal state from the forward flow $Z_T = \Phi_{0,T}(z_0)$.
+We first derive the backward SDE for the process, assuming that $Z_T$ follows the inverse flow from a deterministic end state $Z_T = z$ that does not depend on the realized Wiener process (Lemma 3.1). We then extend to the case where $Z_T$ is obtained by the forward flow starting from a deterministic initial state $z_0$ (Theorem 3.2). This latter part is unconventional, and the resulting value cannot be interpreted as the solution to a backward SDE anymore due to loss of adaptedness. Instead, we will formulate the result with the Itô map [69]. Finally, it is straightforward to extend the state $Z_t$ to include parameters of the drift and diffusion functions such that the desired gradient can be obtained for stochastic optimization; we comment on this step in Section 3.3.
+We first present the SDE for the Jacobian matrix of the backward flow.
+
+Lemma 3.1 (Dynamics of $\nabla\Psi_{s,t}$). Consider the stochastic flow generated by the backward SDE (3) as in Theorem 2.1(b). Letting $J_{s,t}(z) := \nabla\Psi_{s,t}(z)$, we have
+
+$$J_{s,t}(z) = I_d - \int_s^t \nabla b(\Psi_{u,t}(z), u)\, J_{u,t}(z)\, du - \int_s^t \nabla\sigma(\Psi_{u,t}(z), u)\, J_{u,t}(z) \circ d\check{W}_u, \quad (4)$$
+
+for all $s \le t$ and $z \in \mathbb{R}^d$. Furthermore, letting $K_{s,t}(z) := J_{s,t}(z)^{-1}$, we have
+
+$$K_{s,t}(z) = I_d + \int_s^t K_{u,t}(z)\, \nabla b(\Psi_{u,t}(z), u)\, du + \int_s^t K_{u,t}(z)\, \nabla\sigma(\Psi_{u,t}(z), u) \circ d\check{W}_u, \quad (5)$$
+
+for all $s \le t$ and $z \in \mathbb{R}^d$.
+
+The proof, included in Appendix 9.1, relies on Itô's lemma in the Stratonovich form [41, Theorem 2.4.1]. We stress that this lemma considers only the case where the endpoint $z$ is fixed and deterministic.
+Now, we extend to the case where the endpoint is not deterministic, but rather computed from the forward flow. To achieve this, we compose the state process and the loss function. Consider $A_{s,t}(z) = L(\Phi_{s,t}(z))$. The chain rule gives $\nabla A_{s,t}(z) = \nabla L(\Phi_{s,t}(z))\, \nabla\Phi_{s,t}(z)$. Let
+
+$$\tilde{A}_{s,t}(z) := \nabla A_{s,t}(\Psi_{s,t}(z)) = \nabla L(z)\, K_{s,t}(z). \quad (6)$$
+
+Note that $\tilde{A}_{s,T}(Z_T) = \nabla A_{s,T}(Z_s)$. Since $\nabla L(z)$ is a constant, $(\Psi_{s,t}(z), \tilde{A}_{s,t}(z))$ satisfies the augmented backward SDE system
+
+$$\Psi_{s,t}(z) = z - \int_s^t b(\Psi_{u,t}(z), u)\, du - \int_s^t \sigma(\Psi_{u,t}(z), u) \circ d\check{W}_u, \qquad \tilde{A}_{s,t}(z) = \nabla L(z) + \int_s^t \tilde{A}_{u,t}(z)\, \nabla b(\Psi_{u,t}(z), u)\, du + \int_s^t \tilde{A}_{u,t}(z)\, \nabla\sigma(\Psi_{u,t}(z), u) \circ d\check{W}_u. \quad (7)$$
+
+Since the drift and diffusion functions of this augmented system are $C_b^{\infty,1}$, the system has a unique strong solution. Let $s = 0$ and $t = T$. Since (7) admits a strong solution, we may write
+
+$$\big(\Psi_{0,T}(z),\, \tilde{A}_{0,T}(z)\big) = F\big(z, W_\cdot(\omega)\big), \quad (8)$$
+
+where $W_\cdot(\omega)$ denotes the path of the Wiener process and $F$ is a deterministic measurable function (the Itô map) [69, Chapter V, Definition 10.9]. Intuitively, $F$ can be thought of as a black box that computes the solution to the backward SDE system (7) given the position at time $T$ and the realized Wiener process samples. Similarly, we let $G$ be the solution map for the forward flow (2). The next theorem follows immediately from (6) and the definitions of $F$ and $G$.
+
+Theorem 3.2. For $P$-almost all $\omega$,
+
+$$\nabla A_{0,T}(z) = \tilde{F}\big(G(z, W_\cdot(\omega)),\, W_\cdot(\omega)\big) \quad \text{for all } z \in \mathbb{R}^d,$$
+
+where $\tilde{F}$ denotes the $\tilde{A}$-component of the Itô map $F$ in (8).
+
+Proof. This is a consequence of composing (6) and (8).
+
+This shows that one can obtain the gradient by "composing" the backward SDE system (7) with the original forward SDE (2) and ends our continuous-time analysis.
+
+3.2 Numerical Approximation
+In practice, we compute solutions to SDEs with numerical solvers $F^h$ and $G^h$, where $h$ denotes the mesh size of a fixed grid. The approximate algorithm thus outputs $\tilde{F}^h(G^h(z, W_\cdot(\omega)), W_\cdot(\omega))$. The following theorem provides sufficient conditions for convergence.
+Theorem 3.3. Suppose the schemes $F^h$ and $G^h$ satisfy the following conditions: (i) <> in probability as $h \to 0$, and
+(ii) for any <>, we have <> in probability as $h \to 0$. Then, for any starting point $z$ of the forward flow, we have
+
+<>
+
+in probability as $h \to 0$.
+
+See Appendix 9.2 for the proof. Usual schemes such as the Euler-Maruyama scheme (more generally, Itô-Taylor schemes) converge pathwise (i.e. almost surely) from any fixed starting point [38] and satisfy (i). While (ii) is strong, we note that the SDEs considered here have smooth coefficients, and thus their solutions enjoy nice regularity properties in the starting position. Therefore, it is reasonable to expect the corresponding numerical schemes to also behave nicely as a function of both the mesh size and the starting position. To the best of our knowledge, this property is not considered
+at all in the literature on numerical methods for SDEs (where the initial position is fixed), but is crucial in the proof of Theorem 3.3. In Appendix 9.3, we prove that condition (ii) holds for the Euler-Maruyama scheme. Detailed analysis for other schemes is beyond the scope of this paper.
+
+3.3 The Algorithm
+So far we have derived the gradient of the loss with respect to the initial state. We can extend these results to give gradients with respect to parameters of the drift and diffusion functions by treating them as an additional part of the state whose dynamics have zero drift and diffusion. We summarize this in Algorithm 2, assuming access only to a black-box solver sdeint. All terms in the augmented dynamics, such as <>, can be cheaply evaluated by calling <> and <>, respectively.
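+To make the parameter trick concrete, here is a minimal sketch (ours, not the paper's exact internals), assuming a drift f(t, z) that closes over a flat parameter tensor params: a single call to torch.autograd.grad returns both the state and parameter vector-Jacobian products; the diffusion function and the signs of the full backward system follow the same pattern.
+
+import torch
+
+def adjoint_drift(f, t, z, adj_z, params):
+    # Drift of the backward (state, state-adjoint, parameter-adjoint) system.
+    # f(t, z) is the original drift, closing over the flat tensor `params`;
+    # the diffusion function is handled in exactly the same way.
+    with torch.enable_grad():
+        z = z.detach().requires_grad_(True)
+        f_val = f(t, z)
+        # A single vjp yields adj_z^T (df/dz) and adj_z^T (df/dparams) together.
+        vjp_z, vjp_params = torch.autograd.grad(
+            f_val, (z, params), grad_outputs=adj_z, allow_unused=True)
+    # The state runs in reverse; the adjoints accumulate the vjp terms.
+    return -f_val, vjp_z, vjp_params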
+Difficulties with non-diagonal diffusion. In principle, we can simulate the forward and backward adjoint dynamics with any high-order solver of choice. However, for general matrix-valued diffusion functions $\sigma$, to obtain a numerical solution with strong order¹ beyond 1/2, we need to simulate multiple integrals of the Wiener process such as <>. These random variables are difficult to simulate and costly to approximate [87].
+Fortunately, if we restrict our SDE to have diagonal noise, then even though the backward SDE for the stochastic adjoint will not in general have diagonal noise, it will satisfy a commutativity property [70]. In that case, we can safely adopt certain numerical schemes of strong order 1.0 (e.g. Milstein [52] and stochastic Runge-Kutta [71]) without approximating multiple integrals or the Lévy area during simulation. We formally show this in Appendix 9.4.
+One may also consider numerical schemes with high weak order [39]. However, analysis of this scenario is beyond the current scope.
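+For illustration, a single Milstein step for a diagonal-noise Itô SDE might look as follows (a sketch, ours rather than the library's implementation; it forms the full Jacobian for clarity):
+
+import torch
+
+def milstein_step_diagonal(f, g, t, z, dt, dW):
+    # One strong-order-1.0 Milstein step for an Ito SDE with diagonal noise,
+    # dZ_i = f_i(Z) dt + g_i(Z) dW_i. Only the diagonal entries dg_i/dz_i are
+    # needed, so no Levy areas or double integrals have to be simulated.
+    J = torch.autograd.functional.jacobian(lambda x: g(t, x), z)
+    dg_diag = torch.diagonal(J)      # dg_i / dz_i (a vjp-based version can
+                                     # avoid building the full Jacobian)
+    return (z + f(t, z) * dt + g(t, z) * dW
+            + 0.5 * g(t, z) * dg_diag * (dW ** 2 - dt))
+
+# Hypothetical usage: one step of geometric Brownian motion with diagonal noise.
+z, dt = torch.ones(4), 1e-2
+dW = dt ** 0.5 * torch.randn(4)
+z_next = milstein_step_diagonal(lambda t, z: 0.5 * z, lambda t, z: 0.2 * z,
+                                0.0, z, dt, dW)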
+
+3.4 Software and Implementation
+We have implemented several common SDE solvers in PyTorch [59] with adaptive time-stepping using a PI controller [9, 30]. Following torchdiffeq [12], we have created a user-friendly subclass of torch.autograd.Function that facilitates gradient computation using our stochastic adjoint framework for SDEs that are subclasses of torch.nn.Module. We include a short code snippet covering the main idea of the stochastic adjoint in Appendix 9.12. The complete codebase can be found at https://github.com/google-research/torchsde.
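+The public interface of the codebase looks roughly like the sketch below; this is our reading of the torchsde package, and names such as ToySDE are hypothetical, so consult the repository for the authoritative API:
+
+import torch
+import torchsde
+
+class ToySDE(torch.nn.Module):
+    noise_type = 'diagonal'
+    sde_type = 'ito'
+
+    def __init__(self, d=4):
+        super().__init__()
+        self.mu = torch.nn.Linear(d, d)
+        self.log_sigma = torch.nn.Parameter(torch.zeros(d))
+
+    def f(self, t, y):               # drift
+        return self.mu(y)
+
+    def g(self, t, y):               # diagonal diffusion, shape (batch, d)
+        return self.log_sigma.exp().expand_as(y)
+
+sde = ToySDE()
+y0 = torch.full((8, 4), 0.1)                 # (batch, state)
+ts = torch.linspace(0.0, 1.0, 20)
+ys = torchsde.sdeint_adjoint(sde, y0, ts)    # (time, batch, state)
+ys[-1].pow(2).mean().backward()              # gradients via the stochastic adjoint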
+
+4 Virtual Brownian Tree
+
+Our formulation of the adjoint can be numerically integrated efficiently, since simulating its dynamics only requires evaluating cheap vector-Jacobian products, as opposed to whole Jacobians. However, the backward-in-time nature introduces a new difficulty: The same Wiener process sample path used in the forward pass must be queried again during the backward pass. Naively storing Brownian motion increments implies a large memory consumption and complicates the usage of adaptive time-stepping integrators, where the evaluation times in the backward pass may differ from those in the forward pass.
+To overcome this issue, we combine Brownian trees with splittable pseudorandom number generators (PRNGs) to give an algorithm that can query values of a Wiener process sample path at arbitrary times. This algorithm, which we call the virtual Brownian tree, has O(1) memory cost, and time cost logarithmic with respect to the inverse error tolerance.
+
+¹A numerical scheme is of strong order $p$ if $\mathbb{E}[|X_T - X_{N\Delta}|] \le C \Delta^p$ for all $T = N\Delta$, where $X_t$ and $X_{N\Delta}$ are respectively the coupled true solution and numerical solution, $N$ and $\Delta$ are respectively the iteration index and step size such that $N\Delta = T$, and $C$ is independent of $\Delta$.
+
+<
>
+
+Figure 3: Evaluating a Brownian motion sample at time tq using a virtual Brownian tree. Our algorithm repeatedly bisects the interval, sampling from a Brownian bridge at each halving to determine intermediate values. Each call to the random number generator uses a unique key whose value depends on the path taken to reach it.
+
+4.1 Brownian Bridges and Brownian Trees
+Lévy's Brownian bridge [67] states that given a start time $t_s$ and end time $t_e$ along with their respective Wiener process values $w_s$ and $w_e$, the marginal of the process at time $t \in (t_s, t_e)$ is a normal distribution:
+
+$$\mathcal{N}\left(w_s + \frac{t - t_s}{t_e - t_s}\,(w_e - w_s),\; \frac{(t_e - t)(t - t_s)}{t_e - t_s}\right). \quad (9)$$
+
+We can recursively apply this formula to evaluate the process at the midpoint of any two distinct timestamps where the values are already known. Constructing the whole sample path of a Wiener process in this manner results in what is known as the Brownian tree [17]. Storing this tree would be memory-intensive, but we show how to reconstruct any node in this tree as desired.
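+A minimal sketch of sampling from (9) (ours; the names are illustrative):
+
+import numpy as np
+
+def sample_bridge(rng, t, ts, ws, te, we):
+    # Sample W(t) conditioned on W(ts) = ws and W(te) = we, per Eq. (9).
+    frac = (t - ts) / (te - ts)
+    mean = ws + frac * (we - ws)
+    var = (te - t) * frac            # = (te - t)(t - ts) / (te - ts)
+    return mean + np.sqrt(var) * rng.standard_normal()
+
+rng = np.random.default_rng(0)
+w_mid = sample_bridge(rng, 0.5, 0.0, 0.0, 1.0, 0.3)   # midpoint given endpoints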
+
+4.2 Brownian Trees using Splittable Seeds
+We assume access to a splittable PRNG [14], which has an operation split that deterministically generates two keys from an existing key. Given a key, the function BrownianBridge samples deterministically from (9). To obtain the Wiener process value at a specific time, we must first know or sample the values at the initial and terminal times. Then, the virtual Brownian tree recursively samples from the midpoint of Brownian bridges, each sample using a key split from that of its parent node. The algorithm terminates when the most recently sampled time is close enough to the desired time. We outline the full procedure in Algorithm 3.
+
+Algorithm 3 Virtual Brownian Tree
+
+<>
+
+This algorithm has constant memory cost. For a fixed-step-size solver taking L steps, the tolerance at which the tree will need to be queried scales as 1/L. Thus the per-step time complexity scales as log L. Our implementation uses an efficient counter-based PRNG [76] which avoids passing large random states, and instead simply passes integers. Table 1 compares the asymptotic time complexity of this approach against existing alternatives.
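+A minimal sketch of the whole procedure (ours, not the paper's implementation): the hash-based split below stands in for a proper splittable PRNG [14], and each query is deterministic given the seed.
+
+import numpy as np
+
+def split(key, branch):
+    # Stand-in for a true splittable PRNG: derive a child key deterministically.
+    return hash((key, branch)) & 0x7FFFFFFF
+
+def virtual_brownian_tree(seed, t, t0=0.0, t1=1.0, tol=1e-6):
+    # Query W(t) on [t0, t1] with W(t0) = 0, storing only integer keys:
+    # O(1) memory, O(log(1/tol)) time per query, deterministic in `seed`.
+    ws, ts, te, key = 0.0, t0, t1, seed
+    we = np.sqrt(t1 - t0) * np.random.default_rng(seed).standard_normal()
+    while te - ts > tol:
+        tm = 0.5 * (ts + te)
+        # Brownian-bridge midpoint: mean (ws + we)/2, variance (te - ts)/4.
+        rng = np.random.default_rng(split(key, 0))
+        wm = 0.5 * (ws + we) + 0.5 * np.sqrt(te - ts) * rng.standard_normal()
+        if t < tm:
+            te, we, key = tm, wm, split(key, 1)
+        else:
+            ts, ws, key = tm, wm, split(key, 2)
+    return ws + (we - ws) * (t - ts) / (te - ts)   # interpolate within tol
+
+w = virtual_brownian_tree(seed=42, t=0.37)         # same seed and t, same value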
+
+5 Latent Stochastic Differential Equations
+
+The algorithms presented in Sections 3 and 4 allow us to efficiently compute gradients of scalar objectives with respect to SDE parameters, letting us fit SDEs to data. This raises the question: Which loss to optimize?
+Simply fitting SDE parameters to maximize likelihood will in general cause overfitting, and will result in the diffusion function going to zero. In this section, we show how to do efficient variational inference in SDE models, and optimize the marginal log-likelihood to fit both prior (hyper-)parameters and the parameters of a tractable approximate posterior over functions.
+In particular, we can parameterize both a prior over functions and an approximate posterior using SDEs:
+
+$$dz_t = h_\theta(z_t, t)\, dt + \sigma(z_t, t)\, dW_t, \quad \text{(prior)}$$
+
+$$d\tilde{z}_t = h_\phi(\tilde{z}_t, t)\, dt + \sigma(\tilde{z}_t, t)\, dW_t, \quad \text{(approx. post.)}$$
+
+where $h_\theta$, $h_\phi$, and $\sigma$ are Lipschitz in both arguments, and both processes have the same starting value: $z_0 = \tilde{z}_0 \in \mathbb{R}^d$.
+If both processes share the same diffusion function $\sigma(z, t)$, then the KL divergence between them is finite (under additional mild regularity conditions; see Appendix 9.6), and can be estimated by sampling paths from the posterior process. Then, the evidence lower
+
+<
>
+
+Figure 4: Graphical models for the generative process (decoder) and recognition network (encoder) of the latent stochastic Differential equation model. This model can be viewed as a variational autoencoder with infinite-dimensional noise. Red circles represent entire function draws from Brownian motion. Given the initial state z0 and a Brownian motion sample path <>, the intermediate states <> are deterministically approximated by a numerical SDE solver.
+
+bound (ELBO) can be written as:
+
+$$\mathbb{E}\left[\sum_{i=1}^N \log p(x_{t_i} \mid z_{t_i}) - \int_0^T \frac{1}{2}\, |u(\tilde{z}_t, t)|^2\, dt\right], \quad (10)$$
+
+where $u$ satisfies $\sigma(z, t)\, u(z, t) = h_\phi(z, t) - h_\theta(z, t)$, and the expectation is taken over the approximate posterior process defined by (approx. post.). The likelihoods of the observations $x_{t_1}, \dots, x_{t_N}$ at times $t_1, \dots, t_N$ depend only on the latent states $z_t$ at corresponding times.
+To compute the gradient with respect to prior parameters $\theta$ and variational parameters $\phi$, we need only augment the forward SDE with an extra scalar variable whose drift function is $\frac{1}{2}|u(z, t)|^2$ and whose diffusion function is zero. The backward dynamics can be derived analogously using (7). We include a detailed derivation in Appendix 9.6. Thus, a stochastic estimate of the gradients of the loss w.r.t. all parameters can be computed in a single pair of forward and backward SDE solves.
+The variational parameters $\phi$ can either be optimized individually for each sequence, or, if multiple time series share parameters, an encoder network can be trained to take in the observations and output $\phi$. This architecture, shown in Figure 4, can be viewed as an infinite-dimensional variational autoencoder [35, 68].
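+As a concrete sketch of the augmentation above (ours, not the paper's exact implementation; h_post, h_prior, and sigma are hypothetical callables, and a diagonal sigma is assumed):
+
+import torch
+
+def augmented_drift(t, y, h_post, h_prior, sigma):
+    # Drift of the augmented state (z, L_t): the extra scalar accumulates the
+    # KL integrand 0.5 |u|^2 of Eq. (10) and has zero diffusion. Assumes a
+    # diagonal sigma, so u = (h_post - h_prior) / sigma elementwise.
+    z = y[:, :-1]
+    u = (h_post(t, z) - h_prior(t, z)) / sigma(t, z)
+    dkl = 0.5 * (u ** 2).sum(dim=1, keepdim=True)
+    return torch.cat([h_post(t, z), dkl], dim=1)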
+6 Related Work
+Sensitivity Analysis for SDEs. Gradient computation is closely related to sensitivity analysis. Computing gradients with respect to parameters of vector fields of an SDE has been extensively studied in the stochastic control literature [42]. In particular, for low dimensional problems, this is done effectively using dynamic programming [7] and finite differences [20, 43]. However, both approaches scale poorly with the dimensionality of the parameter vector.
+Analogous to REINFORCE (or the score-function estimator) [21, 37, 88], Yang and Kushner [89] considered deriving the gradient as $\nabla \mathbb{E}[L(Z_T)] = \mathbb{E}[L(Z_T) H]$ for some random variable $H$. However, $H$ usually depends on the density of $Z_T$ with respect to the Lebesgue measure, which can be difficult to compute. Gobet and Munos [22] extended this approach by weakening a non-degeneracy condition using Malliavin calculus [53].
+Closely related to the current approach is the pathwise method [89], which is also a continuous-time analog of the reparameterization trick [35, 68]. Existing methods in this regime [22, 45, 82] all require simulating a (forward) SDE where each step requires computing entire Jacobian matrices. This computational cost is prohibitive for high-dimensional systems with a large number of parameters.
+Based on the Euler discretization, Giles and Glasserman [19] considered simply performing reverse-mode automatic differentiation through all intermediate steps. They named this method the adjoint approach, which, by modern standards, is a form of "backpropagation through the operations of a numerical solver". This approach, widely adopted in the field of finance for calibrating market models [19], has high memory cost, and relies on a fixed Euler-Maruyama discretization. Recently, this approach was also used by Hegde et al. [27] to learn parameterized drift and diffusion functions of an SDE. In scientific computing, Innes et al. [31] considered backpropagating through high-order implicit SDE solvers.
+
+Figure 5: (a) Same fixed step size used in both forward and reverse simulation. Boxplot generated by repeating the experiment with different Brownian motion sample paths 64 times. (b) Colors of dots represent tolerance levels and correspond to the colorbar on the right. Only atol was varied and rtol was set to 0.
+In the machine learning literature, Ryder et al. [75] perform variational inference over the state and parameters for Euler-discretized latent SDEs and optimize the model with regular backpropagation. This approach should not be confused with the formulation of variational inference for non-discretized SDEs presented in previous works [25, 57, 82] and our work, as it is unclear whether the limit of their discretization corresponds to that obtained by operating with continuous-time SDEs using Girsanov's theorem.
+Backward SDEs. Our stochastic adjoint process relies on the notion of backward SDEs devised by Kunita [41], which is based on two-sided filtrations. This is different from the more traditional notion of backward SDEs where only a single filtration is defined [58, 62].
+Based on the latter notion, forward-backward SDEs (FBSDEs) have been proposed to solve stochastic optimal control problems [63]. However, simulating FBSDEs is costly due to the need to estimate conditional expectations in the backward pass [58].
+Bayesian Learning of SDEs. Recent works considered the problem of inferring an approximate posterior SDE given observed data under a prior SDE with the same diffusion coefficient [25, 57, 82]. The special case with constant diffusion coefficients was considered more than a decade ago [5]. Notably, computing the KL divergence between two SDEs over a finite time horizon was well-explored in the control literature [33, 80]. We include background on this topic in Appendix 9.5.
+Bayesian learning and parameter estimation for SDEs have a long history [24]. Techniques which don't require positing a variational family, such as the extended Kalman filter and Markov chain Monte Carlo, have been considered in the literature [50].
+7 Experiments
+The aim of this section is threefold. We first empirically verify our theory by comparing the gradients obtained by our stochastic adjoint framework against analytically derived gradients for problems having closed-form solutions. We then fit latent SDE models with our framework on two synthetic datasets, verifying that the variational inference framework allows learning a generative model of time series. Finally, we learn dynamics parameterized by neural networks with a latent SDE from a motion capture dataset, demonstrating competitive performance compared to existing approaches.
+We report results based on an implementation of Brownian motion that stores all intermediate queries. The virtual Brownian tree allowed training with much larger batch sizes on GPUs, but was not necessary for our small-scale experiments. Notably, our adjoint approach, even when combined with the Brownian motion implementation that stores noise, was able to reduce the memory usage by 1/2-1/3 compared to directly backpropagating through solver operations on the tasks we considered.
+7.1 Numerical Studies
+We consider three test problems (examples 1-3 from [66]; details in Appendix 9.7), all of which have closed-form solutions. We compare the gradient computed from simulating our stochastic adjoint process using the Milstein scheme against the exact gradient. Figure 5(a) shows that for test example 2, the error between the adjoint gradient and analytical gradient decreases with step size.
+For all three test problems, the mean squared error across dimensions tends to be smaller as the absolute tolerance of the adaptive solver is reduced (e.g. see Fig. 5 (b)). However, the Number of Function Evaluations (NFEs) tends to be much larger than that in the ODE case [12].
+
+Additionally, for two out of three test problems, we found that our adjoint approach with the Milstein scheme and fixed step size can be much more time-efficient than regular backpropagation through operations of the Milstein and Euler schemes (see e.g. Fig. 5(c)). Backpropagating through the Euler scheme gives gradients of higher error compared to the Milstein method. On the other hand, directly backpropagating through the Milstein solve requires evaluating high-order derivatives and can be costly.
+Results for examples 1 and 3 are in Appendix 9.8.
+
+Figure 6: Learned posterior and prior dynamics on data from a stochastic Lorenz attractor. All samples from our model are continuous-time paths, and form a multi-modal, non-Gaussian distribution.
+7.2 Synthetic Datasets
+We trained latent SDEs with our adjoint framework to recover (1) a 1D Geometric Brownian motion, and (2) a 3D stochastic Lorenz attractor process. The main objective is to verify that the learned posterior can reconstruct the training data, and that the learned priors are not deterministic. We jointly optimize the evidence lower bound (10) with respect to parameters of the prior and posterior distributions at the initial latent state z0, the prior and posterior drift, the diffusion function, the encoder, and the decoder. We include the details of datasets and architectures in Appendix 9.9.
+For the stochastic Lorenz attractor, not only is the model able to reconstruct the data well, but also the learned prior process can produce bimodal samples in both data and latent space. This is showcased in the last row of Figure 6 where the latent and data space samples cluster around two modes. This is hard to achieve using a latent ODE with a unimodal Gaussian initial approximate posterior. We include additional visualizations in Appendix 9.10.
+7.3 Motion Capture Dataset
+To demonstrate that latent SDEs can learn complex dynamics from real-world datasets, we evaluated their predictive performance on a 50-dimensional motion capture dataset. The dataset, from Gan et al. [18], consists of 23 walking sequences of subject 35 partitioned into 16 training, 3 validation, and 4 test sequences. We follow the preprocessing of Wang et al. [85].
+In designing the recognition network, we follow Yıldız et al. [90] and use a fully connected network to encode the first three observations of each sequence, thereafter predicting the remaining sequence. This encoder is chosen for fair comparison to existing models, and could be extended to a recurrent or attention model [84]. The overall architecture is described in Appendix 9.11 and is similar to that of ODE2VAE [90], with a similar number of parameters. We also use a fixed step size that is 1/5 of the smallest interval between any two observations [90].
+We train latent ODE and latent SDE models with the Adam optimizer [34] and its default hyperparameter settings, with an initial learning rate of 0.01 that is exponentially decayed with rate 0.999 during each iteration. We perform validation over the number of training iterations, KL penalty [29], and KL annealing schedule. All models were trained for at most 400 iterations, where we start to observe severe overfitting for most model instances. We report the test MSE on future observations following Yıldız et al. [90]. We believe that the improved performance is due to the strong regularization in path space, as removing the KL penalty improved training error but caused validation error to deteriorate.
+
+Table 2: Test MSE on 297 future frames averaged over 50 samples. 95% confidence interval reported based on t-statistic; baseline results from [90].
+
+<
>
+
+8 Discussion
+
+We presented a generalization of the adjoint sensitivity method to compute gradients through solutions of SDEs. In contrast to existing approaches, this method has nearly the same time and memory complexity as simply solving the SDE. We showed how our stochastic adjoint framework can be combined with a gradient-based stochastic variational inference scheme for training latent SDEs.
+It is worthwhile to mention that SDEs and the commonly used GP models define two distinct classes of stochastic processes, albeit having a nonempty intersection (e.g. Ornstein-Uhlenbeck processes fall under both). Computationally, the cost of fitting GPs lies in the matrix inversion, whereas the computational bottleneck of training SDEs is the sequential numerical solve. Empirically, another avenue of research is to reduce the variance of gradient estimates; in the future, we may adopt techniques such as control variates or antithetic paths.
+On the application side, our method opens up a broad set of opportunities for fitting any differentiable SDE model, such as Wright-Fisher models with selection and mutation parameters [15], derivative pricing models in finance, or infinitely-deep Bayesian neural networks [61]. In addition, the latent SDE model enabled by our framework can be extended to include domain knowledge and structural or stationarity constraints [48] in the prior process for specific applications.
+On the theory side, there remain fundamental questions to be answered. Convergence rates of numerical gradients estimated with general schemes are unknown. Additionally, since our analyses are based on strong orders of schemes, it is natural to question whether convergence results still hold when we consider weak errors, and moreover if the method could be reformulated more coherently with rough paths theory [47].
+
+Acknowledgements
+We thank Yulia Rubanova, Danijar Hafner, Mufan Li, Shengyang Sun, Kenneth R. Jackson, Simo Särkkä, Daniel Lacker, and Philippe Casgrain for helpful discussions. We thank Çağatay Yıldız for helpful discussions regarding evaluation settings of the mocap task. We also thank Guodong Zhang, Kevin Swersky, Chris Rackauckas, and members of the Vector Institute for helpful comments on an early draft of this paper.
+
+References
+[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
+[2] R. Adams. Sobolev Spaces. Academic Press, 1975.
+[3] Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, Arenberg Doctoral School, KU Leuven, 2013.
+[4] Joel Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019.
+[5] Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John S Shawe-Taylor. Variational inference for diffusion processes. In Advances in Neural Information Processing Systems, pages 17–24, 2008.
+[6] VI Arnold. Ordinary Differential Equations. The MIT Press, 1978.
+[7] Jonathan Baxter and Peter L Bartlett. Infinite-horizon gradient-based policy search. 2001.
+[8] Robert Brown. ... microscopical observations ... on the particles contained in the pollen of plants. The Philosophical Magazine, 4(21):161–173, 1828.
+[9] Pamela M Burrage, R Herdiana, and Kevin Burrage. Adaptive stepsize based on control theory for stochastic differential equations. Journal of Computational and Applied Mathematics, 170(2):317–336, 2004.
+[10] Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
+[11] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
+[12] Ricky Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
+[13] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
+[14] Koen Claessen and Michał H Pałka. Splittable pseudorandom number generators using cryptographic hashing. In ACM SIGPLAN Notices, volume 48, pages 47–58. ACM, 2013.
+[15] Warren J Ewens. Mathematical population genetics 1: theoretical introduction, volume 27. Springer Science & Business Media, 2012.
+[16] Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing, 2018.
+[17] Jessica G Gaines and Terry J Lyons. Variable step size control in the numerical solution of stochastic differential equations. SIAM Journal on Applied Mathematics, 57(5):1455–1484, 1997.
+[18] Zhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pages 2467–2475, 2015.
+[19] Mike Giles and Paul Glasserman. Smoking adjoints: Fast Monte Carlo Greeks. Risk, 19(1):88–92, 2006.
+[20] Paul Glasserman and David D Yao. Some guidelines and guarantees for common random numbers. Management Science, 38(6):884–908, 1992.
+[21] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
+[22] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.
+[23] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2019.
+[24] Narendra Gupta and Raman Mehra. Computational aspects of maximum likelihood estimation and reduction in sensitivity function calculations. IEEE Transactions on Automatic Control, 19(6):774–783, 1974.
+[25] Jung-Su Ha, Young-Jin Park, Hyeok-Joo Chae, Soon-Seo Park, and Han-Lim Choi. Adaptive path-integral autoencoders: Representation learning and planning for dynamical systems. In Advances in Neural Information Processing Systems, pages 8927–8938, 2018.
+[26] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
+[27] Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential Gaussian process flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1812–1821, 2019.
+[28] Markus Heinonen, Cagatay Yildiz, Henrik Mannerström, Jukka Intosalmi, and Harri Lähdesmäki. Learning unknown ODE models with Gaussian processes. arXiv preprint arXiv:1803.04303, 2018.
+[29] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.
+[30] Silvana Ilie, Kenneth R Jackson, and Wayne H Enright. Adaptive time-stepping for the strong numerical solution of stochastic differential equations. Numerical Algorithms, 68(4):791–812, 2015.
+[31] Mike Innes, Alan Edelman, Keno Fischer, Chris Rackauckas, Elliot Saba, Viral B Shah, and Will Tebbutt. Zygote: A differentiable programming system to bridge machine learning and scientific computing. arXiv preprint arXiv:1907.07587, 2019.
+[32] Junteng Jia and Austin R. Benson. Neural jump stochastic differential equations. arXiv e-prints, art. arXiv:1905.10403, May 2019.
+[33] Hilbert Johan Kappen and Hans Christian Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244–1266, 2016.
+[34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
+[35] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
+[36] Genshiro Kitagawa and Will Gersch. Linear Gaussian state space modeling. In Smoothness Priors Analysis of Time Series, pages 55–65. Springer, 1996.
+[37] Jack PC Kleijnen and Reuven Y Rubinstein. Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3):413–427, 1996.
+[38] Peter E Kloeden and Andreas Neuenkirch. The pathwise convergence of approximation schemes for stochastic differential equations. LMS Journal of Computation and Mathematics, 10:235–253, 2007.
+[39] Peter E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations, volume 23. Springer Science & Business Media, 2013.
+[40] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
+[41] Hiroshi Kunita. Stochastic Flows and Jump-Diffusions. Springer, 2019.
+[42] Harold Kushner and Paul G Dupuis. Numerical methods for stochastic control problems in continuous time, volume 24. Springer Science & Business Media, 2013.
+[43] Pierre L'Ecuyer and Gaétan Perron. On the convergence rates of IPA and FDC derivative estimators. Operations Research, 42(4):643–656, 1994.
+[44] Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.
+[45] Xuanqing Liu, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019.
+[46] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.
+[47] Terry J Lyons. Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2):215–310, 1998.
+[48] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
+[49] Dougal Maclaurin, David Duvenaud, M Johnson, and RP Adams. Autograd: Reverse-mode differentiation of native Python. In ICML workshop on Automatic Machine Learning, 2015.
+[50] Isambi S Mbalawata, Simo Särkkä, and Heikki Haario. Parameter estimation in stochastic differential equations with Markov chain Monte Carlo and non-linear Kalman filtering. Computational Statistics, 28(3):1195–1223, 2013.
+[51] Grigori Noah Milstein and Michael V Tretyakov. Stochastic Numerics for Mathematical Physics. Springer Science & Business Media, 2013.
+[52] Grigorii Noikhovich Milstein. Numerical integration of stochastic differential equations, volume 313. Springer Science & Business Media, 1994.
+[53] Ivan Nourdin and Giovanni Peccati. Normal approximations with Malliavin calculus: from Stein's method to universality, volume 192. Cambridge University Press, 2012.
+[54] Daniel Ocone and Étienne Pardoux. A generalized Itô–Ventzell formula. Application to a class of anticipating stochastic differential equations. 25(1):39–71, 1989.
+[55] Bernt Øksendal. Stochastic Differential Equations. Springer, 2003.
+[56] Bernt Øksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
+[57] Manfred Opper. Variational inference for stochastic differential equations. Annalen der Physik, 531(3):1800233, 2019.
+[58] Etienne Pardoux and Shige Peng. Backward stochastic differential equations and quasilinear parabolic partial differential equations. In Stochastic Partial Differential Equations and Their Applications, pages 200–217. Springer, 1992.
+[59] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
+[60] Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212–1228, 1995.
+[61] Stefano Peluchetti and Stefano Favaro. Neural stochastic differential equations. arXiv preprint arXiv:1904.01681, 2019.
+[62] Shige Peng. A general stochastic maximum principle for optimal control problems. SIAM Journal on Control and Optimization, 28(4):966–979, 1990.
+[63] Shige Peng and Zhen Wu. Fully coupled forward-backward stochastic differential equations and applications to optimal control. SIAM Journal on Control and Optimization, 37(3):825–843, 1999.
+[64] Eckhard Platen. An introduction to numerical methods for stochastic differential equations. Acta Numerica, 8:197–246, 1999.
+[65] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018.
+[66] Christopher Rackauckas and Qing Nie. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete and Continuous Dynamical Systems. Series B, 22(7):2731, 2017.
+[67] Daniel Revuz and Marc Yor. Continuous martingales and Brownian motion, volume 293. Springer Science & Business Media, 2013.
+[68] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
+[69] L Chris G Rogers and David Williams. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus, volume 2. Cambridge University Press, 2000.
+[70] Andreas Rößler. Runge–Kutta methods for Stratonovich stochastic differential equation systems with commutative noise. Journal of Computational and Applied Mathematics, 164:613–627, 2004.
+[71] Andreas Rößler. Runge–Kutta methods for the strong approximation of solutions of stochastic differential equations. SIAM Journal on Numerical Analysis, 48(3):922–952, 2010.
+[72] Yulia Rubanova, Ricky TQ Chen, and David Duvenaud. Latent ODEs for irregularly-sampled time series. Neural Information Processing Systems, 2019.
+[73] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
+[74] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.
+[75] Thomas Ryder, Andrew Golightly, A Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. arXiv preprint arXiv:1802.03335, 2018.
+[76] John K Salmon, Mark A Moraes, Ron O Dror, and David E Shaw. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 16. ACM, 2011.
+[77] Simo Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013.
+[78] Simo Särkkä and Arno Solin. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.
+[79] Steven E Shreve. Stochastic calculus for finance II: Continuous-time models, volume 11. Springer Science & Business Media, 2004.
+[80] Evangelos Theodorou. Nonlinear stochastic control and information theoretic dualities: Connections, interdependencies and thermodynamic interpretations. Entropy, 17(5):3352–3375, 2015.
+[81] Ryan Turner, Marc Deisenroth, and Carl Rasmussen. State-space inference and learning with Gaussian processes. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 868–875, 2010.
+[82] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
+[83] Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. Proceedings of the Conference on Learning Theory, 2019.
+[84] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
+[85] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007.
+[86] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
+[87] Magnus Wiktorsson et al. Joint characteristic function and simultaneous simulation of iterated Itô integrals for multiple independent Brownian motions. The Annals of Applied Probability, 11(2):470–487, 2001.
+[88] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
+[89] Jichuan Yang and Harold J Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.
+[90] Çağatay Yıldız, Markus Heinonen, and Harri Lähdesmäki. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks. arXiv preprint arXiv:1905.10994, 2019.
+
+9 Appendix
+
+Notation. For a fixed terminal time $T > 0$, we denote by $\mathcal{T} = [0, T]$ the time horizon. Let $C^\infty$ be the class of infinitely differentiable functions from $\mathbb{R}^d$ to itself. Let $C^{p,q}$ be the class of functions from $\mathbb{R}^d \times \mathcal{T}$ to $\mathbb{R}^d$ that are $p$ and $q$ times continuously differentiable in the first and second component, respectively. Let $C_b^{p,q}$ be the subclass with bounded derivatives of all possible orders. For a positive integer $m$, we adopt the shorthand $[m] = \{1, 2, \dots, m\}$. We denote the Euclidean norm of a vector $v$ by $|v|$. For $f \in C^{p,q}$, we denote its Jacobian with respect to the first component by $\nabla f$.
+
+9.1 Proof of Lemma 3.1
+Proof of Lemma 3.1. We have <>, where <> is defined in (3). Now we take the gradient with respect to $z$ on both sides. The solution is differentiable with respect to $z$ and we may differentiate under the stochastic integral [41, Proposition 2.4.3]; Theorem 3.4.3 of [41] is sufficient for the regularity conditions required. Since <>, applying the Stratonovich version of Itô's formula to (4), we have (5).
+
+9.2 Proof of Theorem 3.3
+Proof of Theorem 3.3. By the triangle inequality,
+
+<>
+
+We show that both $I^{(1)}$ and $I^{(2)}$ converge to 0 in probability as $h \to 0$. For simplicity, we suppress $z$ and $W_\cdot$.
+Bounding $I^{(1)}$. Let $\epsilon > 0$ be given. Since $G^h \to G$ in probability, there exist $M_1 > 0$ and $h_0 > 0$ such that <>, <>, for all <>.
+By Lemma 2.1 (iv) of Ocone and Pardoux [54], which can be easily adapted to our context, there exists a positive random variable $C_1$, finite almost surely, such that <>, and there exists $M_2 > 0$ such that <>. Given $M_2$, there exists $h_1 > 0$ such that
+
+<>
+
+Now, suppose <>. Then, by the union bound, with probability at least $1 - \epsilon$, we have
+
+<>
+
+On this event, we have
+
+<>
+
+Thus, we have shown that $I^{(1)}$ converges to 0 in probability as $h \to 0$.
+Bounding $I^{(2)}$. The idea is similar. By condition (ii), we have
+
+<>
+
+in probability. Using this and condition (i), for given <>, there exist <> and <> such that for all <>, we have
+
+<>
+
+with probability at least $1 - \epsilon$. On this event, we have
+
+<>
+
+Thus $I^{(2)}$ also converges to 0 in probability as $h \to 0$.
+
+9.3 Euler-Maruyama Scheme satisfies Local Uniform Convergence
+Here we verify that the Euler-Maruyama scheme satisfies condition (ii) when $d = 1$. Our proof can be extended to the case where $d > 1$ assuming an $L^p$ estimate of the error; see the discussion after the proof of Proposition 9.1.
+Proposition 9.1. Let $F^h(z)$ be the Euler-Maruyama discretization with mesh size $h$ of $F(z)$ for a 1-dimensional SDE. Then, for any compact $K \subset \mathbb{R}$, we have
+
+<>
+
+Usual convergence results in stochastic numerics only control the error for a single fixed starting point. Here, we strengthen the result to local uniform convergence. Our main idea is to apply a Sobolev inequality argument [54, Part II]. To do so, we need some preliminary results about the Euler-Maruyama discretization of the original SDE and its derivative. We first recall a theorem characterizing the expected squared error for general schemes.
+Theorem 9.2 (Mean-square order of convergence [51, Theorem 1.1]). Let $\{Z_t\}_{t \in \mathcal{T}}$ be the solution to an Itô SDE, and $\{Z_t^h\}_{t \in \mathcal{T}}$ be a numerical discretization with fixed step size $h$, both of which are started at $z$ and defined on the same probability space. Let the coefficients of the SDE be <>. Furthermore, suppose that the numerical scheme has order of accuracy $p_1$ for the expectation of deviation and order of accuracy $p_2$ for the mean-square deviation. If $p_1 \ge p_2 + \tfrac{1}{2}$ and $p_2 \ge \tfrac{1}{2}$, then, for any <>, we have <>
+for a constant $C$ that does not depend on $h$ or $z$.
+
+We refer the reader to [51] for the precise definitions of orders of accuracy and the proof. Given this theorem, we establish an estimate regarding errors of the discretization and its derivative with respect to the initial position.
+
+Lemma 9.3. We have
+
+ <>,
+
+where C1 is a constant independent of z and h.
+
+Proof of Lemma 9.3. Since the coefficients of the SDE are of class $C_b^{\infty,1}$, we may differentiate the SDE in $z$ to get the SDE for the derivative $\nabla_z Z^z$ [41]. Specifically, letting <>, we have
+
+<>
+
+Note that the augmented process $(F(z), \nabla_z F(z))$ satisfies an SDE with $C_b^{\infty,1}$ coefficients. By the chain rule, one can easily show that the derivative of the Euler-Maruyama discretization $F^h(z)$ is the discretization of the derivative process $Y^z$. Thus, $(F^h(z), \nabla_z F^h(z))$ is simply the discretization of $(F(z), \nabla_z F(z))$.
+Since the Euler-Maruyama scheme has orders of accuracy (p1,p2) = (1.5, 1.0) [51, Section 1.1.5], by Theorem 9.2, we have
+
+<>
+
+for some constant C1 that does not depend on z or h.
+
+We also recall a variant of the Sobolev inequality, which we will apply for $d = 1$.
+Theorem 9.4 (Sobolev inequality [2, Theorem 5.4.1.c]). For any $p > d$, there exists a universal constant $c_p$ such that
+
+<>
+
+where
+
+<>
+
+for all continuously differentiable <>.
+
+Proof of Proposition 9.1. Define $H_\lambda : \mathbb{R} \to \mathbb{R}$, regarded as a random function <>, by
+
+<>
+
+where $\lambda > 0$ is a fixed constant. Since $H_\lambda$ is continuously differentiable a.s., by Theorem 9.4,
+
+<>,
+
+Without loss of generality, we may let the compact set be <> where <>. Then,
+
+<> (11)
+
+It remains to estimate <>. Starting from the definition of $H_\lambda$, a standard estimation yields
+
+<>
+
+where $C_2$ is a deterministic constant depending only on $\lambda$ (but not $z$ and $h$).
+Now we take expectation on both sides. By Lemma 9.3, we have
+
+<>
+
+where the last integral is finite since <>.
+
+We have shown that <>. Thus $\|H_\lambda\| \to 0$ in $L^2$, and hence also in probability, as $h \to 0$. From equation (11), we have that <> converges to 0 in probability as $h \to 0$.
+It is clear from the above proof that we may generalize to the case where $d > 1$ and to other numerical schemes if we can bound the expected $W^{1,p}$-norm of <> in terms of $z$ and $h$, for $p > d$, where $W^{1,p}$ here denotes the Sobolev space consisting of all real-valued functions on $\mathbb{R}^d$ whose weak derivatives are functions in $L^p$. For the Euler scheme and <>, we need only bound the $L^p$ norm of the discretization error in terms of $z$ and $h$ for general $p$. To achieve this, we would need to make explicit the dependence on $z$ for existing estimates (see e.g. [39, Chapter 10]).
+Generically extending the argument to other numerical schemes, however, is technically non-trivial. We plan to address this question in future research.
+
+9.4 Stochastic Adjoint has Commutative Noise when Original SDE has Diagonal Noise
+Recall the Stratonovich SDE (2) with drift and diffusion functions $b$ and $\sigma$ governed by a set of parameters $\theta$. Consider the augmented state composed of the original state and parameters $Y_t = (Z_t, \theta)$. The augmented state satisfies a Stratonovich SDE with the drift function $f(y, t) = (b(z, t), 0)$ and diffusion functions $g_i(y, t) = (\sigma_i(z, t), 0)$ for $i \in [m]$. By (5) and (6), the dynamics for the adjoint process of the augmented state is characterized by the backward SDE:
+
+<>
+
+By the definitions of $f$ and $g_i$, the Jacobian matrices $\nabla f(x, s)$ and $\nabla g_i(x, s)$ can be written as:
+
+<>
+
+Thus, we can write out the backward SDEs for the adjoint processes of the state and parameters separately:
+
+<>
+
+Now assume the original SDE has diagonal noise. Then, $m = d$ and the Jacobian matrix $\nabla \sigma_i(z)$ can be written as:
+
+<>
+
+Consider the adjoint process for the augmented state along with the backward flow of the backward SDE (3). We write the overall state as <>, where we abuse notation slightly to let <> denote the backward flow process. Then, by (12) and (13), $\{X_t\}_{t \in \mathcal{T}}$ satisfies a backward SDE with a diffusion function that can be written as:
+
+<>
+
+Recall that an SDE with diffusion function $\sigma$ is said to satisfy the commutativity property [70] if
+
+$$\sum_{i=1}^d \sigma_{i,j_1}(x)\, \frac{\partial \sigma_{k,j_2}(x)}{\partial x_i} = \sum_{i=1}^d \sigma_{i,j_2}(x)\, \frac{\partial \sigma_{k,j_1}(x)}{\partial x_i} \quad (15)$$
+
+for all $j_1, j_2 \in [m]$ and $k \in [d]$. When an SDE has commutative noise, the computationally intensive double Itô integrals (and the Lévy areas) need not be simulated; the numerical scheme instead takes advantage of the following property of iterated integrals [30]:
+
+$$\int_s^t \int_s^v dW_u^{(j_1)}\, dW_v^{(j_2)} + \int_s^t \int_s^v dW_u^{(j_2)}\, dW_v^{(j_1)} = \Delta W^{(j_1)}\, \Delta W^{(j_2)},$$
+
+where the Brownian motion increment $\Delta W^{(j)} = W_t^{(j)} - W_s^{(j)}$ for $j \in [m]$ can be easily sampled. To see that the diffusion function (14) indeed satisfies the commutativity condition (15), we consider several cases:
+
+<> Both LHS and RHS are zero unless $j_1 = k$, since for $\sigma_{i,j_2}(x)$ to be non-zero, <>
+
+<> Similar to the case above.
+
+<> Write <>, where <>. Both LHS and RHS are zero unless <>, since
+
+<>
+
+for <> to be non-zero, <> or <> and <>.
+
+Since in all scenarios LHS = RHS, we conclude that the commutativity condition holds.
+Finally, we comment that the Milstein scheme for the stochastic adjoint of diagonal noise SDEs can be implemented such that, during each iteration of the backward solve, vjp is called only a number of times independent of the dimensionality of the original SDE.
+
+9.5 Background on Latent SDE
+
+Consider a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in [0,T]}, P)$, where $T > 0$ is a finite time horizon.
+Recall the approximate posterior process that we intend to learn is governed by the SDE:
+
+<>, (16)
+
+Suppose there exists a measurable function u(z, t) such that <>, and <> satisfies Novikov's condition, i.e. <>. Novikov's condition ensures that the process
+
+<>
+
+is a P-martingale. By Girsanov Theorem II [56, Theorem 8.6.4], the process <> is a Wiener process under the probability measure Q defined by
+
+<>,
+
+Moreover, since a simple rewrite shows that
+
+<>, (17)
+
+we conclude that the Q-law of (17) (or equivalently (16)) is the same as the P-law of the prior process.
+
+9.5.1 Deriving the Variational Bound
+
+Let $x_{t_1}, \dots, x_{t_N}$ be observation data at times $t_1, \dots, t_N$, whose conditionals only depend on the respective latent states $z_{t_1}, \dots, z_{t_N}$. Since the Q-law of the approximate posterior is the same as the P-law of the prior,
+
+<>
+
+where the second line follows from the definition of Q and the third line follows from Jensen's inequality. In the last equality we used the fact that the Itô integral <> is a martingale.
+
+9.6 Stochastic Adjoint for Latent SDE
+
+Note that the variational free energy (10) can be derived from Girsanov's change of measure theorem [57]. To efficiently Monte Carlo estimate this quantity and its gradient, we simplify the equation by noting that for a one-dimensional process <> adapted to the filtration generated by a one-dimensional Wiener process <>, if Novikov's condition [55] is satisfied, then the process defined by the Itô integral $\int_0^t V_s\, dW_s$ is a martingale [55]. Hence, <>, and
+
+<>
+
+To Monte Carlo simulate the quantity in the forward pass along with the original dynamics, we need only extend the original augmented state with an extra variable Lt such that the new drift and diffusion functions for the new augmented state <> are
+
+<>
+
+By (7), the backward SDEs of the adjoint processes become
+
+<> (18)
+
+In this case, we need not simulate the backward SDE of the extra variable, nor its adjoint. Moreover, when considered as a single system for the augmented adjoint state, the diffusion function of the backward SDE (18) satisfies the commutativity property (15).
+
+9.7 Test Problems
+
+In the following, <> and $p$ are parameters of the SDEs, and $x_0$ is a fixed initial value.
+
+Example 1.
+
+<>
+
+Analytical solution:
+
+<>
+
+Example 2.
+
+<>
+
+Analytical solution:
+
+<>
+
+Example 3.
+
+<>
+
+Analytical solution:
+
+<>
+
+In each numerical experiment, we duplicate the equation 10 times to obtain a system of SDEs where each dimension has its own parameter values, sampled from a standard Gaussian distribution and then passed through a sigmoid to ensure positivity. Moreover, we also sample the initial value for each dimension from a Gaussian distribution.
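+A sketch of this construction (make_test_system is a hypothetical helper, not from the paper's codebase):
+
+import torch
+
+def make_test_system(d=10):
+    # Per-dimension parameters: standard Gaussian samples pushed through a
+    # sigmoid for positivity, plus Gaussian initial values.
+    params = torch.sigmoid(torch.randn(d))
+    x0 = torch.randn(d)
+    return params, x0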
+
+<