diff --git a/dev/data/dataloader/index.html b/dev/data/dataloader/index.html
index 868f9e9b..f3faeaab 100644
--- a/dev/data/dataloader/index.html
+++ b/dev/data/dataloader/index.html
@@ -27,4 +27,4 @@ end
 
 # train for 10 epochs
 using IterTools: ncycle 
-Flux.train!(loss, ps, ncycle(dtrain, 10), opt)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/data/dataloader.jl#L13-L50">source</a></section><footer><hr/><a class="previous" href="../onehot/"><span class="direction">Previous</span><span class="title">One-Hot Encoding</span></a><a class="next" href="../../training/optimisers/"><span class="direction">Next</span><span class="title">Optimisers</span></a></footer></article></body></html>
+Flux.train!(loss, ps, ncycle(dtrain, 10), opt)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/data/dataloader.jl#L13-L50">source</a></section><footer><hr/><a class="previous" href="../onehot/"><span class="direction">Previous</span><span class="title">One-Hot Encoding</span></a><a class="next" href="../../training/optimisers/"><span class="direction">Next</span><span class="title">Optimisers</span></a></footer></article></body></html>
diff --git a/dev/models/layers/index.html b/dev/models/layers/index.html
index 1a91405f..4a8b587b 100644
--- a/dev/models/layers/index.html
+++ b/dev/models/layers/index.html
@@ -11,41 +11,41 @@ m(5) == 26
 
 m = Chain(Dense(10, 5), Dense(5, 2))
 x = rand(10)
-m(x) == m[2](m[1](x))</code></pre><p><code>Chain</code> also supports indexing and slicing, e.g. <code>m[2]</code> or <code>m[1:end-1]</code>. <code>m[1:3](x)</code> will calculate the output of the first three layers.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/basic.jl#L1-L18">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dense" href="#Flux.Dense"><code>Flux.Dense</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dense(in::Integer, out::Integer, σ = identity)</code></pre><p>Creates a traditional <code>Dense</code> layer with parameters <code>W</code> and <code>b</code>.</p><pre><code class="language-none">y = σ.(W * x .+ b)</code></pre><p>The input <code>x</code> must be a vector of length <code>in</code>, or a batch of vectors represented as an <code>in × N</code> matrix. The out <code>y</code> will be a vector or batch of length <code>out</code>.</p><pre><code class="language-julia">julia&gt; d = Dense(5, 2)
+m(x) == m[2](m[1](x))</code></pre><p><code>Chain</code> also supports indexing and slicing, e.g. <code>m[2]</code> or <code>m[1:end-1]</code>. <code>m[1:3](x)</code> will calculate the output of the first three layers.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/basic.jl#L1-L18">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dense" href="#Flux.Dense"><code>Flux.Dense</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dense(in::Integer, out::Integer, σ = identity)</code></pre><p>Creates a traditional <code>Dense</code> layer with parameters <code>W</code> and <code>b</code>.</p><pre><code class="language-none">y = σ.(W * x .+ b)</code></pre><p>The input <code>x</code> must be a vector of length <code>in</code>, or a batch of vectors represented as an <code>in × N</code> matrix. The out <code>y</code> will be a vector or batch of length <code>out</code>.</p><pre><code class="language-julia">julia&gt; d = Dense(5, 2)
 Dense(5, 2)
 
 julia&gt; d(rand(5))
 Tracked 2-element Array{Float64,1}:
   0.00257447
-  -0.00449443</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/basic.jl#L78-L97">source</a></section><h2><a class="nav-anchor" id="Convolution-and-Pooling-Layers-1" href="#Convolution-and-Pooling-Layers-1">Convolution and Pooling Layers</a></h2><p>These layers are used to build convolutional neural networks (CNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Conv" href="#Flux.Conv"><code>Flux.Conv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Conv(size, in=&gt;out)
+  -0.00449443</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/basic.jl#L78-L97">source</a></section><h2><a class="nav-anchor" id="Convolution-and-Pooling-Layers-1" href="#Convolution-and-Pooling-Layers-1">Convolution and Pooling Layers</a></h2><p>These layers are used to build convolutional neural networks (CNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Conv" href="#Flux.Conv"><code>Flux.Conv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Conv(size, in=&gt;out)
 Conv(size, in=&gt;out, relu)</code></pre><p>Standard convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Example: Applying Conv layer to a 1-channel input using a 2x2 window size,          giving us a 16-channel output. Output is activated with ReLU.</p><pre><code class="language-none">size = (2,2)
 in = 1
 out = 16
-Conv((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, batch size). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/conv.jl#L10-L30">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MaxPool" href="#Flux.MaxPool"><code>Flux.MaxPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MaxPool(k)</code></pre><p>Max pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/conv.jl#L307-L313">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MeanPool" href="#Flux.MeanPool"><code>Flux.MeanPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MeanPool(k)</code></pre><p>Mean pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/conv.jl#L338-L344">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.DepthwiseConv" href="#Flux.DepthwiseConv"><code>Flux.DepthwiseConv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">DepthwiseConv(size, in=&gt;out)
-DepthwiseConv(size, in=&gt;out, relu)</code></pre><p>Depthwise convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively. Note that <code>out</code> must be an integer multiple of <code>in</code>.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/conv.jl#L166-L178">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.ConvTranspose" href="#Flux.ConvTranspose"><code>Flux.ConvTranspose</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ConvTranspose(size, in=&gt;out)
-ConvTranspose(size, in=&gt;out, relu)</code></pre><p>Standard convolutional transpose layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/conv.jl#L91-L102">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.CrossCor" href="#Flux.CrossCor"><code>Flux.CrossCor</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">CrossCor(size, in=&gt;out)
+Conv((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, batch size). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/conv.jl#L10-L30">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MaxPool" href="#Flux.MaxPool"><code>Flux.MaxPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MaxPool(k)</code></pre><p>Max pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/conv.jl#L307-L313">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MeanPool" href="#Flux.MeanPool"><code>Flux.MeanPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MeanPool(k)</code></pre><p>Mean pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/conv.jl#L338-L344">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.DepthwiseConv" href="#Flux.DepthwiseConv"><code>Flux.DepthwiseConv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">DepthwiseConv(size, in=&gt;out)
+DepthwiseConv(size, in=&gt;out, relu)</code></pre><p>Depthwise convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively. Note that <code>out</code> must be an integer multiple of <code>in</code>.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/conv.jl#L166-L178">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.ConvTranspose" href="#Flux.ConvTranspose"><code>Flux.ConvTranspose</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ConvTranspose(size, in=&gt;out)
+ConvTranspose(size, in=&gt;out, relu)</code></pre><p>Standard convolutional transpose layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/conv.jl#L91-L102">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.CrossCor" href="#Flux.CrossCor"><code>Flux.CrossCor</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">CrossCor(size, in=&gt;out)
 CrossCor(size, in=&gt;out, relu)</code></pre><p>Standard cross convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Example: Applying CrossCor layer to a 1-channel input using a 2x2 window size,          giving us a 16-channel output. Output is activated with ReLU.</p><pre><code class="language-none">size = (2,2)
 in = 1
 out = 16
-CrossCor((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, # batches). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/conv.jl#L233-L253">source</a></section><h2><a class="nav-anchor" id="Recurrent-Layers-1" href="#Recurrent-Layers-1">Recurrent Layers</a></h2><p>Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.RNN" href="#Flux.RNN"><code>Flux.RNN</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">RNN(in::Integer, out::Integer, σ = tanh)</code></pre><p>The most basic recurrent layer; essentially acts as a <code>Dense</code> layer, but with the output fed back into the input each time step.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/recurrent.jl#L90-L95">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LSTM" href="#Flux.LSTM"><code>Flux.LSTM</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">LSTM(in::Integer, out::Integer)</code></pre><p>Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/recurrent.jl#L135-L143">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GRU" href="#Flux.GRU"><code>Flux.GRU</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">GRU(in::Integer, out::Integer)</code></pre><p>Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/recurrent.jl#L176-L184">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Recur" href="#Flux.Recur"><code>Flux.Recur</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Recur(cell)</code></pre><p><code>Recur</code> takes a recurrent cell and makes it stateful, managing the hidden state in the background. <code>cell</code> should be a model of the form:</p><pre><code class="language-none">h, y = cell(h, x...)</code></pre><p>For example, here&#39;s a recurrent network that keeps a running total of its inputs.</p><pre><code class="language-julia">accum(h, x) = (h+x, x)
+CrossCor((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, # batches). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/conv.jl#L233-L253">source</a></section><h2><a class="nav-anchor" id="Recurrent-Layers-1" href="#Recurrent-Layers-1">Recurrent Layers</a></h2><p>Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.RNN" href="#Flux.RNN"><code>Flux.RNN</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">RNN(in::Integer, out::Integer, σ = tanh)</code></pre><p>The most basic recurrent layer; essentially acts as a <code>Dense</code> layer, but with the output fed back into the input each time step.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/recurrent.jl#L90-L95">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LSTM" href="#Flux.LSTM"><code>Flux.LSTM</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">LSTM(in::Integer, out::Integer)</code></pre><p>Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/recurrent.jl#L135-L143">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GRU" href="#Flux.GRU"><code>Flux.GRU</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">GRU(in::Integer, out::Integer)</code></pre><p>Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/recurrent.jl#L176-L184">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Recur" href="#Flux.Recur"><code>Flux.Recur</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Recur(cell)</code></pre><p><code>Recur</code> takes a recurrent cell and makes it stateful, managing the hidden state in the background. <code>cell</code> should be a model of the form:</p><pre><code class="language-none">h, y = cell(h, x...)</code></pre><p>For example, here&#39;s a recurrent network that keeps a running total of its inputs.</p><pre><code class="language-julia">accum(h, x) = (h+x, x)
 rnn = Flux.Recur(accum, 0)
 rnn(2) # 2
 rnn(3) # 3
 rnn.state # 5
 rnn.(1:10) # apply to a sequence
-rnn.state # 60</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/recurrent.jl#L7-L26">source</a></section><h2><a class="nav-anchor" id="Other-General-Purpose-Layers-1" href="#Other-General-Purpose-Layers-1">Other General Purpose Layers</a></h2><p>These are marginally more obscure than the Basic Layers. But in contrast to the layers described in the other sections are not readily grouped around a particular purpose (e.g. CNNs or RNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Maxout" href="#Flux.Maxout"><code>Flux.Maxout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Maxout(over)</code></pre><p><code>Maxout</code> is a neural network layer, which has a number of internal layers, which all have the same input, and the maxout returns the elementwise maximium of the internal layers&#39; outputs.</p><p>Maxout over linear dense layers satisfies the univeral approximation theorem.</p><p>Reference: Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio.</p><ol><li>Maxout networks.</li></ol><p>In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML&#39;13), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. JMLR.org III-1319-III-1327. https://arxiv.org/pdf/1302.4389.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/basic.jl#L176-L191">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.SkipConnection" href="#Flux.SkipConnection"><code>Flux.SkipConnection</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">SkipConnection(layers, connection)</code></pre><p>Creates a Skip Connection, of a layer or <code>Chain</code> of consecutive layers plus a shortcut connection. The connection function will combine the result of the layers with the original input, to give the final output.</p><p>The simplest &#39;ResNet&#39;-type connection is just <code>SkipConnection(layer, +)</code>, and requires the output of the layers to be the same shape as the input. Here is a more complicated example:</p><pre><code class="language-none">m = Conv((3,3), 4=&gt;7, pad=(1,1))
+rnn.state # 60</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/recurrent.jl#L7-L26">source</a></section><h2><a class="nav-anchor" id="Other-General-Purpose-Layers-1" href="#Other-General-Purpose-Layers-1">Other General Purpose Layers</a></h2><p>These are marginally more obscure than the Basic Layers. But in contrast to the layers described in the other sections are not readily grouped around a particular purpose (e.g. CNNs or RNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Maxout" href="#Flux.Maxout"><code>Flux.Maxout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Maxout(over)</code></pre><p><code>Maxout</code> is a neural network layer, which has a number of internal layers, which all have the same input, and the maxout returns the elementwise maximium of the internal layers&#39; outputs.</p><p>Maxout over linear dense layers satisfies the univeral approximation theorem.</p><p>Reference: Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio.</p><ol><li>Maxout networks.</li></ol><p>In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML&#39;13), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. JMLR.org III-1319-III-1327. https://arxiv.org/pdf/1302.4389.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/basic.jl#L176-L191">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.SkipConnection" href="#Flux.SkipConnection"><code>Flux.SkipConnection</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">SkipConnection(layers, connection)</code></pre><p>Creates a Skip Connection, of a layer or <code>Chain</code> of consecutive layers plus a shortcut connection. The connection function will combine the result of the layers with the original input, to give the final output.</p><p>The simplest &#39;ResNet&#39;-type connection is just <code>SkipConnection(layer, +)</code>, and requires the output of the layers to be the same shape as the input. Here is a more complicated example:</p><pre><code class="language-none">m = Conv((3,3), 4=&gt;7, pad=(1,1))
 x = ones(5,5,4,10);
 size(m(x)) == (5, 5, 7, 10)
 
 sm = SkipConnection(m, (mx, x) -&gt; cat(mx, x, dims=3))
-size(sm(x)) == (5, 5, 11, 10)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/basic.jl#L225-L243">source</a></section><h2><a class="nav-anchor" id="Normalisation-and-Regularisation-1" href="#Normalisation-and-Regularisation-1">Normalisation &amp; Regularisation</a></h2><p>These layers don&#39;t affect the structure of the network but may improve training times or reduce overfitting.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.BatchNorm" href="#Flux.BatchNorm"><code>Flux.BatchNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">BatchNorm(channels::Integer, σ = identity;
+size(sm(x)) == (5, 5, 11, 10)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/basic.jl#L225-L243">source</a></section><h2><a class="nav-anchor" id="Normalisation-and-Regularisation-1" href="#Normalisation-and-Regularisation-1">Normalisation &amp; Regularisation</a></h2><p>These layers don&#39;t affect the structure of the network but may improve training times or reduce overfitting.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.BatchNorm" href="#Flux.BatchNorm"><code>Flux.BatchNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">BatchNorm(channels::Integer, σ = identity;
           initβ = zeros, initγ = ones,
           ϵ = 1e-8, momentum = .1)</code></pre><p>Batch Normalization layer. The <code>channels</code> input should be the size of the channel dimension in your data (see below).</p><p>Given an array with <code>N</code> dimensions, call the <code>N-1</code>th the channel dimension. (For a batch of feature vectors this is just the data dimension, for <code>WHCN</code> images it&#39;s the usual channel dimension.)</p><p><code>BatchNorm</code> computes the mean and variance for each each <code>W×H×1×N</code> slice and shifts them to have a new mean and variance (corresponding to the learnable, per-channel <code>bias</code> and <code>scale</code> parameters).</p><p>Use <a href="#Flux.testmode!"><code>testmode!</code></a> during inference.</p><p>See <a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a>.</p><p>Example:</p><pre><code class="language-julia">m = Chain(
   Dense(28^2, 64),
   BatchNorm(64, relu),
   Dense(64, 10),
   BatchNorm(10),
-  softmax)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/normalise.jl#L118-L148">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dropout" href="#Flux.Dropout"><code>Flux.Dropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dropout(p, dims = :)</code></pre><p>A Dropout layer. In the forward pass, applies the <a href="#Flux.dropout"><code>dropout</code></a> function on the input.</p><p>Does nothing to the input once <a href="#Flux.testmode!"><code>testmode!</code></a> is false.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/normalise.jl#L30-L36">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.dropout" href="#Flux.dropout"><code>Flux.dropout</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">dropout(p, dims = :)</code></pre><p>Dropout function. For each input, either sets that input to <code>0</code> (with probability <code>p</code>) or scales it by <code>1/(1-p)</code>. The <code>dims</code> argument is to specify the unbroadcasted dimensions, i.e. <code>dims=1</code> does dropout along columns and <code>dims=2</code> along rows. This is used as a regularisation, i.e. it reduces overfitting during training. </p><p>See also <a href="#Flux.Dropout"><code>Dropout</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/normalise.jl#L12-L21">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.AlphaDropout" href="#Flux.AlphaDropout"><code>Flux.AlphaDropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AlphaDropout(p)</code></pre><p>A dropout layer. It is used in Self-Normalizing Neural Networks. (https://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf) The AlphaDropout layer ensures that mean and variance of activations remains the same as before.</p><p>Does nothing to the input once <a href="#Flux.testmode!"><code>testmode!</code></a> is false.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/normalise.jl#L62-L70">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LayerNorm" href="#Flux.LayerNorm"><code>Flux.LayerNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">LayerNorm(h::Integer)</code></pre><p>A <a href="https://arxiv.org/pdf/1607.06450.pdf">normalisation layer</a> designed to be used with recurrent hidden states of size <code>h</code>. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/normalise.jl#L96-L102">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GroupNorm" href="#Flux.GroupNorm"><code>Flux.GroupNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>Group Normalization. This layer can outperform Batch-Normalization and Instance-Normalization.</p><pre><code class="language-none">GroupNorm(chs::Integer, G::Integer, λ = identity;
+  softmax)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/normalise.jl#L118-L148">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dropout" href="#Flux.Dropout"><code>Flux.Dropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dropout(p, dims = :)</code></pre><p>A Dropout layer. In the forward pass, applies the <a href="#Flux.dropout"><code>dropout</code></a> function on the input.</p><p>Does nothing to the input once <a href="#Flux.testmode!"><code>testmode!</code></a> is false.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/normalise.jl#L30-L36">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.dropout" href="#Flux.dropout"><code>Flux.dropout</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">dropout(p, dims = :)</code></pre><p>Dropout function. For each input, either sets that input to <code>0</code> (with probability <code>p</code>) or scales it by <code>1/(1-p)</code>. The <code>dims</code> argument is to specify the unbroadcasted dimensions, i.e. <code>dims=1</code> does dropout along columns and <code>dims=2</code> along rows. This is used as a regularisation, i.e. it reduces overfitting during training. </p><p>See also <a href="#Flux.Dropout"><code>Dropout</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/normalise.jl#L12-L21">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.AlphaDropout" href="#Flux.AlphaDropout"><code>Flux.AlphaDropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AlphaDropout(p)</code></pre><p>A dropout layer. It is used in Self-Normalizing Neural Networks. (https://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf) The AlphaDropout layer ensures that mean and variance of activations remains the same as before.</p><p>Does nothing to the input once <a href="#Flux.testmode!"><code>testmode!</code></a> is false.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/normalise.jl#L62-L70">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LayerNorm" href="#Flux.LayerNorm"><code>Flux.LayerNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">LayerNorm(h::Integer)</code></pre><p>A <a href="https://arxiv.org/pdf/1607.06450.pdf">normalisation layer</a> designed to be used with recurrent hidden states of size <code>h</code>. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/normalise.jl#L96-L102">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GroupNorm" href="#Flux.GroupNorm"><code>Flux.GroupNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>Group Normalization. This layer can outperform Batch-Normalization and Instance-Normalization.</p><pre><code class="language-none">GroupNorm(chs::Integer, G::Integer, λ = identity;
           initβ = (i) -&gt; zeros(Float32, i), initγ = (i) -&gt; ones(Float32, i),
           ϵ = 1f-5, momentum = 0.1f0)</code></pre><p><span>$chs$</span> is the number of channels, the channel dimension of your input. For an array of N dimensions, the (N-1)th index is the channel dimension.</p><p><span>$G$</span> is the number of groups along which the statistics would be computed. The number of channels must be an integer multiple of the number of groups.</p><p>Use <a href="#Flux.testmode!"><code>testmode!</code></a> during inference.</p><p>Example:</p><pre><code class="language-none">m = Chain(Conv((3,3), 1=&gt;32, leakyrelu;pad = 1),
-          GroupNorm(32,16)) # 32 channels, 16 groups (G = 16), thus 2 channels per group used</code></pre><p>Link : https://arxiv.org/pdf/1803.08494.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/normalise.jl#L309-L332">source</a></section><h3><a class="nav-anchor" id="Testmode-1" href="#Testmode-1">Testmode</a></h3><p>Many normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference. Still, depending on your use case, it may be helpful to manually specify when these layers should be treated as being trained or not. For this, Flux provides <code>testmode!</code>. When called on a model (e.g. a layer or chain of layers), this function will place the model into the mode specified.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.testmode!" href="#Flux.testmode!"><code>Flux.testmode!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">testmode!(m, mode = true)</code></pre><p>Set a layer or model&#39;s test mode (see below). Using <code>:auto</code> mode will treat any gradient computation as training.</p><p><em>Note</em>: if you manually set a model into test mode, you need to manually place it back into train mode during training phase.</p><p>Possible values include:</p><ul><li><code>false</code> for training</li><li><code>true</code> for testing</li><li><code>:auto</code> or <code>nothing</code> for Flux to detect the mode automatically</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/functor.jl#L42-L55">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.trainmode!" href="#Flux.trainmode!"><code>Flux.trainmode!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">trainmode!(m, mode = true)</code></pre><p>Set a layer of model&#39;s train mode (see below). Symmetric to <a href="#Flux.testmode!"><code>testmode!</code></a> (i.e. `trainmode!(m, mode) == testmode!(m, !mode)).</p><p><em>Note</em>: if you manually set a model into train mode, you need to manually place it into test mode during testing phase.</p><p>Possible values include:</p><ul><li><code>true</code> for training</li><li><code>false</code> for testing</li><li><code>:auto</code> or <code>nothing</code> for Flux to detect the mode automatically</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/functor.jl#L58-L71">source</a></section><h2><a class="nav-anchor" id="Cost-Functions-1" href="#Cost-Functions-1">Cost Functions</a></h2><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.mse" href="#Flux.mse"><code>Flux.mse</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">mse(ŷ, y)</code></pre><p>Return the mean squared error <code>sum((ŷ .- y).^2) / length(y)</code>. </p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L2-L6">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.crossentropy" href="#Flux.crossentropy"><code>Flux.crossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">crossentropy(ŷ, y; weight=1)</code></pre><p>Return the crossentropy computed as <code>-sum(y .* log.(ŷ) .* weight) / size(y, 2)</code>. </p><p>See also <a href="#Flux.logitcrossentropy"><code>logitcrossentropy</code></a>, <a href="#Flux.binarycrossentropy"><code>binarycrossentropy</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L22-L28">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.logitcrossentropy" href="#Flux.logitcrossentropy"><code>Flux.logitcrossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">logitcrossentropy(ŷ, y; weight=1)</code></pre><p>Return the crossentropy computed after a <a href="models/@ref">softmax</a> operation: </p><p>-sum(y .* logsoftmax(ŷ) .* weight) / size(y, 2)</p><p>See also <a href="#Flux.crossentropy"><code>crossentropy</code></a>, <a href="#Flux.binarycrossentropy"><code>binarycrossentropy</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L31-L39">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.binarycrossentropy" href="#Flux.binarycrossentropy"><code>Flux.binarycrossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">binarycrossentropy(ŷ, y; ϵ=eps(ŷ))</code></pre><p>Return <code>-y*log(ŷ + ϵ) - (1-y)*log(1-ŷ + ϵ)</code>. The ϵ term provides numerical stability.</p><p>Typically, the prediction <code>ŷ</code> is given by the output of a <a href="../nnlib/#NNlib.sigmoid"><code>sigmoid</code></a> activation.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L44-L50">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.logitbinarycrossentropy" href="#Flux.logitbinarycrossentropy"><code>Flux.logitbinarycrossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">logitbinarycrossentropy(ŷ, y)</code></pre><p><code>logitbinarycrossentropy(ŷ, y)</code> is mathematically equivalent to <code>binarycrossentropy(σ(ŷ), y)</code> but it is more numerically stable.</p><p>See also <a href="#Flux.binarycrossentropy"><code>binarycrossentropy</code></a>, <a href="../nnlib/#NNlib.sigmoid"><code>sigmoid</code></a>, <a href="../nnlib/#NNlib.logsigmoid"><code>logsigmoid</code></a>.  </p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L56-L63">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.kldivergence" href="#Flux.kldivergence"><code>Flux.kldivergence</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">kldivergence(ŷ, y)</code></pre><p>KLDivergence is a measure of how much one probability distribution is different from the other. It is always non-negative and zero only when both the distributions are equal everywhere. <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL Divergence</a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L100-L106">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.poisson" href="#Flux.poisson"><code>Flux.poisson</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">poisson(ŷ, y)</code></pre><p>Poisson loss function is a measure of how the predicted distribution diverges from the expected distribution. <a href="https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/poisson">Poisson Loss</a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L113-L118">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.hinge" href="#Flux.hinge"><code>Flux.hinge</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">hinge(ŷ, y)</code></pre><p>Measures the loss given the prediction <code>ŷ</code> and true labels <code>y</code> (containing 1 or -1).  <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge Loss</a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/layers/stateless.jl#L121-L126">source</a></section><footer><hr/><a class="previous" href="../regularisation/"><span class="direction">Previous</span><span class="title">Regularisation</span></a><a class="next" href="../nnlib/"><span class="direction">Next</span><span class="title">NNlib</span></a></footer></article></body></html>
+          GroupNorm(32,16)) # 32 channels, 16 groups (G = 16), thus 2 channels per group used</code></pre><p>Link : https://arxiv.org/pdf/1803.08494.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/normalise.jl#L309-L332">source</a></section><h3><a class="nav-anchor" id="Testmode-1" href="#Testmode-1">Testmode</a></h3><p>Many normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference. Still, depending on your use case, it may be helpful to manually specify when these layers should be treated as being trained or not. For this, Flux provides <code>testmode!</code>. When called on a model (e.g. a layer or chain of layers), this function will place the model into the mode specified.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.testmode!" href="#Flux.testmode!"><code>Flux.testmode!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">testmode!(m, mode = true)</code></pre><p>Set a layer or model&#39;s test mode (see below). Using <code>:auto</code> mode will treat any gradient computation as training.</p><p><em>Note</em>: if you manually set a model into test mode, you need to manually place it back into train mode during training phase.</p><p>Possible values include:</p><ul><li><code>false</code> for training</li><li><code>true</code> for testing</li><li><code>:auto</code> or <code>nothing</code> for Flux to detect the mode automatically</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/functor.jl#L42-L55">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.trainmode!" href="#Flux.trainmode!"><code>Flux.trainmode!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">trainmode!(m, mode = true)</code></pre><p>Set a layer of model&#39;s train mode (see below). Symmetric to <a href="#Flux.testmode!"><code>testmode!</code></a> (i.e. `trainmode!(m, mode) == testmode!(m, !mode)).</p><p><em>Note</em>: if you manually set a model into train mode, you need to manually place it into test mode during testing phase.</p><p>Possible values include:</p><ul><li><code>true</code> for training</li><li><code>false</code> for testing</li><li><code>:auto</code> or <code>nothing</code> for Flux to detect the mode automatically</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/functor.jl#L58-L71">source</a></section><h2><a class="nav-anchor" id="Cost-Functions-1" href="#Cost-Functions-1">Cost Functions</a></h2><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.mse" href="#Flux.mse"><code>Flux.mse</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">mse(ŷ, y)</code></pre><p>Return the mean squared error <code>sum((ŷ .- y).^2) / length(y)</code>. </p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L2-L6">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.crossentropy" href="#Flux.crossentropy"><code>Flux.crossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">crossentropy(ŷ, y; weight=1)</code></pre><p>Return the crossentropy computed as <code>-sum(y .* log.(ŷ) .* weight) / size(y, 2)</code>. </p><p>See also <a href="#Flux.logitcrossentropy"><code>logitcrossentropy</code></a>, <a href="#Flux.binarycrossentropy"><code>binarycrossentropy</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L22-L28">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.logitcrossentropy" href="#Flux.logitcrossentropy"><code>Flux.logitcrossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">logitcrossentropy(ŷ, y; weight=1)</code></pre><p>Return the crossentropy computed after a <a href="models/@ref">softmax</a> operation: </p><p>-sum(y .* logsoftmax(ŷ) .* weight) / size(y, 2)</p><p>See also <a href="#Flux.crossentropy"><code>crossentropy</code></a>, <a href="#Flux.binarycrossentropy"><code>binarycrossentropy</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L31-L39">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.binarycrossentropy" href="#Flux.binarycrossentropy"><code>Flux.binarycrossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">binarycrossentropy(ŷ, y; ϵ=eps(ŷ))</code></pre><p>Return <code>-y*log(ŷ + ϵ) - (1-y)*log(1-ŷ + ϵ)</code>. The ϵ term provides numerical stability.</p><p>Typically, the prediction <code>ŷ</code> is given by the output of a <a href="../nnlib/#NNlib.sigmoid"><code>sigmoid</code></a> activation.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L44-L50">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.logitbinarycrossentropy" href="#Flux.logitbinarycrossentropy"><code>Flux.logitbinarycrossentropy</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">logitbinarycrossentropy(ŷ, y)</code></pre><p><code>logitbinarycrossentropy(ŷ, y)</code> is mathematically equivalent to <code>binarycrossentropy(σ(ŷ), y)</code> but it is more numerically stable.</p><p>See also <a href="#Flux.binarycrossentropy"><code>binarycrossentropy</code></a>, <a href="../nnlib/#NNlib.sigmoid"><code>sigmoid</code></a>, <a href="../nnlib/#NNlib.logsigmoid"><code>logsigmoid</code></a>.  </p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L56-L63">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.kldivergence" href="#Flux.kldivergence"><code>Flux.kldivergence</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">kldivergence(ŷ, y)</code></pre><p>KLDivergence is a measure of how much one probability distribution is different from the other. It is always non-negative and zero only when both the distributions are equal everywhere. <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL Divergence</a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L100-L106">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.poisson" href="#Flux.poisson"><code>Flux.poisson</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">poisson(ŷ, y)</code></pre><p>Poisson loss function is a measure of how the predicted distribution diverges from the expected distribution. <a href="https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/poisson">Poisson Loss</a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L113-L118">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.hinge" href="#Flux.hinge"><code>Flux.hinge</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">hinge(ŷ, y)</code></pre><p>Measures the loss given the prediction <code>ŷ</code> and true labels <code>y</code> (containing 1 or -1).  <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge Loss</a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/layers/stateless.jl#L121-L126">source</a></section><footer><hr/><a class="previous" href="../regularisation/"><span class="direction">Previous</span><span class="title">Regularisation</span></a><a class="next" href="../nnlib/"><span class="direction">Next</span><span class="title">NNlib</span></a></footer></article></body></html>
diff --git a/dev/training/optimisers/index.html b/dev/training/optimisers/index.html
index 9affe30a..1649d9fd 100644
--- a/dev/training/optimisers/index.html
+++ b/dev/training/optimisers/index.html
@@ -28,7 +28,7 @@ end</code></pre><p>Running this will alter the parameters <code>W</code> and <co
 for p in (W, b)
   update!(opt, p, grads[p])
 end</code></pre><p>An optimiser <code>update!</code> accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass <code>opt</code> to our <a href="../training/">training loop</a>, which will update all parameters of the model in a loop. However, we can now easily replace <code>Descent</code> with a more advanced optimiser such as <code>ADAM</code>.</p><h2><a class="nav-anchor" id="Optimiser-Reference-1" href="#Optimiser-Reference-1">Optimiser Reference</a></h2><p>All optimisers return an object that, when passed to <code>train!</code>, will update the parameters passed to it.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.update!" href="#Flux.Optimise.update!"><code>Flux.Optimise.update!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">update!(opt, p, g)
-update!(opt, ps::Params, gs)</code></pre><p>Perform an update step of the parameters <code>ps</code> (or the single parameter <code>p</code>)  according to optimizer <code>opt</code>  and the gradients <code>gs</code> (the gradient <code>g</code>).</p><p>As a result, the parameters are mutated and the optimizer&#39;s internal state may change. </p><p>update!(x, x̄)</p><p>Update the array <code>x</code> according to <code>x .-= x̄</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/train.jl#L5-L17">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Descent" href="#Flux.Optimise.Descent"><code>Flux.Optimise.Descent</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Descent(η)</code></pre><p>Classic gradient descent optimiser with learning rate <code>η</code>. For each parameter <code>p</code> and its gradient <code>δp</code>, this runs <code>p -= η*δp</code></p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): The amount by which the gradients are discounted before updating the weights. Defaults to <code>0.1</code>.</li></ul><p><strong>Example</strong></p><pre><code class="language-julia-repl">opt = Descent() # uses default η (0.1)
+update!(opt, ps::Params, gs)</code></pre><p>Perform an update step of the parameters <code>ps</code> (or the single parameter <code>p</code>)  according to optimizer <code>opt</code>  and the gradients <code>gs</code> (the gradient <code>g</code>).</p><p>As a result, the parameters are mutated and the optimizer&#39;s internal state may change. </p><p>update!(x, x̄)</p><p>Update the array <code>x</code> according to <code>x .-= x̄</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/train.jl#L5-L17">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Descent" href="#Flux.Optimise.Descent"><code>Flux.Optimise.Descent</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Descent(η)</code></pre><p>Classic gradient descent optimiser with learning rate <code>η</code>. For each parameter <code>p</code> and its gradient <code>δp</code>, this runs <code>p -= η*δp</code></p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): The amount by which the gradients are discounted before updating the weights. Defaults to <code>0.1</code>.</li></ul><p><strong>Example</strong></p><pre><code class="language-julia-repl">opt = Descent() # uses default η (0.1)
 
 opt = Descent(0.3) # use provided η
 
@@ -38,23 +38,23 @@ gs = gradient(ps) do
   loss(x, y)
 end
 
-Flux.Optimise.update!(opt, ps, gs)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L8-L31">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Momentum" href="#Flux.Optimise.Momentum"><code>Flux.Optimise.Momentum</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Momentum(η, ρ)</code></pre><p>Gradient descent with learning rate <code>η</code> and momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (<code>η</code>): Amount by which gradients are discounted before updating the weights. Defaults to <code>0.01</code>.</li><li>Momentum (<code>ρ</code>): Parameter that accelerates descent in the relevant direction and dampens oscillations. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Momentum() # uses defaults of η = 0.01 and ρ = 0.9
+Flux.Optimise.update!(opt, ps, gs)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L8-L31">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Momentum" href="#Flux.Optimise.Momentum"><code>Flux.Optimise.Momentum</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Momentum(η, ρ)</code></pre><p>Gradient descent with learning rate <code>η</code> and momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (<code>η</code>): Amount by which gradients are discounted before updating the weights. Defaults to <code>0.01</code>.</li><li>Momentum (<code>ρ</code>): Parameter that accelerates descent in the relevant direction and dampens oscillations. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Momentum() # uses defaults of η = 0.01 and ρ = 0.9
 
-opt = Momentum(0.01, 0.99)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L42-L57">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Nesterov" href="#Flux.Optimise.Nesterov"><code>Flux.Optimise.Nesterov</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Nesterov(η, ρ)</code></pre><p>Gradient descent with learning rate  <code>η</code> and Nesterov momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Amount by which the gradients are dicsounted berfore updating the weights. Defaults to <code>0.001</code>.</li><li>Nesterov Momentum (ρ): Parameters controlling the amount of nesterov momentum to be applied. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Nesterov() # uses defaults η = 0.001 and ρ = 0.9
+opt = Momentum(0.01, 0.99)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L42-L57">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Nesterov" href="#Flux.Optimise.Nesterov"><code>Flux.Optimise.Nesterov</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Nesterov(η, ρ)</code></pre><p>Gradient descent with learning rate  <code>η</code> and Nesterov momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Amount by which the gradients are dicsounted berfore updating the weights. Defaults to <code>0.001</code>.</li><li>Nesterov Momentum (ρ): Parameters controlling the amount of nesterov momentum to be applied. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Nesterov() # uses defaults η = 0.001 and ρ = 0.9
 
-opt = Nesterov(0.003, 0.95)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L73-L88">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.RMSProp" href="#Flux.Optimise.RMSProp"><code>Flux.Optimise.RMSProp</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">RMSProp(η, ρ)</code></pre><p>Implements the RMSProp algortihm. Often a good choice for recurrent networks. Parameters other than learning rate generally don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Rho (ρ): Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = RMSProp() # uses default η = 0.001 and ρ = 0.9
+opt = Nesterov(0.003, 0.95)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L73-L88">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.RMSProp" href="#Flux.Optimise.RMSProp"><code>Flux.Optimise.RMSProp</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">RMSProp(η, ρ)</code></pre><p>Implements the RMSProp algortihm. Often a good choice for recurrent networks. Parameters other than learning rate generally don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Rho (ρ): Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = RMSProp() # uses default η = 0.001 and ρ = 0.9
 
-opt = RMSProp(0.002, 0.95)</code></pre><p><strong>References</strong></p><p><a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMSProp</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L105-L123">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAM" href="#Flux.Optimise.ADAM"><code>Flux.Optimise.ADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAM(η, β::Tuple)</code></pre><p>Implements the ADAM optimiser.</p><p><strong>Paramters</strong></p><ul><li>Learning Rate (<code>η</code>): Defaults to <code>0.001</code>.</li><li>Beta (<code>β::Tuple</code>): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAM() # uses the default η = 0.001 and β = (0.9, 0.999)
+opt = RMSProp(0.002, 0.95)</code></pre><p><strong>References</strong></p><p><a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMSProp</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L105-L123">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAM" href="#Flux.Optimise.ADAM"><code>Flux.Optimise.ADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAM(η, β::Tuple)</code></pre><p>Implements the ADAM optimiser.</p><p><strong>Paramters</strong></p><ul><li>Learning Rate (<code>η</code>): Defaults to <code>0.001</code>.</li><li>Beta (<code>β::Tuple</code>): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAM() # uses the default η = 0.001 and β = (0.9, 0.999)
 
-opt = ADAM(0.001, (0.9, 0.8))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v8">ADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L139-L157">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AdaMax" href="#Flux.Optimise.AdaMax"><code>Flux.Optimise.AdaMax</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AdaMax(η, β::Tuple)</code></pre><p>Variant of ADAM based on ∞-norm.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code></li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AdaMax() # uses default η and β
+opt = ADAM(0.001, (0.9, 0.8))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v8">ADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L139-L157">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AdaMax" href="#Flux.Optimise.AdaMax"><code>Flux.Optimise.AdaMax</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AdaMax(η, β::Tuple)</code></pre><p>Variant of ADAM based on ∞-norm.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code></li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AdaMax() # uses default η and β
 
-opt = AdaMax(0.001, (0.9, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v9">AdaMax</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L221-L238">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAGrad" href="#Flux.Optimise.ADAGrad"><code>Flux.Optimise.ADAGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAGrad(η)</code></pre><p>Implements AdaGrad. It has parameter specific learning rates based on how frequently it is updated.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.1</code></li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAGrad() # uses default η = 0.1
+opt = AdaMax(0.001, (0.9, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v9">AdaMax</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L221-L238">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAGrad" href="#Flux.Optimise.ADAGrad"><code>Flux.Optimise.ADAGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAGrad(η)</code></pre><p>Implements AdaGrad. It has parameter specific learning rates based on how frequently it is updated.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.1</code></li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAGrad() # uses default η = 0.1
 
-opt = ADAGrad(0.001)</code></pre><p><strong>References</strong></p><p><a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">ADAGrad</a> optimiser. Parameters don&#39;t need tuning.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L257-L275">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADADelta" href="#Flux.Optimise.ADADelta"><code>Flux.Optimise.ADADelta</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADADelta(ρ)</code></pre><p>Version of ADAGrad that adapts learning rate based on a window of past gradient updates. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Rho (ρ): Factor by which gradient is decayed at each time step. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADADelta() # uses default ρ = 0.9
-opt = ADADelta(0.89)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1212.5701">ADADelta</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L290-L306">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AMSGrad" href="#Flux.Optimise.AMSGrad"><code>Flux.Optimise.AMSGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AMSGrad(η, β::Tuple)</code></pre><p>Implements AMSGrad version of the ADAM optimiser. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AMSGrad() # uses default η and β
-opt = AMSGrad(0.001, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://openreview.net/forum?id=ryQu7f-RZ">AMSGrad</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L323-L340">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.NADAM" href="#Flux.Optimise.NADAM"><code>Flux.Optimise.NADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">NADAM(η, β::Tuple)</code></pre><p>Nesterov variant of ADAM. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = NADAM() # uses default η and β
-opt = NADAM(0.002, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="http://cs229.stanford.edu/proj2015/054_report.pdf">NADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L358-L375">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAMW" href="#Flux.Optimise.ADAMW"><code>Flux.Optimise.ADAMW</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">ADAMW(η, β::Tuple, decay)</code></pre><p>Variant of ADAM defined by fixing weight decay regularization.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to (0.9, 0.999).</li><li>decay: Decay applied to weights during optimisation. Defaults to 0.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAMW() # uses default η, β and decay
-opt = ADAMW(0.001, (0.89, 0.995), 0.1)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1711.05101">ADAMW</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L394-L412">source</a></section><h2><a class="nav-anchor" id="Optimiser-Interface-1" href="#Optimiser-Interface-1">Optimiser Interface</a></h2><p>Flux&#39;s optimisers are built around a <code>struct</code> that holds all the optimiser parameters along with a definition of how to apply the update rule associated with it. We do this via the <code>apply!</code> function which takes the optimiser as the first argument followed by the parameter and its corresponding gradient.</p><p>In this manner Flux also allows one to create custom optimisers to be used seamlessly. Let&#39;s work this with a simple example.</p><pre><code class="language-julia">mutable struct Momentum
+opt = ADAGrad(0.001)</code></pre><p><strong>References</strong></p><p><a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">ADAGrad</a> optimiser. Parameters don&#39;t need tuning.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L257-L275">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADADelta" href="#Flux.Optimise.ADADelta"><code>Flux.Optimise.ADADelta</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADADelta(ρ)</code></pre><p>Version of ADAGrad that adapts learning rate based on a window of past gradient updates. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Rho (ρ): Factor by which gradient is decayed at each time step. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADADelta() # uses default ρ = 0.9
+opt = ADADelta(0.89)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1212.5701">ADADelta</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L290-L306">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AMSGrad" href="#Flux.Optimise.AMSGrad"><code>Flux.Optimise.AMSGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AMSGrad(η, β::Tuple)</code></pre><p>Implements AMSGrad version of the ADAM optimiser. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AMSGrad() # uses default η and β
+opt = AMSGrad(0.001, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://openreview.net/forum?id=ryQu7f-RZ">AMSGrad</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L323-L340">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.NADAM" href="#Flux.Optimise.NADAM"><code>Flux.Optimise.NADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">NADAM(η, β::Tuple)</code></pre><p>Nesterov variant of ADAM. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = NADAM() # uses default η and β
+opt = NADAM(0.002, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="http://cs229.stanford.edu/proj2015/054_report.pdf">NADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L358-L375">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAMW" href="#Flux.Optimise.ADAMW"><code>Flux.Optimise.ADAMW</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">ADAMW(η, β::Tuple, decay)</code></pre><p>Variant of ADAM defined by fixing weight decay regularization.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to (0.9, 0.999).</li><li>decay: Decay applied to weights during optimisation. Defaults to 0.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAMW() # uses default η, β and decay
+opt = ADAMW(0.001, (0.89, 0.995), 0.1)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1711.05101">ADAMW</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L394-L412">source</a></section><h2><a class="nav-anchor" id="Optimiser-Interface-1" href="#Optimiser-Interface-1">Optimiser Interface</a></h2><p>Flux&#39;s optimisers are built around a <code>struct</code> that holds all the optimiser parameters along with a definition of how to apply the update rule associated with it. We do this via the <code>apply!</code> function which takes the optimiser as the first argument followed by the parameter and its corresponding gradient.</p><p>In this manner Flux also allows one to create custom optimisers to be used seamlessly. Let&#39;s work this with a simple example.</p><pre><code class="language-julia">mutable struct Momentum
   eta
   rho
   velocity
@@ -81,4 +81,4 @@ for t = 1:10^5
 end
 
 loss(rand(10)) # around 0.9</code></pre><p>In this manner it is possible to compose optimisers for some added flexibility.</p><h2><a class="nav-anchor" id="Decays-1" href="#Decays-1">Decays</a></h2><p>Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ExpDecay" href="#Flux.Optimise.ExpDecay"><code>Flux.Optimise.ExpDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ExpDecay(eta, decay, decay_step, clip)</code></pre><p>Discount the learning rate <code>eta</code> by a multiplicative factor <code>decay</code> every <code>decay_step</code> till a minimum of <code>clip</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (eta): Defaults to <code>0.001</code>.</li><li>decay: Factor by which the learning rate is discounted. Defaults to <code>0.1</code>.</li><li>decay_step: Schedules decay operations by setting number of steps between two decay operations. Defaults to <code>1000</code>.</li><li>clip: Minimum value of learning rate. Defaults to <code>1e-4</code>.</li></ul><p><strong>Example</strong></p><p>To apply exponential decay to an optimiser:</p><pre><code class="language-julia">Optimiser(ExpDecay(..), Opt(..))
-opt = Optimiser(ExpDecay(), ADAM())</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L471-L488">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.InvDecay" href="#Flux.Optimise.InvDecay"><code>Flux.Optimise.InvDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">InvDecay(γ)</code></pre><p>Applies inverse time decay to an optimiser, i.e., the effective step size at iteration <code>n</code> is <code>eta / (1 + γ * n)</code> where <code>eta</code> is the initial step size. The wrapped optimiser&#39;s step size is not modified.</p><p><strong>Parameters</strong></p><ul><li>gamma (γ): Defaults to <code>0.001</code></li></ul><p><strong>Example</strong></p><pre><code class="language-julia">Optimiser(InvDecay(..), Opt(..))</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L443-L455">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.WeightDecay" href="#Flux.Optimise.WeightDecay"><code>Flux.Optimise.WeightDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">WeightDecay(wd)</code></pre><p>Decays the weight by <code>wd</code></p><p><strong>Parameters</strong></p><ul><li>weight decay (wd): 0</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/optimisers.jl#L509-L516">source</a></section><footer><hr/><a class="previous" href="../../data/dataloader/"><span class="direction">Previous</span><span class="title">DataLoader</span></a><a class="next" href="../training/"><span class="direction">Next</span><span class="title">Training</span></a></footer></article></body></html>
+opt = Optimiser(ExpDecay(), ADAM())</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L471-L488">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.InvDecay" href="#Flux.Optimise.InvDecay"><code>Flux.Optimise.InvDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">InvDecay(γ)</code></pre><p>Applies inverse time decay to an optimiser, i.e., the effective step size at iteration <code>n</code> is <code>eta / (1 + γ * n)</code> where <code>eta</code> is the initial step size. The wrapped optimiser&#39;s step size is not modified.</p><p><strong>Parameters</strong></p><ul><li>gamma (γ): Defaults to <code>0.001</code></li></ul><p><strong>Example</strong></p><pre><code class="language-julia">Optimiser(InvDecay(..), Opt(..))</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L443-L455">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.WeightDecay" href="#Flux.Optimise.WeightDecay"><code>Flux.Optimise.WeightDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">WeightDecay(wd)</code></pre><p>Decays the weight by <code>wd</code></p><p><strong>Parameters</strong></p><ul><li>weight decay (wd): 0</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/optimisers.jl#L509-L516">source</a></section><footer><hr/><a class="previous" href="../../data/dataloader/"><span class="direction">Previous</span><span class="title">DataLoader</span></a><a class="next" href="../training/"><span class="direction">Next</span><span class="title">Training</span></a></footer></article></body></html>
diff --git a/dev/training/training/index.html b/dev/training/training/index.html
index cfff4b2d..876f7a05 100644
--- a/dev/training/training/index.html
+++ b/dev/training/training/index.html
@@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 
 ga('create', 'UA-36890222-9', 'auto');
 ga('send', 'pageview');
-</script><link href="https://cdnjs.cloudflare.com/ajax/libs/normalize/4.2.0/normalize.min.css" rel="stylesheet" type="text/css"/><link href="https://fonts.googleapis.com/css?family=Lato|Roboto+Mono" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL="../.."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.2.0/require.min.js" data-main="../../assets/documenter.js"></script><script src="../../siteinfo.js"></script><script src="../../../versions.js"></script><link href="../../assets/documenter.css" rel="stylesheet" type="text/css"/><link href="../../assets/flux.css" rel="stylesheet" type="text/css"/></head><body><nav class="toc"><h1>Flux</h1><select id="version-selector" onChange="window.location.href=this.value" style="visibility: hidden"></select><form class="search" id="search-form" action="../../search/"><input id="search-query" name="q" type="text" placeholder="Search docs"/></form><ul><li><a class="toctext" href="../../">Home</a></li><li><span class="toctext">Building Models</span><ul><li><a class="toctext" href="../../models/basics/">Basics</a></li><li><a class="toctext" href="../../models/recurrence/">Recurrence</a></li><li><a class="toctext" href="../../models/regularisation/">Regularisation</a></li><li><a class="toctext" href="../../models/layers/">Model Reference</a></li><li><a class="toctext" href="../../models/nnlib/">NNlib</a></li></ul></li><li><span class="toctext">Handling Data</span><ul><li><a class="toctext" href="../../data/onehot/">One-Hot Encoding</a></li><li><a class="toctext" href="../../data/dataloader/">DataLoader</a></li></ul></li><li><span class="toctext">Training Models</span><ul><li><a class="toctext" href="../optimisers/">Optimisers</a></li><li class="current"><a class="toctext" href>Training</a><ul class="internal"><li><a class="toctext" href="#Loss-Functions-1">Loss Functions</a></li><li><a class="toctext" href="#Model-parameters-1">Model parameters</a></li><li><a class="toctext" href="#Datasets-1">Datasets</a></li><li><a class="toctext" href="#Callbacks-1">Callbacks</a></li><li><a class="toctext" href="#Custom-Training-loops-1">Custom Training loops</a></li></ul></li></ul></li><li><a class="toctext" href="../../gpu/">GPU Support</a></li><li><a class="toctext" href="../../saving/">Saving &amp; Loading</a></li><li><a class="toctext" href="../../ecosystem/">The Julia Ecosystem</a></li><li><a class="toctext" href="../../performance/">Performance Tips</a></li><li><a class="toctext" href="../../community/">Community</a></li></ul></nav><article id="docs"><header><nav><ul><li>Training Models</li><li><a href>Training</a></li></ul><a class="edit-page" href="https://github.com/FluxML/Flux.jl/blob/master/docs/src/training/training.md"><span class="fa"></span> Edit on GitHub</a></nav><hr/><div id="topbar"><span>Training</span><a class="fa fa-bars" href="#"></a></div></header><h1><a class="nav-anchor" id="Training-1" href="#Training-1">Training</a></h1><p>To actually train a model we need four things:</p><ul><li>A <em>objective function</em>, that evaluates how well a model is doing given some input data.</li><li>The trainable parameters of the model.</li><li>A collection of data points that will be provided to the objective function.</li><li>An <a href="../optimisers/">optimiser</a> that will update the model parameters appropriately.</li></ul><p>With these we can call <code>train!</code>:</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.train!" href="#Flux.Optimise.train!"><code>Flux.Optimise.train!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">train!(loss, params, data, opt; cb)</code></pre><p>For each datapoint <code>d</code> in <code>data</code> computes the gradient of <code>loss(d...)</code> through backpropagation and calls the optimizer <code>opt</code>.</p><p>In case datapoints <code>d</code> are of numeric array type, assumes no splatting is needed  and computes the gradient of <code>loss(d)</code>.</p><p>Takes a callback as keyword argument <code>cb</code>. For example, this will print &quot;training&quot; every 10 seconds:</p><p>train!(loss, params, data, opt,          cb = throttle(() -&gt; println(&quot;training&quot;), 10))</p><p>The callback can call <code>Flux.stop()</code> to interrupt the training loop.</p><p>Multiple optimisers and callbacks can be passed to <code>opt</code> and <code>cb</code> as arrays.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddab979ea9d062acc9e8e700404fd51997581ada/src/optimise/train.jl#L58-L76">source</a></section><p>There are plenty of examples in the <a href="https://github.com/FluxML/model-zoo">model zoo</a>.</p><h2><a class="nav-anchor" id="Loss-Functions-1" href="#Loss-Functions-1">Loss Functions</a></h2><p>The objective function must return a number representing how far the model is from its target – the <em>loss</em> of the model. The <code>loss</code> function that we defined in <a href="../../models/basics/">basics</a> will work as an objective. We can also define an objective in terms of some model:</p><pre><code class="language-julia">m = Chain(
+</script><link href="https://cdnjs.cloudflare.com/ajax/libs/normalize/4.2.0/normalize.min.css" rel="stylesheet" type="text/css"/><link href="https://fonts.googleapis.com/css?family=Lato|Roboto+Mono" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL="../.."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.2.0/require.min.js" data-main="../../assets/documenter.js"></script><script src="../../siteinfo.js"></script><script src="../../../versions.js"></script><link href="../../assets/documenter.css" rel="stylesheet" type="text/css"/><link href="../../assets/flux.css" rel="stylesheet" type="text/css"/></head><body><nav class="toc"><h1>Flux</h1><select id="version-selector" onChange="window.location.href=this.value" style="visibility: hidden"></select><form class="search" id="search-form" action="../../search/"><input id="search-query" name="q" type="text" placeholder="Search docs"/></form><ul><li><a class="toctext" href="../../">Home</a></li><li><span class="toctext">Building Models</span><ul><li><a class="toctext" href="../../models/basics/">Basics</a></li><li><a class="toctext" href="../../models/recurrence/">Recurrence</a></li><li><a class="toctext" href="../../models/regularisation/">Regularisation</a></li><li><a class="toctext" href="../../models/layers/">Model Reference</a></li><li><a class="toctext" href="../../models/nnlib/">NNlib</a></li></ul></li><li><span class="toctext">Handling Data</span><ul><li><a class="toctext" href="../../data/onehot/">One-Hot Encoding</a></li><li><a class="toctext" href="../../data/dataloader/">DataLoader</a></li></ul></li><li><span class="toctext">Training Models</span><ul><li><a class="toctext" href="../optimisers/">Optimisers</a></li><li class="current"><a class="toctext" href>Training</a><ul class="internal"><li><a class="toctext" href="#Loss-Functions-1">Loss Functions</a></li><li><a class="toctext" href="#Model-parameters-1">Model parameters</a></li><li><a class="toctext" href="#Datasets-1">Datasets</a></li><li><a class="toctext" href="#Callbacks-1">Callbacks</a></li><li><a class="toctext" href="#Custom-Training-loops-1">Custom Training loops</a></li></ul></li></ul></li><li><a class="toctext" href="../../gpu/">GPU Support</a></li><li><a class="toctext" href="../../saving/">Saving &amp; Loading</a></li><li><a class="toctext" href="../../ecosystem/">The Julia Ecosystem</a></li><li><a class="toctext" href="../../performance/">Performance Tips</a></li><li><a class="toctext" href="../../community/">Community</a></li></ul></nav><article id="docs"><header><nav><ul><li>Training Models</li><li><a href>Training</a></li></ul><a class="edit-page" href="https://github.com/FluxML/Flux.jl/blob/master/docs/src/training/training.md"><span class="fa"></span> Edit on GitHub</a></nav><hr/><div id="topbar"><span>Training</span><a class="fa fa-bars" href="#"></a></div></header><h1><a class="nav-anchor" id="Training-1" href="#Training-1">Training</a></h1><p>To actually train a model we need four things:</p><ul><li>A <em>objective function</em>, that evaluates how well a model is doing given some input data.</li><li>The trainable parameters of the model.</li><li>A collection of data points that will be provided to the objective function.</li><li>An <a href="../optimisers/">optimiser</a> that will update the model parameters appropriately.</li></ul><p>With these we can call <code>train!</code>:</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.train!" href="#Flux.Optimise.train!"><code>Flux.Optimise.train!</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">train!(loss, params, data, opt; cb)</code></pre><p>For each datapoint <code>d</code> in <code>data</code> computes the gradient of <code>loss(d...)</code> through backpropagation and calls the optimizer <code>opt</code>.</p><p>In case datapoints <code>d</code> are of numeric array type, assumes no splatting is needed  and computes the gradient of <code>loss(d)</code>.</p><p>Takes a callback as keyword argument <code>cb</code>. For example, this will print &quot;training&quot; every 10 seconds:</p><p>train!(loss, params, data, opt,          cb = throttle(() -&gt; println(&quot;training&quot;), 10))</p><p>The callback can call <code>Flux.stop()</code> to interrupt the training loop.</p><p>Multiple optimisers and callbacks can be passed to <code>opt</code> and <code>cb</code> as arrays.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/df73b8b8fb3a99308057866f99565fb89366651b/src/optimise/train.jl#L58-L76">source</a></section><p>There are plenty of examples in the <a href="https://github.com/FluxML/model-zoo">model zoo</a>.</p><h2><a class="nav-anchor" id="Loss-Functions-1" href="#Loss-Functions-1">Loss Functions</a></h2><p>The objective function must return a number representing how far the model is from its target – the <em>loss</em> of the model. The <code>loss</code> function that we defined in <a href="../../models/basics/">basics</a> will work as an objective. We can also define an objective in terms of some model:</p><pre><code class="language-julia">m = Chain(
   Dense(784, 32, σ),
   Dense(32, 10), softmax)