build based on ddc2c20

2020-02-01 05:57:50 +00:00 · 2020-02-01 05:57:50 +00:00 · de31cb483a
commit de31cb483a
parent 68fa42e53f
4 changed files with 39 additions and 26 deletions
--- a/dev/models/layers/index.html
+++ b/dev/models/layers/index.html
@ -11,34 +11,34 @@ m(5) == 26
 m = Chain(Dense(10, 5), Dense(5, 2))
 x = rand(10)
-m(x) == m[2](m[1](x))</code></pre><p><code>Chain</code> also supports indexing and slicing, e.g. <code>m[2]</code> or <code>m[1:end-1]</code>. <code>m[1:3](x)</code> will calculate the output of the first three layers.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/basic.jl#L1-L18">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dense" href="#Flux.Dense"><code>Flux.Dense</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dense(in::Integer, out::Integer, σ = identity)</code></pre><p>Creates a traditional <code>Dense</code> layer with parameters <code>W</code> and <code>b</code>.</p><pre><code class="language-none">y = σ.(W * x .+ b)</code></pre><p>The input <code>x</code> must be a vector of length <code>in</code>, or a batch of vectors represented as an <code>in × N</code> matrix. The out <code>y</code> will be a vector or batch of length <code>out</code>.</p><pre><code class="language-julia">julia&gt; d = Dense(5, 2)
+m(x) == m[2](m[1](x))</code></pre><p><code>Chain</code> also supports indexing and slicing, e.g. <code>m[2]</code> or <code>m[1:end-1]</code>. <code>m[1:3](x)</code> will calculate the output of the first three layers.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/basic.jl#L1-L18">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dense" href="#Flux.Dense"><code>Flux.Dense</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dense(in::Integer, out::Integer, σ = identity)</code></pre><p>Creates a traditional <code>Dense</code> layer with parameters <code>W</code> and <code>b</code>.</p><pre><code class="language-none">y = σ.(W * x .+ b)</code></pre><p>The input <code>x</code> must be a vector of length <code>in</code>, or a batch of vectors represented as an <code>in × N</code> matrix. The out <code>y</code> will be a vector or batch of length <code>out</code>.</p><pre><code class="language-julia">julia&gt; d = Dense(5, 2)
 Dense(5, 2)
 julia&gt; d(rand(5))
 Tracked 2-element Array{Float64,1}:
  0.00257447
-  -0.00449443</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/basic.jl#L65-L84">source</a></section><h2><a class="nav-anchor" id="Convolution-and-Pooling-Layers-1" href="#Convolution-and-Pooling-Layers-1">Convolution and Pooling Layers</a></h2><p>These layers are used to build convolutional neural networks (CNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Conv" href="#Flux.Conv"><code>Flux.Conv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Conv(size, in=&gt;out)
+  -0.00449443</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/basic.jl#L65-L84">source</a></section><h2><a class="nav-anchor" id="Convolution-and-Pooling-Layers-1" href="#Convolution-and-Pooling-Layers-1">Convolution and Pooling Layers</a></h2><p>These layers are used to build convolutional neural networks (CNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Conv" href="#Flux.Conv"><code>Flux.Conv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Conv(size, in=&gt;out)
 Conv(size, in=&gt;out, relu)</code></pre><p>Standard convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Example: Applying Conv layer to a 1-channel input using a 2x2 window size,          giving us a 16-channel output. Output is activated with ReLU.</p><pre><code class="language-none">size = (2,2)
 in = 1
 out = 16
-Conv((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, # batches). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/conv.jl#L5-L25">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MaxPool" href="#Flux.MaxPool"><code>Flux.MaxPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MaxPool(k)</code></pre><p>Max pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/conv.jl#L278-L284">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MeanPool" href="#Flux.MeanPool"><code>Flux.MeanPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MeanPool(k)</code></pre><p>Mean pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/conv.jl#L307-L313">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.DepthwiseConv" href="#Flux.DepthwiseConv"><code>Flux.DepthwiseConv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">DepthwiseConv(size, in=&gt;out)
+Conv((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, # batches). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/conv.jl#L5-L25">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MaxPool" href="#Flux.MaxPool"><code>Flux.MaxPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MaxPool(k)</code></pre><p>Max pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/conv.jl#L278-L284">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.MeanPool" href="#Flux.MeanPool"><code>Flux.MeanPool</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">MeanPool(k)</code></pre><p>Mean pooling layer. <code>k</code> stands for the size of the window for each dimension of the input.</p><p>Takes the keyword arguments <code>pad</code> and <code>stride</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/conv.jl#L307-L313">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.DepthwiseConv" href="#Flux.DepthwiseConv"><code>Flux.DepthwiseConv</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">DepthwiseConv(size, in=&gt;out)
-DepthwiseConv(size, in=&gt;out, relu)</code></pre><p>Depthwise convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively. Note that <code>out</code> must be an integer multiple of <code>in</code>.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/conv.jl#L143-L155">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.ConvTranspose" href="#Flux.ConvTranspose"><code>Flux.ConvTranspose</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ConvTranspose(size, in=&gt;out)
+DepthwiseConv(size, in=&gt;out, relu)</code></pre><p>Depthwise convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively. Note that <code>out</code> must be an integer multiple of <code>in</code>.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/conv.jl#L143-L155">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.ConvTranspose" href="#Flux.ConvTranspose"><code>Flux.ConvTranspose</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ConvTranspose(size, in=&gt;out)
-ConvTranspose(size, in=&gt;out, relu)</code></pre><p>Standard convolutional transpose layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/conv.jl#L71-L82">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.CrossCor" href="#Flux.CrossCor"><code>Flux.CrossCor</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">CrossCor(size, in=&gt;out)
+ConvTranspose(size, in=&gt;out, relu)</code></pre><p>Standard convolutional transpose layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/conv.jl#L71-L82">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.CrossCor" href="#Flux.CrossCor"><code>Flux.CrossCor</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">CrossCor(size, in=&gt;out)
 CrossCor(size, in=&gt;out, relu)</code></pre><p>Standard cross convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Example: Applying CrossCor layer to a 1-channel input using a 2x2 window size,          giving us a 16-channel output. Output is activated with ReLU.</p><pre><code class="language-none">size = (2,2)
 in = 1
 out = 16
-CrossCor((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, # batches). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/conv.jl#L207-L227">source</a></section><h2><a class="nav-anchor" id="Recurrent-Layers-1" href="#Recurrent-Layers-1">Recurrent Layers</a></h2><p>Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.RNN" href="#Flux.RNN"><code>Flux.RNN</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">RNN(in::Integer, out::Integer, σ = tanh)</code></pre><p>The most basic recurrent layer; essentially acts as a <code>Dense</code> layer, but with the output fed back into the input each time step.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/recurrent.jl#L91-L96">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LSTM" href="#Flux.LSTM"><code>Flux.LSTM</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">LSTM(in::Integer, out::Integer)</code></pre><p>Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/recurrent.jl#L136-L144">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GRU" href="#Flux.GRU"><code>Flux.GRU</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">GRU(in::Integer, out::Integer)</code></pre><p>Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/recurrent.jl#L177-L185">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Recur" href="#Flux.Recur"><code>Flux.Recur</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Recur(cell)</code></pre><p><code>Recur</code> takes a recurrent cell and makes it stateful, managing the hidden state in the background. <code>cell</code> should be a model of the form:</p><pre><code class="language-none">h, y = cell(h, x...)</code></pre><p>For example, here&#39;s a recurrent network that keeps a running total of its inputs.</p><pre><code class="language-julia">accum(h, x) = (h+x, x)
+CrossCor((2, 2), 1=&gt;16, relu)</code></pre><p>Data should be stored in WHCN order (width, height, # channels, # batches). In other words, a 100×100 RGB image would be a <code>100×100×3×1</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/conv.jl#L207-L227">source</a></section><h2><a class="nav-anchor" id="Recurrent-Layers-1" href="#Recurrent-Layers-1">Recurrent Layers</a></h2><p>Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.RNN" href="#Flux.RNN"><code>Flux.RNN</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">RNN(in::Integer, out::Integer, σ = tanh)</code></pre><p>The most basic recurrent layer; essentially acts as a <code>Dense</code> layer, but with the output fed back into the input each time step.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/recurrent.jl#L91-L96">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LSTM" href="#Flux.LSTM"><code>Flux.LSTM</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">LSTM(in::Integer, out::Integer)</code></pre><p>Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/recurrent.jl#L136-L144">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GRU" href="#Flux.GRU"><code>Flux.GRU</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">GRU(in::Integer, out::Integer)</code></pre><p>Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/recurrent.jl#L177-L185">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Recur" href="#Flux.Recur"><code>Flux.Recur</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Recur(cell)</code></pre><p><code>Recur</code> takes a recurrent cell and makes it stateful, managing the hidden state in the background. <code>cell</code> should be a model of the form:</p><pre><code class="language-none">h, y = cell(h, x...)</code></pre><p>For example, here&#39;s a recurrent network that keeps a running total of its inputs.</p><pre><code class="language-julia">accum(h, x) = (h+x, x)
 rnn = Flux.Recur(accum, 0)
 rnn(2) # 2
 rnn(3) # 3
 rnn.state # 5
 rnn.(1:10) # apply to a sequence
-rnn.state # 60</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/recurrent.jl#L7-L26">source</a></section><h2><a class="nav-anchor" id="Other-General-Purpose-Layers-1" href="#Other-General-Purpose-Layers-1">Other General Purpose Layers</a></h2><p>These are marginally more obscure than the Basic Layers. But in contrast to the layers described in the other sections are not readily grouped around a particular purpose (e.g. CNNs or RNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Maxout" href="#Flux.Maxout"><code>Flux.Maxout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Maxout(over)</code></pre><p><code>Maxout</code> is a neural network layer, which has a number of internal layers, which all have the same input, and the maxout returns the elementwise maximium of the internal layers&#39; outputs.</p><p>Maxout over linear dense layers satisfies the univeral approximation theorem.</p><p>Reference: Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio.</p><ol><li>Maxout networks.</li></ol><p>In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML&#39;13), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. JMLR.org III-1319-III-1327. https://arxiv.org/pdf/1302.4389.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/basic.jl#L149-L164">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.SkipConnection" href="#Flux.SkipConnection"><code>Flux.SkipConnection</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">SkipConnection(layers, connection)</code></pre><p>Creates a Skip Connection, of a layer or <code>Chain</code> of consecutive layers plus a shortcut connection. The connection function will combine the result of the layers with the original input, to give the final output.</p><p>The simplest &#39;ResNet&#39;-type connection is just <code>SkipConnection(layer, +)</code>, and requires the output of the layers to be the same shape as the input. Here is a more complicated example:</p><pre><code class="language-none">m = Conv((3,3), 4=&gt;7, pad=(1,1))
+rnn.state # 60</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/recurrent.jl#L7-L26">source</a></section><h2><a class="nav-anchor" id="Other-General-Purpose-Layers-1" href="#Other-General-Purpose-Layers-1">Other General Purpose Layers</a></h2><p>These are marginally more obscure than the Basic Layers. But in contrast to the layers described in the other sections are not readily grouped around a particular purpose (e.g. CNNs or RNNs).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Maxout" href="#Flux.Maxout"><code>Flux.Maxout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Maxout(over)</code></pre><p><code>Maxout</code> is a neural network layer, which has a number of internal layers, which all have the same input, and the maxout returns the elementwise maximium of the internal layers&#39; outputs.</p><p>Maxout over linear dense layers satisfies the univeral approximation theorem.</p><p>Reference: Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio.</p><ol><li>Maxout networks.</li></ol><p>In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML&#39;13), Sanjoy Dasgupta and David McAllester (Eds.), Vol. 28. JMLR.org III-1319-III-1327. https://arxiv.org/pdf/1302.4389.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/basic.jl#L149-L164">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.SkipConnection" href="#Flux.SkipConnection"><code>Flux.SkipConnection</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">SkipConnection(layers, connection)</code></pre><p>Creates a Skip Connection, of a layer or <code>Chain</code> of consecutive layers plus a shortcut connection. The connection function will combine the result of the layers with the original input, to give the final output.</p><p>The simplest &#39;ResNet&#39;-type connection is just <code>SkipConnection(layer, +)</code>, and requires the output of the layers to be the same shape as the input. Here is a more complicated example:</p><pre><code class="language-none">m = Conv((3,3), 4=&gt;7, pad=(1,1))
 x = ones(5,5,4,10);
 size(m(x)) == (5, 5, 7, 10)
 sm = SkipConnection(m, (mx, x) -&gt; cat(mx, x, dims=3))
-size(sm(x)) == (5, 5, 11, 10)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/basic.jl#L196-L214">source</a></section><h2><a class="nav-anchor" id="Activation-Functions-1" href="#Activation-Functions-1">Activation Functions</a></h2><p>Non-linearities that go between layers of your model. Most of these functions are defined in <a href="https://github.com/FluxML/NNlib.jl">NNlib</a> but are available by default in Flux.</p><p>Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call <code>σ.(xs)</code>, <code>relu.(xs)</code> and so on.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.σ" href="#NNlib.σ"><code>NNlib.σ</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">σ(x) = 1 / (1 + exp(-x))</code></pre><p>Classic <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> activation function.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.relu" href="#NNlib.relu"><code>NNlib.relu</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">relu(x) = max(0, x)</code></pre><p><a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.leakyrelu" href="#NNlib.leakyrelu"><code>NNlib.leakyrelu</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">leakyrelu(x) = max(0.01x, x)</code></pre><p>Leaky <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function. You can also specify the coefficient explicitly, e.g. <code>leakyrelu(x, 0.01)</code>.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.elu" href="#NNlib.elu"><code>NNlib.elu</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">elu(x, α = 1) =
+size(sm(x)) == (5, 5, 11, 10)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/basic.jl#L196-L214">source</a></section><h2><a class="nav-anchor" id="Activation-Functions-1" href="#Activation-Functions-1">Activation Functions</a></h2><p>Non-linearities that go between layers of your model. Most of these functions are defined in <a href="https://github.com/FluxML/NNlib.jl">NNlib</a> but are available by default in Flux.</p><p>Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call <code>σ.(xs)</code>, <code>relu.(xs)</code> and so on.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.σ" href="#NNlib.σ"><code>NNlib.σ</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">σ(x) = 1 / (1 + exp(-x))</code></pre><p>Classic <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> activation function.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.relu" href="#NNlib.relu"><code>NNlib.relu</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">relu(x) = max(0, x)</code></pre><p><a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.leakyrelu" href="#NNlib.leakyrelu"><code>NNlib.leakyrelu</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">leakyrelu(x) = max(0.01x, x)</code></pre><p>Leaky <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function. You can also specify the coefficient explicitly, e.g. <code>leakyrelu(x, 0.01)</code>.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.elu" href="#NNlib.elu"><code>NNlib.elu</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">elu(x, α = 1) =
  x &gt; 0 ? x : α * (exp(x) - 1)</code></pre><p>Exponential Linear Unit activation function. See <a href="https://arxiv.org/abs/1511.07289">Fast and Accurate Deep Network Learning by Exponential Linear Units</a>. You can also specify the coefficient explicitly, e.g. <code>elu(x, 1)</code>.</p></div></div></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.swish" href="#NNlib.swish"><code>NNlib.swish</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">swish(x) = x * σ(x)</code></pre><p>Self-gated activation function. See <a href="https://arxiv.org/pdf/1710.05941.pdf">Swish: a Self-Gated Activation Function</a>.</p></div></div></section><h2><a class="nav-anchor" id="Normalisation-and-Regularisation-1" href="#Normalisation-and-Regularisation-1">Normalisation &amp; Regularisation</a></h2><p>These layers don&#39;t affect the structure of the network but may improve training times or reduce overfitting.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.BatchNorm" href="#Flux.BatchNorm"><code>Flux.BatchNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">BatchNorm(channels::Integer, σ = identity;
          initβ = zeros, initγ = ones,
          ϵ = 1e-8, momentum = .1)</code></pre><p>Batch Normalization layer. The <code>channels</code> input should be the size of the channel dimension in your data (see below).</p><p>Given an array with <code>N</code> dimensions, call the <code>N-1</code>th the channel dimension. (For a batch of feature vectors this is just the data dimension, for <code>WHCN</code> images it&#39;s the usual channel dimension.)</p><p><code>BatchNorm</code> computes the mean and variance for each each <code>W×H×1×N</code> slice and shifts them to have a new mean and variance (corresponding to the learnable, per-channel <code>bias</code> and <code>scale</code> parameters).</p><p>See <a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a>.</p><p>Example:</p><pre><code class="language-julia">m = Chain(
@ -46,7 +46,7 @@ size(sm(x)) == (5, 5, 11, 10)</code></pre></div></div><a class="source-link" tar
  BatchNorm(64, relu),
  Dense(64, 10),
  BatchNorm(10),
-  softmax)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/normalise.jl#L93-L121">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dropout" href="#Flux.Dropout"><code>Flux.Dropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dropout(p, dims = :)</code></pre><p>A Dropout layer. For each input, either sets that input to <code>0</code> (with probability <code>p</code>) or scales it by <code>1/(1-p)</code>. The <code>dims</code> argument is to specified the unbroadcasted  dimensions, i.e. <code>dims=1</code> does dropout along columns and <code>dims=2</code> along rows. This is  used as a regularisation, i.e. it reduces overfitting during training. see also <a href="models/@ref"><code>dropout</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/normalise.jl#L18-L25">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.AlphaDropout" href="#Flux.AlphaDropout"><code>Flux.AlphaDropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AlphaDropout(p)</code></pre><p>A dropout layer. It is used in Self-Normalizing Neural Networks. (https://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf) The AlphaDropout layer ensures that mean and variance of activations remains the same as before.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/normalise.jl#L44-L49">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LayerNorm" href="#Flux.LayerNorm"><code>Flux.LayerNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">LayerNorm(h::Integer)</code></pre><p>A <a href="https://arxiv.org/pdf/1607.06450.pdf">normalisation layer</a> designed to be used with recurrent hidden states of size <code>h</code>. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/normalise.jl#L71-L77">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GroupNorm" href="#Flux.GroupNorm"><code>Flux.GroupNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>Group Normalization. This layer can outperform Batch-Normalization and Instance-Normalization.</p><pre><code class="language-none">GroupNorm(chs::Integer, G::Integer, λ = identity;
+  softmax)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/normalise.jl#L93-L121">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dropout" href="#Flux.Dropout"><code>Flux.Dropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Dropout(p, dims = :)</code></pre><p>A Dropout layer. For each input, either sets that input to <code>0</code> (with probability <code>p</code>) or scales it by <code>1/(1-p)</code>. The <code>dims</code> argument is to specified the unbroadcasted  dimensions, i.e. <code>dims=1</code> does dropout along columns and <code>dims=2</code> along rows. This is  used as a regularisation, i.e. it reduces overfitting during training. see also <a href="models/@ref"><code>dropout</code></a>.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/normalise.jl#L18-L25">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.AlphaDropout" href="#Flux.AlphaDropout"><code>Flux.AlphaDropout</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AlphaDropout(p)</code></pre><p>A dropout layer. It is used in Self-Normalizing Neural Networks. (https://papers.nips.cc/paper/6698-self-normalizing-neural-networks.pdf) The AlphaDropout layer ensures that mean and variance of activations remains the same as before.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/normalise.jl#L44-L49">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LayerNorm" href="#Flux.LayerNorm"><code>Flux.LayerNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">LayerNorm(h::Integer)</code></pre><p>A <a href="https://arxiv.org/pdf/1607.06450.pdf">normalisation layer</a> designed to be used with recurrent hidden states of size <code>h</code>. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/normalise.jl#L71-L77">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GroupNorm" href="#Flux.GroupNorm"><code>Flux.GroupNorm</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>Group Normalization. This layer can outperform Batch-Normalization and Instance-Normalization.</p><pre><code class="language-none">GroupNorm(chs::Integer, G::Integer, λ = identity;
          initβ = (i) -&gt; zeros(Float32, i), initγ = (i) -&gt; ones(Float32, i),
          ϵ = 1f-5, momentum = 0.1f0)</code></pre><p><span>$chs$</span> is the number of channels, the channel dimension of your input. For an array of N dimensions, the (N-1)th index is the channel dimension.</p><p><span>$G$</span> is the number of groups along which the statistics would be computed. The number of channels must be an integer multiple of the number of groups.</p><p>Example:</p><pre><code class="language-none">m = Chain(Conv((3,3), 1=&gt;32, leakyrelu;pad = 1),
-          GroupNorm(32,16)) # 32 channels, 16 groups (G = 16), thus 2 channels per group used</code></pre><p>Link : https://arxiv.org/pdf/1803.08494.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/layers/normalise.jl#L272-L293">source</a></section><h2><a class="nav-anchor" id="Cost-Functions-1" href="#Cost-Functions-1">Cost Functions</a></h2><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>mse</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>crossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>logitcrossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>binarycrossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>logitbinarycrossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>kldivergence</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>poisson</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>hinge</code>. Check Documenter&#39;s build log for details.</p></div></div><footer><hr/><a class="previous" href="../regularisation/"><span class="direction">Previous</span><span class="title">Regularisation</span></a><a class="next" href="../../training/optimisers/"><span class="direction">Next</span><span class="title">Optimisers</span></a></footer></article></body></html>
+          GroupNorm(32,16)) # 32 channels, 16 groups (G = 16), thus 2 channels per group used</code></pre><p>Link : https://arxiv.org/pdf/1803.08494.pdf</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/layers/normalise.jl#L272-L293">source</a></section><h2><a class="nav-anchor" id="Cost-Functions-1" href="#Cost-Functions-1">Cost Functions</a></h2><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>mse</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>crossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>logitcrossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>binarycrossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>logitbinarycrossentropy</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>kldivergence</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>poisson</code>. Check Documenter&#39;s build log for details.</p></div></div><div class="admonition warning"><div class="admonition-title">Missing docstring.</div><div class="admonition-text"><p>Missing docstring for <code>hinge</code>. Check Documenter&#39;s build log for details.</p></div></div><footer><hr/><a class="previous" href="../regularisation/"><span class="direction">Previous</span><span class="title">Regularisation</span></a><a class="next" href="../../training/optimisers/"><span class="direction">Next</span><span class="title">Optimisers</span></a></footer></article></body></html>
--- a/dev/search_index.js
+++ b/dev/search_index.js
--- a/dev/training/optimisers/index.html
+++ b/dev/training/optimisers/index.html
@ -37,23 +37,23 @@ gs = gradient(ps) do
  loss(x, y)
 end
-Flux.Optimise.update!(opt, ps, gs)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L9-L32">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Momentum" href="#Flux.Optimise.Momentum"><code>Flux.Optimise.Momentum</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Momentum(η, ρ)</code></pre><p>Gradient descent with learning rate <code>η</code> and momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (<code>η</code>): Amount by which gradients are discounted before updating the weights. Defaults to <code>0.01</code>.</li><li>Momentum (<code>ρ</code>): Parameter that accelerates descent in the relevant direction and dampens oscillations. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Momentum() # uses defaults of η = 0.01 and ρ = 0.9
+Flux.Optimise.update!(opt, ps, gs)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L9-L32">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Momentum" href="#Flux.Optimise.Momentum"><code>Flux.Optimise.Momentum</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Momentum(η, ρ)</code></pre><p>Gradient descent with learning rate <code>η</code> and momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (<code>η</code>): Amount by which gradients are discounted before updating the weights. Defaults to <code>0.01</code>.</li><li>Momentum (<code>ρ</code>): Parameter that accelerates descent in the relevant direction and dampens oscillations. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Momentum() # uses defaults of η = 0.01 and ρ = 0.9
-opt = Momentum(0.01, 0.99)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L43-L58">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Nesterov" href="#Flux.Optimise.Nesterov"><code>Flux.Optimise.Nesterov</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Nesterov(η, ρ)</code></pre><p>Gradient descent with learning rate  <code>η</code> and Nesterov momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Amount by which the gradients are dicsounted berfore updating the weights. Defaults to <code>0.001</code>.</li><li>Nesterov Momentum (ρ): Paramters controlling the amount of nesterov momentum to be applied. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Nesterov() # uses defaults η = 0.001 and ρ = 0.9
+opt = Momentum(0.01, 0.99)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L43-L58">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Nesterov" href="#Flux.Optimise.Nesterov"><code>Flux.Optimise.Nesterov</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">Nesterov(η, ρ)</code></pre><p>Gradient descent with learning rate  <code>η</code> and Nesterov momentum <code>ρ</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Amount by which the gradients are dicsounted berfore updating the weights. Defaults to <code>0.001</code>.</li><li>Nesterov Momentum (ρ): Paramters controlling the amount of nesterov momentum to be applied. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = Nesterov() # uses defaults η = 0.001 and ρ = 0.9
-opt = Nesterov(0.003, 0.95)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L74-L89">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.RMSProp" href="#Flux.Optimise.RMSProp"><code>Flux.Optimise.RMSProp</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">RMSProp(η, ρ)</code></pre><p>Implements the RMSProp algortihm. Often a good choice for recurrent networks. Paramters other than learning rate generally don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Rho (ρ): Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = RMSProp() # uses default η = 0.001 and ρ = 0.9
+opt = Nesterov(0.003, 0.95)</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L74-L89">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.RMSProp" href="#Flux.Optimise.RMSProp"><code>Flux.Optimise.RMSProp</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">RMSProp(η, ρ)</code></pre><p>Implements the RMSProp algortihm. Often a good choice for recurrent networks. Paramters other than learning rate generally don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Rho (ρ): Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = RMSProp() # uses default η = 0.001 and ρ = 0.9
-opt = RMSProp(0.002, 0.95)</code></pre><p><strong>References</strong></p><p><a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMSProp</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L106-L124">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAM" href="#Flux.Optimise.ADAM"><code>Flux.Optimise.ADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAM(η, β::Tuple)</code></pre><p>Implements the ADAM optimiser.</p><p><strong>Paramters</strong></p><ul><li>Learning Rate (<code>η</code>): Defaults to <code>0.001</code>.</li><li>Beta (<code>β::Tuple</code>): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAM() # uses the default η = 0.001 and β = (0.9, 0.999)
+opt = RMSProp(0.002, 0.95)</code></pre><p><strong>References</strong></p><p><a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMSProp</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L106-L124">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAM" href="#Flux.Optimise.ADAM"><code>Flux.Optimise.ADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAM(η, β::Tuple)</code></pre><p>Implements the ADAM optimiser.</p><p><strong>Paramters</strong></p><ul><li>Learning Rate (<code>η</code>): Defaults to <code>0.001</code>.</li><li>Beta (<code>β::Tuple</code>): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAM() # uses the default η = 0.001 and β = (0.9, 0.999)
-opt = ADAM(0.001, (0.9, 0.8))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v8">ADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L140-L158">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AdaMax" href="#Flux.Optimise.AdaMax"><code>Flux.Optimise.AdaMax</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AdaMax(η, β::Tuple)</code></pre><p>Variant of ADAM based on ∞-norm.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code></li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AdaMax() # uses default η and β
+opt = ADAM(0.001, (0.9, 0.8))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v8">ADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L140-L158">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AdaMax" href="#Flux.Optimise.AdaMax"><code>Flux.Optimise.AdaMax</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AdaMax(η, β::Tuple)</code></pre><p>Variant of ADAM based on ∞-norm.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code></li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AdaMax() # uses default η and β
-opt = AdaMax(0.001, (0.9, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v9">AdaMax</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L222-L239">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAGrad" href="#Flux.Optimise.ADAGrad"><code>Flux.Optimise.ADAGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAGrad(η)</code></pre><p>Implements AdaGrad. It has parameter specific learning rates based on how frequently it is updated.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.1</code></li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAGrad() # uses default η = 0.1
+opt = AdaMax(0.001, (0.9, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1412.6980v9">AdaMax</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L222-L239">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAGrad" href="#Flux.Optimise.ADAGrad"><code>Flux.Optimise.ADAGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADAGrad(η)</code></pre><p>Implements AdaGrad. It has parameter specific learning rates based on how frequently it is updated.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.1</code></li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAGrad() # uses default η = 0.1
-opt = ADAGrad(0.001)</code></pre><p><strong>References</strong></p><p><a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">ADAGrad</a> optimiser. Parameters don&#39;t need tuning.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L258-L276">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADADelta" href="#Flux.Optimise.ADADelta"><code>Flux.Optimise.ADADelta</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADADelta(ρ)</code></pre><p>Version of ADAGrad that adapts learning rate based on a window of past gradient updates. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Rho (ρ): Factor by which gradient is decayed at each time step. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADADelta() # uses default ρ = 0.9
+opt = ADAGrad(0.001)</code></pre><p><strong>References</strong></p><p><a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">ADAGrad</a> optimiser. Parameters don&#39;t need tuning.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L258-L276">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADADelta" href="#Flux.Optimise.ADADelta"><code>Flux.Optimise.ADADelta</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">ADADelta(ρ)</code></pre><p>Version of ADAGrad that adapts learning rate based on a window of past gradient updates. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Rho (ρ): Factor by which gradient is decayed at each time step. Defaults to <code>0.9</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADADelta() # uses default ρ = 0.9
-opt = ADADelta(0.89)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1212.5701">ADADelta</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L291-L307">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AMSGrad" href="#Flux.Optimise.AMSGrad"><code>Flux.Optimise.AMSGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AMSGrad(η, β::Tuple)</code></pre><p>Implements AMSGrad version of the ADAM optimiser. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AMSGrad() # uses default η and β
+opt = ADADelta(0.89)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1212.5701">ADADelta</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L291-L307">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.AMSGrad" href="#Flux.Optimise.AMSGrad"><code>Flux.Optimise.AMSGrad</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">AMSGrad(η, β::Tuple)</code></pre><p>Implements AMSGrad version of the ADAM optimiser. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = AMSGrad() # uses default η and β
-opt = AMSGrad(0.001, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://openreview.net/forum?id=ryQu7f-RZ">AMSGrad</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L324-L341">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.NADAM" href="#Flux.Optimise.NADAM"><code>Flux.Optimise.NADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">NADAM(η, β::Tuple)</code></pre><p>Nesterov variant of ADAM. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = NADAM() # uses default η and β
+opt = AMSGrad(0.001, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="https://openreview.net/forum?id=ryQu7f-RZ">AMSGrad</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L324-L341">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.NADAM" href="#Flux.Optimise.NADAM"><code>Flux.Optimise.NADAM</code></a> — <span class="docstring-category">Type</span>.</div><div><div><pre><code class="language-julia">NADAM(η, β::Tuple)</code></pre><p>Nesterov variant of ADAM. Parameters don&#39;t need tuning.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to <code>(0.9, 0.999)</code>.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = NADAM() # uses default η and β
-opt = NADAM(0.002, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="http://cs229.stanford.edu/proj2015/054_report.pdf">NADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L359-L376">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAMW" href="#Flux.Optimise.ADAMW"><code>Flux.Optimise.ADAMW</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">ADAMW(η, β::Tuple, decay)</code></pre><p>Variant of ADAM defined by fixing weight decay regularization.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to (0.9, 0.999).</li><li>decay: Decay applied to weights during optimisation. Defaults to 0.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAMW() # uses default η, β and decay
+opt = NADAM(0.002, (0.89, 0.995))</code></pre><p><strong>References</strong></p><p><a href="http://cs229.stanford.edu/proj2015/054_report.pdf">NADAM</a> optimiser.</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L359-L376">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAMW" href="#Flux.Optimise.ADAMW"><code>Flux.Optimise.ADAMW</code></a> — <span class="docstring-category">Function</span>.</div><div><div><pre><code class="language-julia">ADAMW(η, β::Tuple, decay)</code></pre><p>Variant of ADAM defined by fixing weight decay regularization.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (η): Defaults to <code>0.001</code>.</li><li>Beta (β::Tuple): The first element refers to β1 and the second to β2. Defaults to (0.9, 0.999).</li><li>decay: Decay applied to weights during optimisation. Defaults to 0.</li></ul><p><strong>Examples</strong></p><pre><code class="language-julia">opt = ADAMW() # uses default η, β and decay
-opt = ADAMW(0.001, (0.89, 0.995), 0.1)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1711.05101">ADAMW</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L395-L413">source</a></section><h2><a class="nav-anchor" id="Optimiser-Interface-1" href="#Optimiser-Interface-1">Optimiser Interface</a></h2><p>Flux&#39;s optimsers are built around a <code>struct</code> that holds all the optimiser parameters along with a definition of how to apply the update rule associated with it. We do this via the <code>apply!</code> function which takes the optimiser as the first argument followed by the parameter and its corresponding gradient.</p><p>In this manner Flux also allows one to create custom optimisers to be used seamlessly. Let&#39;s work this with a simple example.</p><pre><code class="language-julia">mutable struct Momentum
+opt = ADAMW(0.001, (0.89, 0.995), 0.1)</code></pre><p><strong>References</strong></p><p><a href="https://arxiv.org/abs/1711.05101">ADAMW</a></p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L395-L413">source</a></section><h2><a class="nav-anchor" id="Optimiser-Interface-1" href="#Optimiser-Interface-1">Optimiser Interface</a></h2><p>Flux&#39;s optimsers are built around a <code>struct</code> that holds all the optimiser parameters along with a definition of how to apply the update rule associated with it. We do this via the <code>apply!</code> function which takes the optimiser as the first argument followed by the parameter and its corresponding gradient.</p><p>In this manner Flux also allows one to create custom optimisers to be used seamlessly. Let&#39;s work this with a simple example.</p><pre><code class="language-julia">mutable struct Momentum
  eta
  rho
  velocity
@ -81,8 +81,8 @@ end
 loss(rand(10)) # around 0.9</code></pre><p>In this manner it is possible to compose optimisers for some added flexibility.</p><h2><a class="nav-anchor" id="Decays-1" href="#Decays-1">Decays</a></h2><p>Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ExpDecay" href="#Flux.Optimise.ExpDecay"><code>Flux.Optimise.ExpDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>ExpDecay(eta, decay, decay_step, clip)</p><p>Discount the learning rate <code>eta</code> by a multiplicative factor <code>decay</code> every <code>decay_step</code> till a minimum of <code>clip</code>.</p><p><strong>Parameters</strong></p><ul><li>Learning Rate (eta): Defaults to <code>0.001</code>.</li><li>decay: Factor by which the learning rate is discounted. Defaults to <code>0.1</code>.</li><li>decay_step: Schedules decay operations by setting number of steps between two decay operations. Defaults to <code>1000</code>.</li><li>clip: Minimum value of learning rate. Defaults to <code>1e-4</code>.</li></ul><p><strong>Example</strong></p><p>To apply exponential decay to an optimiser:</p><pre><code class="language-julia">  Optimiser(ExpDecay(..), Opt(..))
-  opt = Optimiser(ExpDecay(), ADAM())</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L473-L491">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.InvDecay" href="#Flux.Optimise.InvDecay"><code>Flux.Optimise.InvDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>InvDecay(γ)</p><p>Applies inverse time decay to an optimiser, i.e., the effective step size at iteration <code>n</code> is <code>eta / (1 + γ * n)</code> where <code>eta</code> is the initial step size. The wrapped optimiser&#39;s step size is not modified.</p><pre><code class="language-none">
+  opt = Optimiser(ExpDecay(), ADAM())</code></pre></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L473-L491">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.InvDecay" href="#Flux.Optimise.InvDecay"><code>Flux.Optimise.InvDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>InvDecay(γ)</p><p>Applies inverse time decay to an optimiser, i.e., the effective step size at iteration <code>n</code> is <code>eta / (1 + γ * n)</code> where <code>eta</code> is the initial step size. The wrapped optimiser&#39;s step size is not modified.</p><pre><code class="language-none">
 ## Parameters
  - gamma (γ): Defaults to `0.001`
-## Example</code></pre><p>julia   Optimiser(InvDecay(..), Opt(..)) ```</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L444-L457">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.WeightDecay" href="#Flux.Optimise.WeightDecay"><code>Flux.Optimise.WeightDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>WeightDecay(wd)</p><p>Decays the weight by <code>wd</code></p><p><strong>Parameters</strong></p><ul><li>weight decay (wd): 0</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/620cffc45c8fbdbe4f274a395c48ce95d0045727/src/optimise/optimisers.jl#L512-L519">source</a></section><footer><hr/><a class="previous" href="../../models/layers/"><span class="direction">Previous</span><span class="title">Model Reference</span></a><a class="next" href="../training/"><span class="direction">Next</span><span class="title">Training</span></a></footer></article></body></html>
+## Example</code></pre><p>julia   Optimiser(InvDecay(..), Opt(..)) ```</p></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L444-L457">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.WeightDecay" href="#Flux.Optimise.WeightDecay"><code>Flux.Optimise.WeightDecay</code></a> — <span class="docstring-category">Type</span>.</div><div><div><p>WeightDecay(wd)</p><p>Decays the weight by <code>wd</code></p><p><strong>Parameters</strong></p><ul><li>weight decay (wd): 0</li></ul></div></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/ddc2c20e68919faa41a04ed39ed3cfa08d6d5189/src/optimise/optimisers.jl#L512-L519">source</a></section><footer><hr/><a class="previous" href="../../models/layers/"><span class="direction">Previous</span><span class="title">Model Reference</span></a><a class="next" href="../training/"><span class="direction">Next</span><span class="title">Training</span></a></footer></article></body></html>
--- a/dev/training/training/index.html
+++ b/dev/training/training/index.html
@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 ga('create', 'UA-36890222-9', 'auto');
 ga('send', 'pageview');
-</script><link href="https://cdnjs.cloudflare.com/ajax/libs/normalize/4.2.0/normalize.min.css" rel="stylesheet" type="text/css"/><link href="https://fonts.googleapis.com/css?family=Lato|Roboto+Mono" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL="../.."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.2.0/require.min.js" data-main="../../assets/documenter.js"></script><script src="../../siteinfo.js"></script><script src="../../../versions.js"></script><link href="../../assets/documenter.css" rel="stylesheet" type="text/css"/><link href="../../assets/flux.css" rel="stylesheet" type="text/css"/></head><body><nav class="toc"><h1>Flux</h1><select id="version-selector" onChange="window.location.href=this.value" style="visibility: hidden"></select><form class="search" id="search-form" action="../../search/"><input id="search-query" name="q" type="text" placeholder="Search docs"/></form><ul><li><a class="toctext" href="../../">Home</a></li><li><span class="toctext">Building Models</span><ul><li><a class="toctext" href="../../models/basics/">Basics</a></li><li><a class="toctext" href="../../models/recurrence/">Recurrence</a></li><li><a class="toctext" href="../../models/regularisation/">Regularisation</a></li><li><a class="toctext" href="../../models/layers/">Model Reference</a></li></ul></li><li><span class="toctext">Training Models</span><ul><li><a class="toctext" href="../optimisers/">Optimisers</a></li><li class="current"><a class="toctext" href>Training</a><ul class="internal"><li><a class="toctext" href="#Loss-Functions-1">Loss Functions</a></li><li><a class="toctext" href="#Model-parameters-1">Model parameters</a></li><li><a class="toctext" href="#Datasets-1">Datasets</a></li><li><a class="toctext" href="#Callbacks-1">Callbacks</a></li></ul></li></ul></li><li><a class="toctext" href="../../data/onehot/">One-Hot Encoding</a></li><li><a class="toctext" href="../../gpu/">GPU Support</a></li><li><a class="toctext" href="../../saving/">Saving &amp; Loading</a></li><li><a class="toctext" href="../../performance/">Performance Tips</a></li><li><a class="toctext" href="../../community/">Community</a></li></ul></nav><article id="docs"><header><nav><ul><li>Training Models</li><li><a href>Training</a></li></ul><a class="edit-page" href="https://github.com/FluxML/Flux.jl/blob/master/docs/src/training/training.md"><span class="fa"></span> Edit on GitHub</a></nav><hr/><div id="topbar"><span>Training</span><a class="fa fa-bars" href="#"></a></div></header><h1><a class="nav-anchor" id="Training-1" href="#Training-1">Training</a></h1><p>To actually train a model we need four things:</p><ul><li>A <em>objective function</em>, that evaluates how well a model is doing given some input data.</li><li>The trainable parameters of the model.</li><li>A collection of data points that will be provided to the objective function.</li><li>An <a href="../optimisers/">optimiser</a> that will update the model parameters appropriately.</li></ul><p>With these we can call <code>Flux.train!</code>:</p><pre><code class="language-julia">Flux.train!(objective, params, data, opt)</code></pre><p>There are plenty of examples in the <a href="https://github.com/FluxML/model-zoo">model zoo</a>.</p><h2><a class="nav-anchor" id="Loss-Functions-1" href="#Loss-Functions-1">Loss Functions</a></h2><p>The objective function must return a number representing how far the model is from its target – the <em>loss</em> of the model. The <code>loss</code> function that we defined in <a href="../../models/basics/">basics</a> will work as an objective. We can also define an objective in terms of some model:</p><pre><code class="language-julia">m = Chain(
+</script><link href="https://cdnjs.cloudflare.com/ajax/libs/normalize/4.2.0/normalize.min.css" rel="stylesheet" type="text/css"/><link href="https://fonts.googleapis.com/css?family=Lato|Roboto+Mono" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL="../.."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.2.0/require.min.js" data-main="../../assets/documenter.js"></script><script src="../../siteinfo.js"></script><script src="../../../versions.js"></script><link href="../../assets/documenter.css" rel="stylesheet" type="text/css"/><link href="../../assets/flux.css" rel="stylesheet" type="text/css"/></head><body><nav class="toc"><h1>Flux</h1><select id="version-selector" onChange="window.location.href=this.value" style="visibility: hidden"></select><form class="search" id="search-form" action="../../search/"><input id="search-query" name="q" type="text" placeholder="Search docs"/></form><ul><li><a class="toctext" href="../../">Home</a></li><li><span class="toctext">Building Models</span><ul><li><a class="toctext" href="../../models/basics/">Basics</a></li><li><a class="toctext" href="../../models/recurrence/">Recurrence</a></li><li><a class="toctext" href="../../models/regularisation/">Regularisation</a></li><li><a class="toctext" href="../../models/layers/">Model Reference</a></li></ul></li><li><span class="toctext">Training Models</span><ul><li><a class="toctext" href="../optimisers/">Optimisers</a></li><li class="current"><a class="toctext" href>Training</a><ul class="internal"><li><a class="toctext" href="#Loss-Functions-1">Loss Functions</a></li><li><a class="toctext" href="#Model-parameters-1">Model parameters</a></li><li><a class="toctext" href="#Datasets-1">Datasets</a></li><li><a class="toctext" href="#Callbacks-1">Callbacks</a></li><li><a class="toctext" href="#Custom-Training-loops-1">Custom Training loops</a></li></ul></li></ul></li><li><a class="toctext" href="../../data/onehot/">One-Hot Encoding</a></li><li><a class="toctext" href="../../gpu/">GPU Support</a></li><li><a class="toctext" href="../../saving/">Saving &amp; Loading</a></li><li><a class="toctext" href="../../performance/">Performance Tips</a></li><li><a class="toctext" href="../../community/">Community</a></li></ul></nav><article id="docs"><header><nav><ul><li>Training Models</li><li><a href>Training</a></li></ul><a class="edit-page" href="https://github.com/FluxML/Flux.jl/blob/master/docs/src/training/training.md"><span class="fa"></span> Edit on GitHub</a></nav><hr/><div id="topbar"><span>Training</span><a class="fa fa-bars" href="#"></a></div></header><h1><a class="nav-anchor" id="Training-1" href="#Training-1">Training</a></h1><p>To actually train a model we need four things:</p><ul><li>A <em>objective function</em>, that evaluates how well a model is doing given some input data.</li><li>The trainable parameters of the model.</li><li>A collection of data points that will be provided to the objective function.</li><li>An <a href="../optimisers/">optimiser</a> that will update the model parameters appropriately.</li></ul><p>With these we can call <code>Flux.train!</code>:</p><pre><code class="language-julia">Flux.train!(objective, params, data, opt)</code></pre><p>There are plenty of examples in the <a href="https://github.com/FluxML/model-zoo">model zoo</a>.</p><h2><a class="nav-anchor" id="Loss-Functions-1" href="#Loss-Functions-1">Loss Functions</a></h2><p>The objective function must return a number representing how far the model is from its target – the <em>loss</em> of the model. The <code>loss</code> function that we defined in <a href="../../models/basics/">basics</a> will work as an objective. We can also define an objective in terms of some model:</p><pre><code class="language-julia">m = Chain(
  Dense(784, 32, σ),
  Dense(32, 10), softmax)
@ -35,4 +35,17 @@ evalcb() = @show(loss(test_x, test_y))
 Flux.train!(objective, ps, data, opt,
            cb = throttle(evalcb, 5))</code></pre><p>Calling <code>Flux.stop()</code> in a callback will exit the training loop early.</p><pre><code class="language-julia">cb = function ()
  accuracy() &gt; 0.9 &amp;&amp; Flux.stop()
-end</code></pre><footer><hr/><a class="previous" href="../optimisers/"><span class="direction">Previous</span><span class="title">Optimisers</span></a><a class="next" href="../../data/onehot/"><span class="direction">Next</span><span class="title">One-Hot Encoding</span></a></footer></article></body></html>
+end</code></pre><h2><a class="nav-anchor" id="Custom-Training-loops-1" href="#Custom-Training-loops-1">Custom Training loops</a></h2><p>The <code>Flux.train!</code> function can be very convenient, especially for simple problems. Its also very flexible with the use of callbacks. But for some problems its much cleaner to write your own custom training loop. An example follows that works similar to the default <code>Flux.train</code> but with no callbacks. You don&#39;t need callbacks if you just code the calls to your functions directly into the loop. E.g. in the places marked with comments.</p><pre><code class="language-none">function my_custom_train!(loss, ps, data, opt)
  ps = Params(ps)
  for d in data
    gs = gradient(ps) do
      training_loss = loss(d...)
      # Insert what ever code you want here that needs Training loss, e.g. logging
      return training_loss
    end
    # insert what ever code you want here that needs gradient
    # E.g. logging with TensorBoardLogger.jl as histogram so you can see if it is becoming huge
    update!(opt, ps, gs)
    # Here you might like to check validation set accuracy, and break out to do early stopping
  end
 end</code></pre><p>You could simplify this further, for example by hard-coding in the loss function.</p><footer><hr/><a class="previous" href="../optimisers/"><span class="direction">Previous</span><span class="title">Optimisers</span></a><a class="next" href="../../data/onehot/"><span class="direction">Next</span><span class="title">One-Hot Encoding</span></a></footer></article></body></html>