build based on 5d8b63d

This commit is contained in:
autodocs 2018-06-29 13:00:59 +00:00
parent 1901e26db1
commit c0a89b25cd
4 changed files with 19 additions and 18 deletions

View File

@ -20,12 +20,12 @@ b = param(b)
l = loss(x, y)
back!(l)</code></pre><p><code>loss(x, y)</code> returns the same number, but it&#39;s now a <em>tracked</em> value that records gradients as it goes along. Calling <code>back!</code> then accumulates the gradient of <code>W</code> and <code>b</code>. We can see what this gradient is, and modify <code>W</code> to train the model.</p><pre><code class="language-julia">W.grad
back!(l)</code></pre><p><code>loss(x, y)</code> returns the same number, but it&#39;s now a <em>tracked</em> value that records gradients as it goes along. Calling <code>back!</code> then accumulates the gradient of <code>W</code> and <code>b</code>. We can see what this gradient is, and modify <code>W</code> to train the model.</p><pre><code class="language-julia">using Flux.Tracker: grad, update!
# Update the parameter
W.data .-= 0.1(W.grad)
# Reset the gradient
W.grad .= 0
Δ = grad(W)
# Update the parameter and reset the gradient
update!(W, -0.1Δ)
loss(x, y) # ~ 2.5</code></pre><p>The loss has decreased a little, meaning that our prediction <code>x</code> is closer to the target <code>y</code>. If we have some data we can already try <a href="../training/training.html">training the model</a>.</p><p>All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can <em>look</em> very different they might have millions of parameters or complex control flow, and there are ways to manage this complexity. Let&#39;s see what that looks like.</p><h2><a class="nav-anchor" id="Building-Layers-1" href="#Building-Layers-1">Building Layers</a></h2><p>It&#39;s common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> (<code>σ</code>) in between them. In the above style we could write this as:</p><pre><code class="language-julia">W1 = param(rand(3, 5))
b1 = param(rand(3))

View File

@ -11,26 +11,26 @@ m(5) == 26
m = Chain(Dense(10, 5), Dense(5, 2))
x = rand(10)
m(x) == m[2](m[1](x))</code></pre><p><code>Chain</code> also supports indexing and slicing, e.g. <code>m[2]</code> or <code>m[1:end-1]</code>. <code>m[1:3](x)</code> will calculate the output of the first three layers.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/basic.jl#L1-L18">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dense" href="#Flux.Dense"><code>Flux.Dense</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Dense(in::Integer, out::Integer, σ = identity)</code></pre><p>Creates a traditional <code>Dense</code> layer with parameters <code>W</code> and <code>b</code>.</p><pre><code class="language-none">y = σ.(W * x .+ b)</code></pre><p>The input <code>x</code> must be a vector of length <code>in</code>, or a batch of vectors represented as an <code>in × N</code> matrix. The out <code>y</code> will be a vector or batch of length <code>out</code>.</p><pre><code class="language-julia">julia&gt; d = Dense(5, 2)
m(x) == m[2](m[1](x))</code></pre><p><code>Chain</code> also supports indexing and slicing, e.g. <code>m[2]</code> or <code>m[1:end-1]</code>. <code>m[1:3](x)</code> will calculate the output of the first three layers.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/basic.jl#L1-L18">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dense" href="#Flux.Dense"><code>Flux.Dense</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Dense(in::Integer, out::Integer, σ = identity)</code></pre><p>Creates a traditional <code>Dense</code> layer with parameters <code>W</code> and <code>b</code>.</p><pre><code class="language-none">y = σ.(W * x .+ b)</code></pre><p>The input <code>x</code> must be a vector of length <code>in</code>, or a batch of vectors represented as an <code>in × N</code> matrix. The out <code>y</code> will be a vector or batch of length <code>out</code>.</p><pre><code class="language-julia">julia&gt; d = Dense(5, 2)
Dense(5, 2)
julia&gt; d(rand(5))
Tracked 2-element Array{Float64,1}:
0.00257447
-0.00449443</code></pre></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/basic.jl#L46-L65">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Conv" href="#Flux.Conv"><code>Flux.Conv</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Conv(size, in=&gt;out)
Conv(size, in=&gt;out, relu)</code></pre><p>Standard convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/conv.jl#L8-L19">source</a></section><h2><a class="nav-anchor" id="Recurrent-Layers-1" href="#Recurrent-Layers-1">Recurrent Layers</a></h2><p>Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.RNN" href="#Flux.RNN"><code>Flux.RNN</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">RNN(in::Integer, out::Integer, σ = tanh)</code></pre><p>The most basic recurrent layer; essentially acts as a <code>Dense</code> layer, but with the output fed back into the input each time step.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/recurrent.jl#L105-L110">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LSTM" href="#Flux.LSTM"><code>Flux.LSTM</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">LSTM(in::Integer, out::Integer, σ = tanh)</code></pre><p>Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/recurrent.jl#L151-L159">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GRU" href="#Flux.GRU"><code>Flux.GRU</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">GRU(in::Integer, out::Integer, σ = tanh)</code></pre><p>Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/recurrent.jl#L192-L200">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Recur" href="#Flux.Recur"><code>Flux.Recur</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Recur(cell)</code></pre><p><code>Recur</code> takes a recurrent cell and makes it stateful, managing the hidden state in the background. <code>cell</code> should be a model of the form:</p><pre><code class="language-none">h, y = cell(h, x...)</code></pre><p>For example, here&#39;s a recurrent network that keeps a running total of its inputs.</p><pre><code class="language-julia">accum(h, x) = (h+x, x)
-0.00449443</code></pre></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/basic.jl#L46-L65">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Conv" href="#Flux.Conv"><code>Flux.Conv</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Conv(size, in=&gt;out)
Conv(size, in=&gt;out, relu)</code></pre><p>Standard convolutional layer. <code>size</code> should be a tuple like <code>(2, 2)</code>. <code>in</code> and <code>out</code> specify the number of input and output channels respectively.</p><p>Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a <code>100×100×3</code> array, and a batch of 50 would be a <code>100×100×3×50</code> array.</p><p>Takes the keyword arguments <code>pad</code>, <code>stride</code> and <code>dilation</code>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/conv.jl#L8-L19">source</a></section><h2><a class="nav-anchor" id="Recurrent-Layers-1" href="#Recurrent-Layers-1">Recurrent Layers</a></h2><p>Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.RNN" href="#Flux.RNN"><code>Flux.RNN</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">RNN(in::Integer, out::Integer, σ = tanh)</code></pre><p>The most basic recurrent layer; essentially acts as a <code>Dense</code> layer, but with the output fed back into the input each time step.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/recurrent.jl#L105-L110">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LSTM" href="#Flux.LSTM"><code>Flux.LSTM</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">LSTM(in::Integer, out::Integer, σ = tanh)</code></pre><p>Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/recurrent.jl#L151-L159">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.GRU" href="#Flux.GRU"><code>Flux.GRU</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">GRU(in::Integer, out::Integer, σ = tanh)</code></pre><p>Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.</p><p>See <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">this article</a> for a good overview of the internals.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/recurrent.jl#L192-L200">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Recur" href="#Flux.Recur"><code>Flux.Recur</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Recur(cell)</code></pre><p><code>Recur</code> takes a recurrent cell and makes it stateful, managing the hidden state in the background. <code>cell</code> should be a model of the form:</p><pre><code class="language-none">h, y = cell(h, x...)</code></pre><p>For example, here&#39;s a recurrent network that keeps a running total of its inputs.</p><pre><code class="language-julia">accum(h, x) = (h+x, x)
rnn = Flux.Recur(accum, 0)
rnn(2) # 2
rnn(3) # 3
rnn.state # 5
rnn.(1:10) # apply to a sequence
rnn.state # 60</code></pre></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/recurrent.jl#L7-L26">source</a></section><h2><a class="nav-anchor" id="Activation-Functions-1" href="#Activation-Functions-1">Activation Functions</a></h2><p>Non-linearities that go between layers of your model. Most of these functions are defined in <a href="https://github.com/FluxML/NNlib.jl">NNlib</a> but are available by default in Flux.</p><p>Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call <code>σ.(xs)</code>, <code>relu.(xs)</code> and so on.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.σ" href="#NNlib.σ"><code>NNlib.σ</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">σ(x) = 1 / (1 + exp(-x))</code></pre><p>Classic <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> activation function.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L1-L6">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.relu" href="#NNlib.relu"><code>NNlib.relu</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">relu(x) = max(0, x)</code></pre><p><a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L42-L47">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.leakyrelu" href="#NNlib.leakyrelu"><code>NNlib.leakyrelu</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">leakyrelu(x) = max(0.01x, x)</code></pre><p>Leaky <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function. You can also specify the coefficient explicitly, e.g. <code>leakyrelu(x, 0.01)</code>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L51-L57">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.elu" href="#NNlib.elu"><code>NNlib.elu</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">elu(x, α = 1) =
rnn.state # 60</code></pre></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/recurrent.jl#L7-L26">source</a></section><h2><a class="nav-anchor" id="Activation-Functions-1" href="#Activation-Functions-1">Activation Functions</a></h2><p>Non-linearities that go between layers of your model. Most of these functions are defined in <a href="https://github.com/FluxML/NNlib.jl">NNlib</a> but are available by default in Flux.</p><p>Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call <code>σ.(xs)</code>, <code>relu.(xs)</code> and so on.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.σ" href="#NNlib.σ"><code>NNlib.σ</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">σ(x) = 1 / (1 + exp(-x))</code></pre><p>Classic <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> activation function.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L1-L6">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.relu" href="#NNlib.relu"><code>NNlib.relu</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">relu(x) = max(0, x)</code></pre><p><a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L42-L47">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.leakyrelu" href="#NNlib.leakyrelu"><code>NNlib.leakyrelu</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">leakyrelu(x) = max(0.01x, x)</code></pre><p>Leaky <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">Rectified Linear Unit</a> activation function. You can also specify the coefficient explicitly, e.g. <code>leakyrelu(x, 0.01)</code>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L51-L57">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.elu" href="#NNlib.elu"><code>NNlib.elu</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">elu(x, α = 1) =
x &gt; 0 ? x : α * (exp(x) - 1)</code></pre><p>Exponential Linear Unit activation function. See <a href="https://arxiv.org/abs/1511.07289">Fast and Accurate Deep Network Learning by Exponential Linear Units</a>. You can also specify the coefficient explicitly, e.g. <code>elu(x, 1)</code>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L60-L67">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="NNlib.swish" href="#NNlib.swish"><code>NNlib.swish</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">swish(x) = x * σ(x)</code></pre><p>Self-gated actvation function. See <a href="https://arxiv.org/pdf/1710.05941.pdf">Swish: a Self-Gated Activation Function</a>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/NNlib.jl/blob/5b4c5e2bf228a56f92e2fc75069e9e5e79fa563d/src/activation.jl#L70-L75">source</a></section><h2><a class="nav-anchor" id="Normalisation-and-Regularisation-1" href="#Normalisation-and-Regularisation-1">Normalisation &amp; Regularisation</a></h2><p>These layers don&#39;t affect the structure of the network but may improve training times or reduce overfitting.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.testmode!" href="#Flux.testmode!"><code>Flux.testmode!</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">testmode!(m)
testmode!(m, false)</code></pre><p>Put layers like <a href="layers.html#Flux.Dropout"><code>Dropout</code></a> and <a href="layers.html#Flux.BatchNorm"><code>BatchNorm</code></a> into testing mode (or back to training mode with <code>false</code>).</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/normalise.jl#L1-L7">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.BatchNorm" href="#Flux.BatchNorm"><code>Flux.BatchNorm</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">BatchNorm(channels::Integer, σ = identity;
testmode!(m, false)</code></pre><p>Put layers like <a href="layers.html#Flux.Dropout"><code>Dropout</code></a> and <a href="layers.html#Flux.BatchNorm"><code>BatchNorm</code></a> into testing mode (or back to training mode with <code>false</code>).</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/normalise.jl#L1-L7">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.BatchNorm" href="#Flux.BatchNorm"><code>Flux.BatchNorm</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">BatchNorm(channels::Integer, σ = identity;
initβ = zeros, initγ = ones,
ϵ = 1e-8, momentum = .1)</code></pre><p>Batch Normalization layer. The <code>channels</code> input should be the size of the channel dimension in your data (see below).</p><p>Given an array with <code>N</code> dimensions, call the <code>N-1</code>th the channel dimension. (For a batch of feature vectors this is just the data dimension, for <code>WHCN</code> images it&#39;s the usual channel dimension.)</p><p><code>BatchNorm</code> computes the mean and variance for each each <code>W×H×1×N</code> slice and shifts them to have a new mean and variance (corresponding to the learnable, per-channel <code>bias</code> and <code>scale</code> parameters).</p><p>See <a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</a>.</p><p>Example:</p><pre><code class="language-julia">m = Chain(
Dense(28^2, 64),
BatchNorm(64, relu),
Dense(64, 10),
BatchNorm(10),
softmax)</code></pre></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/normalise.jl#L69-L98">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dropout" href="#Flux.Dropout"><code>Flux.Dropout</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Dropout(p)</code></pre><p>A Dropout layer. For each input, either sets that input to <code>0</code> (with probability <code>p</code>) or scales it by <code>1/(1-p)</code>. This is used as a regularisation, i.e. it reduces overfitting during training.</p><p>Does nothing to the input once in <a href="layers.html#Flux.testmode!"><code>testmode!</code></a>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/normalise.jl#L15-L23">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LayerNorm" href="#Flux.LayerNorm"><code>Flux.LayerNorm</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">LayerNorm(h::Integer)</code></pre><p>A <a href="https://arxiv.org/pdf/1607.06450.pdf">normalisation layer</a> designed to be used with recurrent hidden states of size <code>h</code>. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/layers/normalise.jl#L46-L53">source</a></section><footer><hr/><a class="previous" href="regularisation.html"><span class="direction">Previous</span><span class="title">Regularisation</span></a><a class="next" href="../training/optimisers.html"><span class="direction">Next</span><span class="title">Optimisers</span></a></footer></article></body></html>
softmax)</code></pre></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/normalise.jl#L69-L98">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Dropout" href="#Flux.Dropout"><code>Flux.Dropout</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">Dropout(p)</code></pre><p>A Dropout layer. For each input, either sets that input to <code>0</code> (with probability <code>p</code>) or scales it by <code>1/(1-p)</code>. This is used as a regularisation, i.e. it reduces overfitting during training.</p><p>Does nothing to the input once in <a href="layers.html#Flux.testmode!"><code>testmode!</code></a>.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/normalise.jl#L15-L23">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.LayerNorm" href="#Flux.LayerNorm"><code>Flux.LayerNorm</code></a><span class="docstring-category">Type</span>.</div><div><pre><code class="language-none">LayerNorm(h::Integer)</code></pre><p>A <a href="https://arxiv.org/pdf/1607.06450.pdf">normalisation layer</a> designed to be used with recurrent hidden states of size <code>h</code>. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/layers/normalise.jl#L46-L53">source</a></section><footer><hr/><a class="previous" href="regularisation.html"><span class="direction">Previous</span><span class="title">Regularisation</span></a><a class="next" href="../training/optimisers.html"><span class="direction">Next</span><span class="title">Optimisers</span></a></footer></article></body></html>

View File

@ -45,7 +45,7 @@ var documenterSearchIndex = {"docs": [
"page": "Basics",
"title": "Taking Gradients",
"category": "section",
"text": "Consider a simple linear regression, which tries to predict an output array y from an input x. (It\'s a good idea to follow this example in the Julia repl.)W = rand(2, 5)\nb = rand(2)\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nloss(x, y) # ~ 3To improve the prediction we can take the gradients of W and b with respect to the loss function and perform gradient descent. We could calculate gradients by hand, but Flux will do it for us if we tell it that W and b are trainable parameters.using Flux.Tracker\n\nW = param(W)\nb = param(b)\n\nl = loss(x, y)\n\nback!(l)loss(x, y) returns the same number, but it\'s now a tracked value that records gradients as it goes along. Calling back! then accumulates the gradient of W and b. We can see what this gradient is, and modify W to train the model.W.grad\n\n# Update the parameter\nW.data .-= 0.1(W.grad)\n# Reset the gradient\nW.grad .= 0\n\nloss(x, y) # ~ 2.5The loss has decreased a little, meaning that our prediction x is closer to the target y. If we have some data we can already try training the model.All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can look very different they might have millions of parameters or complex control flow, and there are ways to manage this complexity. Let\'s see what that looks like."
"text": "Consider a simple linear regression, which tries to predict an output array y from an input x. (It\'s a good idea to follow this example in the Julia repl.)W = rand(2, 5)\nb = rand(2)\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nloss(x, y) # ~ 3To improve the prediction we can take the gradients of W and b with respect to the loss function and perform gradient descent. We could calculate gradients by hand, but Flux will do it for us if we tell it that W and b are trainable parameters.using Flux.Tracker\n\nW = param(W)\nb = param(b)\n\nl = loss(x, y)\n\nback!(l)loss(x, y) returns the same number, but it\'s now a tracked value that records gradients as it goes along. Calling back! then accumulates the gradient of W and b. We can see what this gradient is, and modify W to train the model.using Flux.Tracker: grad, update!\n\nΔ = grad(W)\n\n# Update the parameter and reset the gradient\nupdate!(W, -0.1Δ)\n\nloss(x, y) # ~ 2.5The loss has decreased a little, meaning that our prediction x is closer to the target y. If we have some data we can already try training the model.All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can look very different they might have millions of parameters or complex control flow, and there are ways to manage this complexity. Let\'s see what that looks like."
},
{
@ -317,7 +317,7 @@ var documenterSearchIndex = {"docs": [
"page": "Optimisers",
"title": "Optimisers",
"category": "section",
"text": "Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.W = param(rand(2, 5))\nb = param(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\nback!(l)We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here\'s one way to do that:function update()\n η = 0.1 # Learning Rate\n for p in (W, b)\n p.data .-= η .* p.grad # Apply the update\n p.grad .= 0 # Clear the gradient\n end\nendIf we call update, the parameters W and b will change and our loss should go down.There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.In this case, getting the variables is trivial, but you can imagine it\'d be more of a pain with some complex stack of layers.m = Chain(\n Dense(10, 5, σ),\n Dense(5, 2), softmax)Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.For the update step, there\'s nothing whatsoever wrong with writing the loop above it\'ll work just fine but Flux provides various optimisers that make it more convenient.opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1\n\nopt() # Carry out the update, modifying `W` and `b`.An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data."
"text": "Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.W = param(rand(2, 5))\nb = param(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\nback!(l)We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here\'s one way to do that:using Flux.Tracker: grad, update!\n\nfunction sgd()\n η = 0.1 # Learning Rate\n for p in (W, b)\n update!(p, -η * grad(p))\n end\nendIf we call sgd, the parameters W and b will change and our loss should go down.There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.In this case, getting the variables is trivial, but you can imagine it\'d be more of a pain with some complex stack of layers.m = Chain(\n Dense(10, 5, σ),\n Dense(5, 2), softmax)Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.For the update step, there\'s nothing whatsoever wrong with writing the loop above it\'ll work just fine but Flux provides various optimisers that make it more convenient.opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1\n\nopt() # Carry out the update, modifying `W` and `b`.An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data."
},
{

View File

@ -14,14 +14,15 @@ loss(x, y) = sum((predict(x) .- y).^2)
x, y = rand(5), rand(2) # Dummy data
l = loss(x, y) # ~ 3
back!(l)</code></pre><p>We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here&#39;s one way to do that:</p><pre><code class="language-julia">function update()
back!(l)</code></pre><p>We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here&#39;s one way to do that:</p><pre><code class="language-julia">using Flux.Tracker: grad, update!
function sgd()
η = 0.1 # Learning Rate
for p in (W, b)
p.data .-= η .* p.grad # Apply the update
p.grad .= 0 # Clear the gradient
update!(p, -η * grad(p))
end
end</code></pre><p>If we call <code>update</code>, the parameters <code>W</code> and <code>b</code> will change and our loss should go down.</p><p>There are two pieces here: one is that we need a list of trainable parameters for the model (<code>[W, b]</code> in this case), and the other is the update step. In this case the update is simply gradient descent (<code>x .-= η .* Δ</code>), but we might choose to do something more advanced, like adding momentum.</p><p>In this case, getting the variables is trivial, but you can imagine it&#39;d be more of a pain with some complex stack of layers.</p><pre><code class="language-julia">m = Chain(
end</code></pre><p>If we call <code>sgd</code>, the parameters <code>W</code> and <code>b</code> will change and our loss should go down.</p><p>There are two pieces here: one is that we need a list of trainable parameters for the model (<code>[W, b]</code> in this case), and the other is the update step. In this case the update is simply gradient descent (<code>x .-= η .* Δ</code>), but we might choose to do something more advanced, like adding momentum.</p><p>In this case, getting the variables is trivial, but you can imagine it&#39;d be more of a pain with some complex stack of layers.</p><pre><code class="language-julia">m = Chain(
Dense(10, 5, σ),
Dense(5, 2), softmax)</code></pre><p>Instead of having to write <code>[m[1].W, m[1].b, ...]</code>, Flux provides a params function <code>params(m)</code> that returns a list of all parameters in the model for you.</p><p>For the update step, there&#39;s nothing whatsoever wrong with writing the loop above it&#39;ll work just fine but Flux provides various <em>optimisers</em> that make it more convenient.</p><pre><code class="language-julia">opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1
opt() # Carry out the update, modifying `W` and `b`.</code></pre><p>An optimiser takes a parameter list and returns a function that does the same thing as <code>update</code> above. We can pass either <code>opt</code> or <code>update</code> to our <a href="training.html">training loop</a>, which will then run the optimiser after every mini-batch of data.</p><h2><a class="nav-anchor" id="Optimiser-Reference-1" href="#Optimiser-Reference-1">Optimiser Reference</a></h2><p>All optimisers return a function that, when called, will update the parameters passed to it.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.SGD" href="#Flux.Optimise.SGD"><code>Flux.Optimise.SGD</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">SGD(params, η = 0.1; decay = 0)</code></pre><p>Classic gradient descent optimiser with learning rate <code>η</code>. For each parameter <code>p</code> and its gradient <code>δp</code>, this runs <code>p -= η*δp</code>.</p><p>Supports inverse decaying learning rate if the <code>decay</code> argument is provided.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/optimise/interface.jl#L14-L21">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Momentum" href="#Flux.Optimise.Momentum"><code>Flux.Optimise.Momentum</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">Momentum(params, η = 0.01; ρ = 0.9, decay = 0)</code></pre><p>SGD with learning rate <code>η</code>, momentum <code>ρ</code> and optional learning rate inverse decay.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/optimise/interface.jl#L25-L29">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Nesterov" href="#Flux.Optimise.Nesterov"><code>Flux.Optimise.Nesterov</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">Nesterov(params, η = 0.01; ρ = 0.9, decay = 0)</code></pre><p>SGD with learning rate <code>η</code>, Nesterov momentum <code>ρ</code> and optional learning rate inverse decay.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/optimise/interface.jl#L33-L37">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAM" href="#Flux.Optimise.ADAM"><code>Flux.Optimise.ADAM</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">ADAM(params, η = 0.001; β1 = 0.9, β2 = 0.999, ϵ = 1e-08, decay = 0)</code></pre><p><a href="https://arxiv.org/abs/1412.6980v8">ADAM</a> optimiser.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/bed6d2311e4c9778c8057249ccce9a636f334cc0/src/optimise/interface.jl#L51-L55">source</a></section><footer><hr/><a class="previous" href="../models/layers.html"><span class="direction">Previous</span><span class="title">Model Reference</span></a><a class="next" href="training.html"><span class="direction">Next</span><span class="title">Training</span></a></footer></article></body></html>
opt() # Carry out the update, modifying `W` and `b`.</code></pre><p>An optimiser takes a parameter list and returns a function that does the same thing as <code>update</code> above. We can pass either <code>opt</code> or <code>update</code> to our <a href="training.html">training loop</a>, which will then run the optimiser after every mini-batch of data.</p><h2><a class="nav-anchor" id="Optimiser-Reference-1" href="#Optimiser-Reference-1">Optimiser Reference</a></h2><p>All optimisers return a function that, when called, will update the parameters passed to it.</p><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.SGD" href="#Flux.Optimise.SGD"><code>Flux.Optimise.SGD</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">SGD(params, η = 0.1; decay = 0)</code></pre><p>Classic gradient descent optimiser with learning rate <code>η</code>. For each parameter <code>p</code> and its gradient <code>δp</code>, this runs <code>p -= η*δp</code>.</p><p>Supports inverse decaying learning rate if the <code>decay</code> argument is provided.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/optimise/interface.jl#L14-L21">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Momentum" href="#Flux.Optimise.Momentum"><code>Flux.Optimise.Momentum</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">Momentum(params, η = 0.01; ρ = 0.9, decay = 0)</code></pre><p>SGD with learning rate <code>η</code>, momentum <code>ρ</code> and optional learning rate inverse decay.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/optimise/interface.jl#L25-L29">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.Nesterov" href="#Flux.Optimise.Nesterov"><code>Flux.Optimise.Nesterov</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">Nesterov(params, η = 0.01; ρ = 0.9, decay = 0)</code></pre><p>SGD with learning rate <code>η</code>, Nesterov momentum <code>ρ</code> and optional learning rate inverse decay.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/optimise/interface.jl#L33-L37">source</a></section><section class="docstring"><div class="docstring-header"><a class="docstring-binding" id="Flux.Optimise.ADAM" href="#Flux.Optimise.ADAM"><code>Flux.Optimise.ADAM</code></a><span class="docstring-category">Function</span>.</div><div><pre><code class="language-none">ADAM(params, η = 0.001; β1 = 0.9, β2 = 0.999, ϵ = 1e-08, decay = 0)</code></pre><p><a href="https://arxiv.org/abs/1412.6980v8">ADAM</a> optimiser.</p></div><a class="source-link" target="_blank" href="https://github.com/FluxML/Flux.jl/blob/5d8b63dc65fe6bf72109e10217e51c72dbce75f7/src/optimise/interface.jl#L51-L55">source</a></section><footer><hr/><a class="previous" href="../models/layers.html"><span class="direction">Previous</span><span class="title">Model Reference</span></a><a class="next" href="training.html"><span class="direction">Next</span><span class="title">Training</span></a></footer></article></body></html>