From a01696375525ac8b70f7ec8de1ee64f5b08ac61a Mon Sep 17 00:00:00 2001 From: autodocs Date: Wed, 18 Oct 2017 11:26:33 +0000 Subject: [PATCH] build based on c4166fd --- latest/models/layers.html | 2 +- latest/search_index.js | 4 ++-- latest/training/optimisers.html | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/latest/models/layers.html b/latest/models/layers.html index e38f76b9..14abe6c8 100644 --- a/latest/models/layers.html +++ b/latest/models/layers.html @@ -11,4 +11,4 @@ m(5) == 26 m = Chain(Dense(10, 5), Dense(5, 2)) x = rand(10) -m(x) == m[2](m[1](x))

+m(x) == m[2](m[1](x))

Chain also supports indexing and slicing, e.g. m[2] or m[1:end-1]. m[1:3](x) will calculate the output of the first three layers.
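
For example, with the two-layer m above (a minimal sketch reusing the m and x defined in the example):

m[1:1]               # a Chain holding just the first Dense layer
m[1:1](x) == m[1](x) # calling the slice applies only that layer
m[1:2](x) == m(x)    # a slice of the whole chain behaves like m itself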

source
Flux.DenseType.
Dense(in::Integer, out::Integer, σ = identity)

Creates a traditional Dense layer with parameters W and b.

y = σ.(W * x .+ b)

The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The output y will be a vector or batch of length out.
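
For example (a small sketch; σ is the sigmoid activation exported by Flux):

d = Dense(10, 5, σ)  # W is a 5 × 10 matrix, b a length-5 vector
x = rand(10)         # a single input of length 10
d(x)                 # returns a vector of length 5
X = rand(10, 32)     # a batch of 32 inputs as a 10 × 32 matrix
d(X)                 # returns a 5 × 32 matrix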

source
diff --git a/latest/search_index.js b/latest/search_index.js index 3787838e..adf62e53 100644 --- a/latest/search_index.js +++ b/latest/search_index.js @@ -157,7 +157,7 @@ var documenterSearchIndex = {"docs": [ "page": "Optimisers", "title": "Optimisers", "category": "section", - "text": "Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.W = param(rand(2, 5))\nb = param(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\nback!(l)We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:using Flux.Tracker: data, grad\n\nfunction update()\n η = 0.1 # Learning Rate\n for p in (W, b)\n x, Δ = data(p), grad(p)\n x .-= η .* Δ # Apply the update\n Δ .= 0 # Clear the gradient\n end\nendIf we call update, the parameters W and b will change and our loss should go down.There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.m = Chain(\n Dense(10, 5, σ),\n Dense(5, 2), softmax)Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various optimisers that make it more convenient.opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1\n\nopt()An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data." + "text": "Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.W = param(rand(2, 5))\nb = param(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\nback!(l)We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:using Flux.Tracker: data, grad\n\nfunction update()\n η = 0.1 # Learning Rate\n for p in (W, b)\n x, Δ = data(p), grad(p)\n x .-= η .* Δ # Apply the update\n Δ .= 0 # Clear the gradient\n end\nendIf we call update, the parameters W and b will change and our loss should go down.There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. 
In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.m = Chain(\n Dense(10, 5, σ),\n Dense(5, 2), softmax)Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various optimisers that make it more convenient.opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1\n\nopt() # Carry out the update, modifying `W` and `b`.An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data." }, { @@ -221,7 +221,7 @@ var documenterSearchIndex = {"docs": [ "page": "Optimisers", "title": "Optimiser Reference", "category": "section", - "text": "SGD\nMomentum\nNesterov\nRMSProp\nADAM\nADAGrad\nADADelta" + "text": "All optimisers return a function that, when called, will update the parameters passed to it.SGD\nMomentum\nNesterov\nRMSProp\nADAM\nADAGrad\nADADelta" }, { diff --git a/latest/training/optimisers.html b/latest/training/optimisers.html index e8c25707..a0662e36 100644 --- a/latest/training/optimisers.html +++ b/latest/training/optimisers.html @@ -27,4 +27,4 @@ end

If we call update, the parameters W and b will change and our loss should go down.

There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.

In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.

m = Chain(
  Dense(10, 5, σ),
  Dense(5, 2), softmax)

Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.
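
For instance, with the three-layer m above (a small sketch; note that softmax has no parameters of its own):

ps = params(m)  # equivalent to [m[1].W, m[1].b, m[2].W, m[2].b]
length(ps)      # == 4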

For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various optimisers that make it more convenient.

opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1
 
-opt()

+opt() # Carry out the update, modifying `W` and `b`.

An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data.
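
Such a loop might look like this (a sketch, assuming data is an iterator of (x, y) mini-batches, with loss and opt defined as above):

for (x, y) in data
  l = loss(x, y)  # forward pass
  back!(l)        # accumulate gradients for W and b
  opt()           # apply the update and clear the gradients
end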

Optimiser Reference

All optimisers return a function that, when called, will update the parameters passed to it.
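
Every optimiser below is used the same way; for example (a sketch using ADAM with its default hyperparameters):

opt = ADAM(params(m))
opt()  # one update step for every parameter of m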

Flux.Optimise.SGDFunction.
SGD(params, η = 1; decay = 0)

Classic gradient descent optimiser. For each parameter p and its gradient δp, this runs p -= η*δp.

Supports learning rate decay if the decay argument is provided.
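
A usage sketch (the 0.001 decay value here is purely illustrative):

opt = SGD([W, b], 0.1, decay = 0.001)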

source
Flux.Optimise.MomentumFunction.
Momentum(params, ρ, decay = 0)

SGD with momentum ρ and optional learning rate decay.

source
Flux.Optimise.NesterovFunction.
Nesterov(params, ρ, decay = 0)

SGD with Nesterov momentum ρ and optional learning rate decay.

source
Flux.Optimise.RMSPropFunction.
RMSProp(params; η = 0.001, ρ = 0.9, ϵ = 1e-8, decay = 0)

RMSProp optimiser. Parameters other than learning rate don't need tuning. Often a good choice for recurrent networks.

source
Flux.Optimise.ADAMFunction.
ADAM(params; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-08, decay = 0)

ADAM optimiser.

source
Flux.Optimise.ADAGradFunction.
ADAGrad(params; η = 0.01, ϵ = 1e-8, decay = 0)

ADAGrad optimiser. Parameters don't need tuning.

source
Flux.Optimise.ADADeltaFunction.
ADADelta(params; η = 0.01, ρ = 0.95, ϵ = 1e-8, decay = 0)

ADADelta optimiser. Parameters don't need tuning.

source