diff --git a/latest/models/layers.html b/latest/models/layers.html
index 5b68f4ef..e38f76b9 100644
--- a/latest/models/layers.html
+++ b/latest/models/layers.html

m(5) == 26

m = Chain(Dense(10, 5), Dense(5, 2))
x = rand(10)
m(x) == m[2](m[1](x))

Chain also supports indexing and slicing, e.g. m[2] or m[1:end-1]. m[1:3](x) will calculate the output of the first three layers.
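
For instance, with a slightly deeper chain (a hypothetical model, purely for illustration), the sliced pieces behave like models in their own right:

m = Chain(Dense(10, 5, σ), Dense(5, 5, σ), Dense(5, 2), softmax)
x = rand(10)

m[2]           # the second Dense layer on its own
m[1:3](x)      # output of the first three layers (a length-2 vector)
m[1:end-1](x)  # everything except the final softmax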

source
Flux.Dense – Type.
Dense(in::Integer, out::Integer, σ = identity)

Creates a traditional Dense layer with parameters W and b.

y = σ.(W * x .+ b)

The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The output y will be a vector of length out, or a batch represented as an out × N matrix.
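
As a rough sketch of the shapes involved (the 64-sample batch below is just an illustration):

d = Dense(10, 5, σ)
d(rand(10))      # one input vector, the output has length 5 (out)
d(rand(10, 64))  # a batch of 64 inputs, the output is a 5 × 64 matrix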

source
diff --git a/latest/search_index.js b/latest/search_index.js
index 7814fceb..3787838e 100644

diff --git a/latest/training/optimisers.html b/latest/training/optimisers.html
index 949cb4f1..e8c25707 100644
--- a/latest/training/optimisers.html
+++ b/latest/training/optimisers.html

Optimisers

Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.

W = param(rand(2, 5))
b = param(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = rand(5), rand(2) # Dummy data
l = loss(x, y) # ~ 3
back!(l)

We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:

using Flux.Tracker: data, grad

function update()
  η = 0.1 # Learning Rate
  for p in (W, b)
    x, Δ = data(p), grad(p)
    x .-= η .* Δ # Apply the update
    Δ .= 0       # Clear the gradient
  end
end

If we call update, the parameters W and b will change and our loss should go down.

There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.

In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.

m = Chain(
  Dense(10, 5, σ),
  Dense(5, 2), softmax)

Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.
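
For instance, with the m defined above (a quick sketch; the exact ordering of the returned list is an implementation detail):

ps = params(m)  # collects m[1].W, m[1].b, m[2].W and m[2].b
for p in ps
  # update p exactly as in update() above
end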

For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various optimisers that make it more convenient.

opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1
 
opt()

An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data.
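
As a rough sketch of such a loop (data here is a hypothetical iterable of (x, y) mini-batches; Flux's train! utility wraps essentially this pattern):

for (x, y) in data
  l = loss(x, y)
  back!(l)  # accumulate gradients for W and b
  opt()     # apply the update and reset the gradients
end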

Optimiser Reference

Flux.Optimise.SGD – Function.
SGD(params, η = 1; decay = 0)

Classic gradient descent optimiser. For each parameter p and its gradient δp, this runs p -= η*δp.

Supports learning rate decay if the decay argument is provided.
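
A hypothetical usage sketch (the learning rate and decay values are arbitrary):

opt = SGD(params(m), 0.01; decay = 0.001)
opt()  # one descent step over every parameter of m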

source
Flux.Optimise.Momentum – Function.
Momentum(params, ρ, decay = 0)

SGD with momentum ρ and optional learning rate decay.
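
For reference, the classic momentum rule the name refers to is roughly the following (a textbook sketch with dummy values, not necessarily Flux's exact implementation):

η, ρ = 0.1, 0.9
p, v = rand(5), zeros(5)
δp = rand(5)            # stand-in for the gradient of p

v .= ρ .* v .- η .* δp  # accumulate a velocity from the gradient
p .+= v                 # move the parameter along the velocity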

source
Flux.Optimise.Nesterov – Function.
Nesterov(params, ρ, decay = 0)

SGD with Nesterov momentum ρ and optional learning rate decay.

source
Flux.Optimise.RMSProp – Function.
RMSProp(params; η = 0.001, ρ = 0.9, ϵ = 1e-8, decay = 0)

RMSProp optimiser. Parameters other than learning rate don't need tuning. Often a good choice for recurrent networks.

source
Flux.Optimise.ADAM – Function.
ADAM(params; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-08, decay = 0)

ADAM optimiser.

source
Flux.Optimise.ADAGrad – Function.
ADAGrad(params; η = 0.01, ϵ = 1e-8, decay = 0)

ADAGrad optimiser. Parameters don't need tuning.

source
Flux.Optimise.ADADelta – Function.
ADADelta(params; η = 0.01, ρ = 0.95, ϵ = 1e-8, decay = 0)

ADADelta optimiser. Parameters don't need tuning.

source