Merge pull request #493 from dhairyagandhi96/master

[WIP] New Optimiser Docs

Commit: f0d5624ed2
|
@@ -23,41 +23,27 @@ We want to update each parameter, using the gradient, in order to improve (reduc
 ```julia
 using Flux.Tracker: grad, update!
 
-function sgd()
-  η = 0.1 # Learning Rate
-  for p in (W, b)
-    update!(p, -η * grads[p])
-  end
-end
+η = 0.1 # Learning Rate
+for p in (W, b)
+  update!(p, -η * grads[p])
+end
 ```
 
-If we call `sgd`, the parameters `W` and `b` will change and our loss should go down.
-
-There are two pieces here: one is that we need a list of trainable parameters for the model (`[W, b]` in this case), and the other is the update step. In this case the update is simply gradient descent (`x .-= η .* Δ`), but we might choose to do something more advanced, like adding momentum.
-
-In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.
+Running this will alter the parameters `W` and `b` and our loss should go down. Flux provides a more general way to do optimiser updates like this.
 
 ```julia
-m = Chain(
-  Dense(10, 5, σ),
-  Dense(5, 2), softmax)
+opt = Descent(0.1) # Gradient descent with learning rate 0.1
+
+for p in (W, b)
+  update!(opt, p, -η * grads[p])
+end
 ```
 
-Instead of having to write `[m[1].W, m[1].b, ...]`, Flux provides a params function `params(m)` that returns a list of all parameters in the model for you.
-
-For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various *optimisers* that make it more convenient.
-
-```julia
-opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1
-
-opt() # Carry out the update, modifying `W` and `b`.
-```
-
-An optimiser takes a parameter list and returns a function that does the same thing as `update` above. We can pass either `opt` or `update` to our [training loop](training.md), which will then run the optimiser after every mini-batch of data.
+An optimiser `update!` accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass `opt` to our [training loop](training.md), which will update all parameters of the model in a loop. However, we can now easily replace `Descent` with a more advanced optimiser such as `ADAM`.
 
 ## Optimiser Reference
 
-All optimisers return a function that, when called, will update the parameters passed to it.
+All optimisers return an object that, when passed to `train!`, will update the parameters passed to it.
 
 ```@docs
 SGD
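The update rule in the hunk above is plain gradient descent: each parameter moves a step of size η against its gradient. As a language-neutral sketch of that rule (Python here; the toy quadratic loss and all names are illustrative, not Flux's API):

```python
# Minimal gradient-descent sketch mirroring `update!(p, -η * grads[p])`.
# The quadratic objective below is a stand-in, not part of Flux.

def sgd_step(params, grads, eta=0.1):
    """Apply p <- p - eta * grad for every parameter."""
    return [p - eta * g for p, g in zip(params, grads)]

# Toy problem: minimise (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grads = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grads, eta=0.1)

print(round(w[0], 4))  # converges toward 3.0
```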
@@ -9,7 +9,7 @@ To actually train a model we need three things:
 With these we can call `Flux.train!`:
 
 ```julia
-Flux.train!(objective, data, opt)
+Flux.train!(objective, params, data, opt)
 ```
 
 There are plenty of examples in the [model zoo](https://github.com/FluxML/model-zoo).
@@ -24,9 +24,10 @@ m = Chain(
   Dense(32, 10), softmax)
 
 loss(x, y) = Flux.mse(m(x), y)
+ps = Flux.params(m)
 
 # later
-Flux.train!(loss, data, opt)
+Flux.train!(loss, ps, data, opt)
 ```
 
 The objective will almost always be defined in terms of some *cost function* that measures the distance of the prediction `m(x)` from the target `y`. Flux has several of these built in, like `mse` for mean squared error or `crossentropy` for cross entropy loss, but you can calculate it however you want.
@@ -78,7 +79,7 @@ julia> @epochs 2 Flux.train!(...)
 `train!` takes an additional argument, `cb`, that's used for callbacks so that you can observe the training process. For example:
 
 ```julia
-train!(objective, data, opt, cb = () -> println("training"))
+train!(objective, ps, data, opt, cb = () -> println("training"))
 ```
 
 Callbacks are called for every batch of training data. You can slow this down using `Flux.throttle(f, timeout)` which prevents `f` from being called more than once every `timeout` seconds.
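`Flux.throttle(f, timeout)`, mentioned in the hunk above, is a rate limiter: the wrapped function runs at most once per `timeout` seconds. A minimal Python sketch of the same idea (the implementation details are mine, not Flux's):

```python
import time

def throttle(f, timeout):
    """Return a wrapper that calls `f` at most once per `timeout` seconds;
    extra calls inside the window are silently dropped."""
    last = [float("-inf")]  # mutable cell holding the last-call timestamp
    def wrapper(*args, **kwargs):
        now = time.monotonic()
        if now - last[0] >= timeout:
            last[0] = now
            return f(*args, **kwargs)
        return None
    return wrapper

calls = []
cb = throttle(lambda: calls.append(1), 10)
for _ in range(5):
    cb()           # only the first call within the 10-second window fires
print(len(calls))  # 1
```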
@@ -89,6 +90,6 @@ A more typical callback might look like this:
 test_x, test_y = # ... create single batch of test data ...
 evalcb() = @show(loss(test_x, test_y))
 
-Flux.train!(objective, data, opt,
+Flux.train!(objective, ps, data, opt,
             cb = throttle(evalcb, 5))
 ```
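The `train!(loss, ps, data, opt, cb = ...)` calls in the hunks above all follow one pattern: for each datapoint, compute the gradient of the loss, let the optimiser update the parameters, then run the callback. A schematic Python version of that loop (the gradient function and optimiser step here are stand-ins, not Flux internals):

```python
def train(loss_grad, params, data, opt_step, cb=lambda: None):
    """For each datapoint d in data: compute the loss gradient at d,
    apply the optimiser update, then invoke the callback."""
    for d in data:
        grads = loss_grad(params, d)
        params = opt_step(params, grads)
        cb()
    return params

# Toy usage: fit scalar w against datapoints x with loss (w - x)^2.
data = [1.0, 2.0, 3.0]
loss_grad = lambda p, x: [2.0 * (p[0] - x)]
opt_step = lambda p, g: [p[0] - 0.1 * g[0]]  # plain descent, lr 0.1
w = [0.0]
for _ in range(200):
    w = train(loss_grad, w, data, opt_step)
# w settles in a small cycle around the data mean
```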
@@ -257,6 +257,14 @@ function update!(o::Optimiser, x, Δ)
   return Δ
 end
 
+"""
+`InvDecay(γ)`
+
+Apply inverse time decay to an optimiser
+```julia
+Optimiser(InvDecay(..), Opt(..))
+```
+"""
 mutable struct InvDecay
   gamma::Float64
   state::IdDict
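The docstring added above describes inverse time decay. The standard inverse-time rule scales the update at step n by 1/(1 + γ·n); a small sketch of that schedule (the naming is mine, and this shows only the formula, not Flux's per-parameter state handling):

```python
def inv_decay_factors(gamma, steps):
    """Inverse time decay: at step n, scale the update by 1 / (1 + gamma * n)."""
    return [1.0 / (1.0 + gamma * n) for n in range(1, steps + 1)]

# With γ = 0.5, each successive step shrinks the effective update.
factors = inv_decay_factors(0.5, 4)
print(factors)
```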
@@ -272,6 +280,16 @@ function update!(o::InvDecay, x, Δ)
   return Δ
 end
 
+"""
+`ExpDecay(eta, decay, decay_step, clip)`
+
+Schedule the learning rate `eta` by `decay` every `decay_step` till a minimum of `clip`.
+
+To apply exponential decay to an optimiser:
+```julia
+Optimiser(ExpDecay(..), Opt(..))
+```
+"""
 mutable struct ExpDecay
   eta::Float64
   decay::Float64
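The ExpDecay docstring above says `eta` is multiplied by `decay` every `decay_step` steps, floored at `clip`. A hedged sketch of that schedule (a reading of the docstring, not Flux's implementation):

```python
def exp_decay_eta(eta, decay, decay_step, clip, step):
    """Learning rate after `step` steps: eta * decay^(completed decay
    periods), never allowed to fall below `clip`."""
    periods = step // decay_step
    return max(eta * decay ** periods, clip)

# eta = 0.1, halved every 2 steps, floored at 0.02:
rates = [exp_decay_eta(0.1, 0.5, 2, 0.02, s) for s in range(8)]
print(rates)  # stair-steps down, then flattens at the clip value
```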
@@ -292,6 +310,11 @@ function update!(o::ExpDecay, x, Δ)
   @. Δ *= decay
 end
 
+"""
+`WeightDecay(wd)`
+
+Decay the weight parameter by `wd`
+"""
 mutable struct WeightDecay
   wd::Real
 end
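Weight decay of the kind documented above is conventionally an L2-style penalty: the gradient is augmented by `wd` times the parameter itself before the optimiser step. A sketch of that composition (Python names are mine; only the arithmetic is shown):

```python
def weight_decay_grad(grad, param, wd):
    """Augment the raw gradient with wd * param, equivalent to adding an
    L2 penalty of (wd / 2) * param^2 to the loss."""
    return grad + wd * param

def descent_step(param, grad, eta):
    return param - eta * grad

# With a zero loss gradient, only the decay term shrinks the weight:
w, g = 2.0, 0.0
w = descent_step(w, weight_decay_grad(g, w, wd=0.1), eta=0.5)
print(round(w, 4))  # 2.0 - 0.5 * (0.1 * 2.0) = 1.9
```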
@@ -45,7 +45,7 @@ function stop()
 end
 
 """
-    train!(model, loss, data, opt)
+    train!(loss, params, data, opt; cb)
 
 For each datapoint `d` in `data` computes the gradient of `loss(d...)` through
 backpropagation and calls the optimizer `opt`.
@@ -54,11 +54,11 @@ Takes a callback as keyword argument `cb`. For example, this will print "trainin
 every 10 seconds:
 
 ```julia
-Flux.train!(model, loss, data, opt,
+Flux.train!(loss, params, data, opt,
             cb = throttle(() -> println("training"), 10))
 ```
 
-The callback can return `:stop` to interrupt the training loop.
+The callback can call `Flux.stop()` to interrupt the training loop.
 
 Multiple optimisers and callbacks can be passed to `opt` and `cb` as arrays.
 """