fixed optimisers syntax

Dhairya Gandhi 2018-12-04 16:08:03 +05:30
parent d412845192
commit eb287ae9a0
1 changed file with 9 additions and 68 deletions


@@ -23,7 +23,7 @@ We want to update each parameter, using the gradient, in order to improve (reduc
```julia
using Flux.Tracker: grad, update!
-struct sgd()
+function sgd()
  η = 0.1 # Learning Rate
  for p in (W, b)
    update!(p, -η * grads[p])
```
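The hunk above cuts the function off; completed, the manual update step looks roughly like this (a minimal sketch, assuming `W`, `b`, and `grads` are defined as in the surrounding docs):
```julia
using Flux.Tracker: grad, update!

function sgd()
  η = 0.1 # Learning Rate
  for p in (W, b)
    # step each parameter against its gradient to reduce the loss
    update!(p, -η * grads[p])
  end
end
```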
@@ -50,77 +50,18 @@ For the update step, there's nothing whatsoever wrong with writing the loop abov
```julia
opt = Descent(0.1) # Gradient descent with learning rate 0.1
-update!(opt, params(m)) # Carry out the update, modifying `W` and `b`.
+Optimise.update!(opt, [W, b]) # Carry out the update, modifying `W` and `b`.
```
-An optimiser takes a parameter list and returns a function that does the same thing as `update` above. We can pass either `opt` or `update` to our [training loop](training.md), which will then run the optimiser after every mini-batch of data.
+An optimiser takes a parameter list and returns an object that holds the current values in the optimiser. We can pass `opt` to our [training loop](training.md), which will then run the `update!` step for the optimiser after every mini-batch of data.
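In practice this means constructing the optimiser once and handing it to the training loop; a minimal sketch, assuming the `train!(loss, params, data, opt)` signature introduced alongside this change (the model, loss, and data here are stand-ins):
```julia
using Flux

m = Dense(10, 2)                 # stand-in model
loss(x, y) = Flux.mse(m(x), y)   # stand-in loss
data = [(rand(10), rand(2))]     # one dummy mini-batch

opt = Descent(0.1)
Flux.train!(loss, params(m), data, opt)  # update! runs after each mini-batch
```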
## Optimiser Reference
-All optimisers return a `struct` that, when called with their `update!`, will update the parameters passed to it.
+All optimisers return an object that, when passed to `train!`, will update the parameters passed to it.
* [Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* [Momentum](https://arxiv.org/abs/1712.09677)
* [Nesterov](https://arxiv.org/abs/1607.01981)
* [RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
* [ADAM](https://arxiv.org/abs/1412.6980v8)
* [AdaMax](https://arxiv.org/abs/1412.6980v9)
* [ADAGrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
* [ADADelta](http://arxiv.org/abs/1212.5701)
* [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ)
* [NADAM](http://cs229.stanford.edu/proj2015/054_report.pdf)
* [ADAMW](https://arxiv.org/abs/1711.05101)
* InvDecay
* ExpDecay
* WeightDecay
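Each of the above is constructed with its hyperparameters as arguments; for instance (the values shown are illustrative defaults, not prescriptions):
```julia
Descent(0.1)         # plain gradient descent with η = 0.1
Momentum(0.01, 0.9)  # gradient descent with momentum ρ = 0.9
RMSProp(0.001)       # RMSProp with learning rate η = 0.001
ADAM(0.001)          # ADAM with η = 0.001 and default β
```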
## Optimiser API
All optimisers now exist as their own `struct`s, which house all the different parameters required to satisfy their respective update rules.
Each optimiser implements its rule by overloading the `Flux.Optimise.update!` method, which takes the optimiser, the parameter data, and the gradients of the parameters, and returns the change (or the step) from the update. The design looks like this:
```julia
mutable struct Descent
  eta::Float64
end

function update!(o::Descent, x, Δ)
  Δ .*= o.eta
end
```
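Under this design, applying `update!` scales the gradient in place by the stored learning rate. A quick illustrative check, using the sketched `Descent` above (the values are hypothetical):
```julia
opt = Descent(0.1)
Δ = [1.0, 2.0]            # a dummy gradient
update!(opt, nothing, Δ)  # `x` is unused by Descent's rule
Δ == [0.1, 0.2]           # the step is the gradient scaled by eta
```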
```@docs
SGD
Momentum
Nesterov
ADAM
```
After this, it is sufficient to either call `Flux.train!` as usual or to call `Optimise.update!(opt, params(model))` in a custom training loop. This also comes with a change to the training loop's API, which now takes the model parameters explicitly.
The `struct`s decouple the optimiser from its update rule, allowing us to treat them as independent entities. This means we can change the optimiser's parameters at will, and hook custom optimisers together, with or without the predefined ones.
```julia
opt = Descent(0.5)
update!(opt, params(model))
opt.eta = 0.2 # valid statement, useful for annealing/scaling
```
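For instance, a crude annealing schedule can shrink the learning rate between epochs (a sketch; `loss`, `data`, and `model` are assumed from your own setup):
```julia
opt = Descent(0.5)
for epoch in 1:10
  Flux.train!(loss, params(model), data, opt)
  opt.eta /= 2  # halve the learning rate after every epoch
end
```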
The `ExpDecay` optimiser defined within Flux takes advantage of this flexibility. It can be used as a way of scheduling the learning rate, making it easy to scale the learning rate every `n` steps. Additionally, it is easy to specify a `clip`, a lower bound on the learning rate beyond which it will be held constant throughout the remainder of the training.
```julia
ExpDecay(eta = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
```
The above would take the initial learning rate `0.001` and decay it by a factor of `0.1` every `1000` steps, until it reaches a minimum of `1e-4`. It can be applied to any optimiser like so:
```julia
Optimiser(ExpDecay(...), Descent(...))
```
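Filling in the arguments, composing the schedule above with plain gradient descent might look like this (a sketch; `Descent(1.0)` is an assumption that leaves the step size controlled entirely by the schedule):
```julia
opt = Optimiser(ExpDecay(0.001, 0.1, 1000, 1e-4), Descent(1.0))
Flux.train!(loss, params(model), data, opt)
```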
## Optimiser
An equally easy-to-use interface is `Optimiser`, which is designed for creating compound optimisers, or more generally for chaining together actions to be taken over the parameters during training. The `update!` API remains unified.
```julia
opt1 = Descent()
opt2 = Optimiser(InvDecay(), RMSProp())
opt = Optimiser(opt1, opt2)
update!(opt, params(model))
```
`opt = Optimiser(ExpDecay(), ADAM())` generates an optimiser that applies the previously discussed `ExpDecay` to the `ADAM` optimiser during training. It can also be nested, as `Optimiser(..., Optimiser(...))`, to create sophisticated and highly customisable optimisers. It follows many of Julia's collection semantics, so it is possible to `push!` to an `Optimiser`, index into it, slice it, etc.
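For example, treating a compound optimiser as a collection (a sketch of the container behaviour described above; `WeightDecay(1e-4)` is just an illustrative extra rule):
```julia
opt = Optimiser(ExpDecay(), ADAM())
push!(opt, WeightDecay(1e-4))  # append another rule to the chain
opt[2]                         # index: retrieves the ADAM optimiser
opt[1:2]                       # slice: the first two rules as a new Optimiser
```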