[WIP] add optimiser docs

2018-11-12 17:42:52 +05:30 · 2018-11-12 17:42:52 +05:30 · 4562682528
commit 4562682528
parent 392c3c942b
2 changed files with 58 additions and 6 deletions
--- a/docs/src/training/optimisers.md
+++ b/docs/src/training/optimisers.md
@ -48,20 +48,72 @@ Instead of having to write `[m[1].W, m[1].b, ...]`, Flux provides a params funct
 For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various *optimisers* that make it more convenient.

 ```julia
-opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1
+opt = Descent(0.1) # Gradient descent with learning rate 0.1

-opt() # Carry out the update, modifying `W` and `b`.
+update!(opt, params(m)) # Carry out the update, modifying `W` and `b`.
 ```

 An optimiser takes a parameter list and returns a function that does the same thing as `update` above. We can pass either `opt` or `update` to our [training loop](training.md), which will then run the optimiser after every mini-batch of data.

 ## Optimiser Reference

-All optimisers return a function that, when called, will update the parameters passed to it.
+Each optimiser constructor returns a `struct` that, when passed to `update!`, updates the parameters given to it.

 ```@docs
 SGD
+Descent
 Momentum
 Nesterov
 ADAM
 ```
+
+## Optimiser API
+
+All optimisers now exist as their own `struct`s, which hold the parameters their respective update rules require.
+Each update rule is implemented by overloading `Flux.Optimise.update!`, which takes the optimiser, the data and the gradients of the parameters, and returns the change (the step) to apply. The design looks like this:
+
+```julia
+mutable struct Descent
+  eta::Float64
+end
+
+function update!(o::Descent, x, Δ)
+  Δ .*= o.eta
+end
+```
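+
+As a rough illustration of how the returned step might be consumed, the sketch below redefines a standalone `Descent` and `update!` (mirroring the design above, without loading Flux) and applies the step by hand. The helper `apply_step!` and the sample arrays are hypothetical and exist only for illustration; Flux's own parameter update may differ.
+
+```julia
+# Standalone sketch: no Flux required. Names mirror the design above.
+mutable struct Descent
+  eta::Float64
+end
+
+# Scale the gradient in place by the learning rate and return it as the step.
+function update!(o::Descent, x, Δ)
+  Δ .*= o.eta
+end
+
+# Hypothetical helper: subtract the step from the parameters in place.
+apply_step!(o, x, Δ) = (x .-= update!(o, x, Δ); x)
+
+x = [1.0, 2.0]   # parameters
+Δ = [0.5, -1.0]  # gradients
+apply_step!(Descent(0.1), x, Δ)  # x is now approximately [0.95, 2.1]
+```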
+
+After this, it is sufficient to call either `Flux.train!` as usual or `Optimise.update!(opt, params(model))` in a custom training loop. Along with this, the training loop API changes: `Flux.train!` now takes the model parameters as an argument.
+
+The `struct`s decouple an optimiser's structure from its update rule, so the two can be treated as independent entities. This means we can change optimiser parameters at will and hook together custom optimisers, with or without the predefined ones.
+
+```julia
+opt = Descent(0.5)
+update!(opt, params(model))
+opt.eta = 0.2 # valid statement, useful for annealing/scaling
+```
+
+The `ExpDecay` optimiser defined within Flux takes advantage of this flexibility. It can be used to schedule the learning rate, making it easy to scale the learning rate every `n` epochs. Additionally, it is easy to specify a `clip`, a bound on the learning rate, at which it is held for the remainder of training.
+
+```julia
+mutable struct ExpDecay
+  eta::Float64
+  decay::Float64
+  step::Int64
+  clip::Float64
+  current::IdDict
+end
+```
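+
+One way to picture the schedule these fields describe is the pure function below. It is a hypothetical sketch, not Flux's implementation: the name `decayed_eta`, the iteration counter `n`, and the assumption that `clip` acts as a lower bound are all illustrative.
+
+```julia
+# Hypothetical helper, not part of Flux: learning rate after `n` iterations,
+# multiplied by `decay` once every `step` iterations and never below `clip`.
+decayed_eta(eta, decay, step, clip, n) = max(eta * decay^fld(n, step), clip)
+
+decayed_eta(0.1, 0.5, 10, 0.02, 0)    # 0.1, no decay applied yet
+decayed_eta(0.1, 0.5, 10, 0.02, 25)   # 0.025, decayed twice
+decayed_eta(0.1, 0.5, 10, 0.02, 100)  # 0.02, held at the clip
+```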
+
+## Optimiser
+
+An equally easy-to-use interface is `Optimiser`, which is designed for creating compound optimisers or, more generally, for taking an action in the training loop as defined on the parameters. The `update!` API remains unified.
+
+```julia
+opt1 = Descent()
+opt2 = Optimiser(InvDecay(), RMSProp())
+opt = Optimiser(opt1, opt2)
+
+update!(opt, params(model))
+```
+
+`opt = Optimiser(ExpDecay(), ADAM())` generates an optimiser that applies the previously discussed `ExpDecay` to the `ADAM` optimiser during training. It can also be nested as `Optimiser(..., Optimiser(...))` to create sophisticated, general optimisers that can be customised extensively. It follows many of Julia's collection semantics, so it is possible to `push!` to them, index into them, slice them, and so on.
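+
+To make the composition concrete, the standalone sketch below chains update rules by feeding each rule's step into the next. It does not load Flux and is not Flux's implementation: `Scale`, `Compose` and `step!` are hypothetical stand-ins for an update rule, a compound optimiser and `update!`.
+
+```julia
+# Hypothetical update rule: scales the incoming step in place.
+struct Scale
+  factor::Float64
+end
+step!(o::Scale, x, Δ) = (Δ .*= o.factor)
+
+# Hypothetical compound optimiser holding an ordered list of rules.
+struct Compose
+  os::Vector{Any}
+  Compose(os...) = new(Any[os...])
+end
+
+# Pass the step through each rule in order; each rule sees the previous output.
+function step!(o::Compose, x, Δ)
+  for opt in o.os
+    Δ = step!(opt, x, Δ)
+  end
+  return Δ
+end
+
+opt = Compose(Scale(0.5), Compose(Scale(0.1)))  # compound optimisers can nest
+push!(opt.os, Scale(2.0))                       # and grow after construction
+Δ = step!(opt, [1.0], [4.0])                    # 4.0 * 0.5 * 0.1 * 2.0 ≈ 0.4
+```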
--- a/docs/src/training/training.md
+++ b/docs/src/training/training.md
@ -9,7 +9,7 @@ To actually train a model we need three things:
 With these we can call `Flux.train!`:

 ```julia
-Flux.train!(objective, data, opt)
+Flux.train!(objective, params, data, opt)
 ```

 There are plenty of examples in the [model zoo](https://github.com/FluxML/model-zoo).
@ -26,7 +26,7 @@ m = Chain(
 loss(x, y) = Flux.mse(m(x), y)

 # later
-Flux.train!(loss, data, opt)
+Flux.train!(loss, params, data, opt)
 ```

 The objective will almost always be defined in terms of some *cost function* that measures the distance of the prediction `m(x)` from the target `y`. Flux has several of these built in, like `mse` for mean squared error or `crossentropy` for cross entropy loss, but you can calculate it however you want.
@ -78,7 +78,7 @@ julia> @epochs 2 Flux.train!(...)
 `train!` takes an additional argument, `cb`, that's used for callbacks so that you can observe the training process. For example:

 ```julia
-train!(objective, data, opt, cb = () -> println("training"))
+train!(objective, params, data, opt, cb = () -> println("training"))
 ```

 Callbacks are called for every batch of training data. You can slow this down using `Flux.throttle(f, timeout)` which prevents `f` from being called more than once every `timeout` seconds.