basic training docs

parent 33a5d26e57
commit 17e40b1f76
docs/make.jl
@@ -7,10 +7,13 @@ makedocs(modules=[Flux],
         sitename = "Flux",
         assets = ["../flux.css"],
         pages = ["Home" => "index.md",
-                 "Models" =>
+                 "Building Models" =>
                    ["Basics" => "models/basics.md",
                     "Recurrence" => "models/recurrence.md",
                     "Layer Reference" => "models/layers.md"],
+                 "Training Models" =>
+                   ["Optimisers" => "training/optimisers.md",
+                    "Training" => "training/training.md"],
                  "Contributing & Help" => "contributing.md"])

deploydocs(
docs/src/training/optimisers.md
@@ -0,0 +1,54 @@

# Optimisers

Consider a [simple linear regression](../models/basics.html). We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`.

```julia
W = param(rand(2, 5))
b = param(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = rand(5), rand(2) # Dummy data
l = loss(x, y) # ~ 3

back!(l)
```

We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:

```julia
using Flux.Tracker: data, grad

function update()
  η = 0.1 # Learning Rate
  for p in (W, b)
    x, Δ = data(p), grad(p)
    x .-= η .* Δ # Apply the update
    Δ .= 0 # Clear the gradient
  end
end
```

If we call `update`, the parameters `W` and `b` will change and our loss should go down.
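
For example, calling it once and re-evaluating the loss on the same dummy batch:

```julia
update()
loss(x, y) # should now be lower than the original `l`
```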

There are two pieces here: one is that we need a list of trainable parameters for the model (`[W, b]` in this case), and the other is the update step. In this case the update is simply gradient descent (`x .-= η .* Δ`), but we might choose to do something more advanced, like adding momentum.
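
As a rough sketch of what a more advanced rule could look like, here is the same loop with momentum added by hand. The `velocities` dictionary and `update_momentum` are made up for this illustration; in practice you'd reach for one of Flux's built-in optimisers, described below.

```julia
velocities = Dict(p => zero(data(p)) for p in (W, b))

function update_momentum()
  η, ρ = 0.1, 0.9 # learning rate and momentum coefficient
  for p in (W, b)
    x, Δ, v = data(p), grad(p), velocities[p]
    v .= ρ .* v .+ η .* Δ # accumulate a velocity from past gradients
    x .-= v # apply the accumulated step
    Δ .= 0 # clear the gradient
  end
end
```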

In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.

```julia
m = Chain(
  Dense(10, 5, σ),
  Dense(5, 2), softmax)
```

Instead of having to write `[m[1].W, m[1].b, ...]`, Flux provides `params(m)`, which returns a list of all parameters in the model for you.
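
For instance, with the `Chain` above, a quick sketch of what this gives you (each `Dense` layer contributes a weight matrix and a bias vector):

```julia
ps = params(m) # all trainable parameters of `m`
length(ps) # 4: a weight matrix and a bias vector for each Dense layer
```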

For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various *optimisers* that make it more convenient.

```julia
opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1

opt() # Carry out the update, modifying `W` and `b`
```

An optimiser takes a parameter list and returns a function that does the same thing as `update` above. We can pass either `opt` or `update` to our [training loop](training.html), which will then run the optimiser after every mini-batch of data.
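
Roughly speaking, a full training loop just repeats this pattern over a dataset; the sketch below assumes `dataset` is some iterator of `(x, y)` pairs, and is essentially what `Flux.train!` automates for you.

```julia
for (x, y) in dataset
  l = loss(x, y)
  back!(l) # compute the gradients
  opt() # run the update, just as `update()` did above
end
```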
|
|
@ -0,0 +1,4 @@
|
|||
```julia
|
||||
Flux.train!(loss, repeated((x,y), 1000), SGD(params(m), 0.1),
|
||||
cb = throttle(() -> @show(loss(x, y)), 5))
|
||||
```
|