# Optimisers

Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters `W` and `b`.

```julia
using Flux

W = param(rand(2, 5))
b = param(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = rand(5), rand(2) # Dummy data
l = loss(x, y) # ~ 3
back!(l)
```
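
After the `back!` call, each parameter's gradient is available. As a quick sanity check (using the `grad` accessor from `Flux.Tracker`, introduced properly below), we can inspect the gradients directly:

```julia
using Flux.Tracker: grad

grad(W) # 2×5 array of ∂l/∂W
grad(b) # 2-element array of ∂l/∂b
```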

We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:

```julia
using Flux.Tracker: data, grad

function update()
  η = 0.1 # Learning Rate
  for p in (W, b)
    x, Δ = data(p), grad(p)
    x .-= η .* Δ # Apply the update
    Δ .= 0       # Clear the gradient
  end
end
```

If we call `update`, the parameters `W` and `b` will change and our loss should go down.
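
For example, repeating the backpropagate-and-update cycle a few times should steadily shrink the loss (the exact numbers depend on the random initialisation):

```julia
for i = 1:10
  back!(loss(x, y)) # Recompute the loss and accumulate fresh gradients
  update()          # Take one gradient descent step
end

loss(x, y) # Should now be well below the initial value
```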

There are two pieces here: one is that we need a list of trainable parameters for the model (`[W, b]` in this case), and the other is the update step. In this case the update is simply gradient descent (`x .-= η .* Δ`), but we might choose to do something more advanced, like adding momentum, as sketched below.
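
As a sketch of what "something more advanced" could look like, here is a hand-rolled momentum variant of `update`; the velocity store `v`, the name `update_momentum`, and the decay factor `ρ` are our own illustrative choices, not part of Flux:

```julia
using Flux.Tracker: data, grad

const v = Dict() # Per-parameter velocity, keyed by the parameter

function update_momentum(η = 0.1, ρ = 0.9)
  for p in (W, b)
    x, Δ = data(p), grad(p)
    vel = get!(v, p, zero(x))
    vel .= ρ .* vel .+ η .* Δ # Blend the new gradient into the running velocity
    x .-= vel                 # Step along the velocity, not the raw gradient
    Δ .= 0                    # Clear the gradient
  end
end
```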

In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.

```julia
m = Chain(
  Dense(10, 5, σ),
  Dense(5, 2), softmax)
```

Instead of having to write `[m[1].W, m[1].b, ...]`, Flux provides `params(m)`, which returns a list of all parameters in the model for you.
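
For instance, the manual loop above generalises to any model via `params`; this is a minimal sketch (the name `update_all` is ours):

```julia
function update_all(m, η = 0.1)
  for p in params(m) # Every trainable parameter in the model
    x, Δ = data(p), grad(p)
    x .-= η .* Δ # Apply the update
    Δ .= 0       # Clear the gradient
  end
end
```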

For the update step, there's nothing whatsoever wrong with writing the loop above; it'll work just fine. But Flux provides various optimisers that make it more convenient.

```julia
opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1

opt() # Carry out the update, modifying W and b
```

An optimiser takes a parameter list and returns a function that does the same thing as `update` above. We can pass either `opt` or `update` to our training loop, which will then run the optimiser after every mini-batch of data.
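
As a minimal sketch, such a loop might look like this, where `dataset` is a placeholder for your own iterable of `(x, y)` mini-batches:

```julia
for (x, y) in dataset
  back!(loss(x, y)) # Accumulate gradients for this mini-batch
  opt()             # Let the optimiser apply and clear them
end
```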