Optimisers

Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.

using Flux, Flux.Tracker # param and back! come from the Tracker module

W = param(rand(2, 5))
b = param(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = rand(5), rand(2) # Dummy data
l = loss(x, y) # ~ 3
back!(l)

We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:

using Flux.Tracker: data, grad

function update()
  η = 0.1 # Learning Rate
  for p in (W, b)
    x, Δ = data(p), grad(p)
    x .-= η .* Δ # Apply the update
    Δ .= 0       # Clear the gradient
  end
end

If we call update, the parameters W and b will change and our loss should go down.
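To make the effect of a single step concrete, here is a minimal sketch of the same regression in plain Julia, with the gradients written out by hand so that it runs without Flux. The names W0, b0, x0, y0 and loss0 are hypothetical and used only for this illustration; the gradient formulas follow from differentiating the squared-error loss above.

```julia
# Plain-Julia sketch of one gradient descent step (no Flux required).
# All names here are hypothetical stand-ins for W, b, x, y and loss above.
W0, b0 = rand(2, 5), rand(2)
x0, y0 = rand(5), rand(2)

loss0(W, b) = sum((W * x0 .+ b .- y0) .^ 2)

before = loss0(W0, b0)

η = 0.1                       # learning rate, as in update above
r = W0 * x0 .+ b0 .- y0       # residual: prediction minus target
W0 .-= η .* (2 .* r * x0')    # ∂loss/∂W = 2 r xᵀ
b0 .-= η .* (2 .* r)          # ∂loss/∂b = 2 r

after = loss0(W0, b0)         # smaller than `before` for this data range
```

For inputs drawn from rand, as here, the step scales the residual by a factor whose magnitude is below one, so the loss decreases. With other data or a larger η the step can overshoot.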

There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.
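As a rough illustration of what "adding momentum" could look like, the sketch below keeps a velocity buffer per parameter and steps along it rather than along the raw gradient. The helper momentum_step! and the coefficient ρ are assumptions made for this example, not Flux functions, and it works on plain arrays.

```julia
# Hypothetical helper, not part of Flux: one classical momentum step.
# v is a velocity buffer with the same shape as x; ρ is the momentum coefficient.
function momentum_step!(x, v, Δ; η = 0.1, ρ = 0.9)
  v .= ρ .* v .+ η .* Δ   # decaying running sum of past gradients
  x .-= v                 # step along the accumulated direction
  return x
end

x = [1.0, 2.0]
v = zeros(2)
momentum_step!(x, v, [1.0, 1.0])  # v ≈ [0.1, 0.1],   x ≈ [0.9, 1.9]
momentum_step!(x, v, [1.0, 1.0])  # v ≈ [0.19, 0.19], x ≈ [0.71, 1.71]
```

With a constant gradient the second step is larger than the first, which is the behaviour momentum is meant to provide. Setting ρ = 0 recovers plain gradient descent.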

In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.

m = Chain(
  Dense(10, 5, σ),
  Dense(5, 2), softmax)

Instead of having to write [m[1].W, m[1].b, ...], Flux provides a function, params(m), that returns a list of all the parameters in the model for you.
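The idea behind collecting parameters can be sketched with a small recursive walk over the model. The types ToyDense and ToyChain and the function toy_params below are hypothetical stand-ins written for this example; this is not how Flux implements params, only an illustration of the concept.

```julia
# Hypothetical stand-ins for layers; these are not Flux types.
struct ToyDense
  W::Matrix{Float64}
  b::Vector{Float64}
end

struct ToyChain
  layers::Vector{Any}
end

# Collect parameter arrays by walking the model recursively.
toy_params(l::ToyDense) = Any[l.W, l.b]
toy_params(::Any) = Any[]   # parameter-free layers, such as plain functions

function toy_params(c::ToyChain)
  ps = Any[]
  for l in c.layers
    append!(ps, toy_params(l))
  end
  return ps
end

toy_m = ToyChain(Any[ToyDense(rand(5, 10), rand(5)),
                     ToyDense(rand(2, 5), rand(2)),
                     identity])
ps = toy_params(toy_m)   # four arrays: W and b of each ToyDense
```

The returned list holds the arrays themselves rather than copies, so an update loop that mutates them in place changes the model.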

For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various optimisers that make it more convenient.

opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1

opt()

An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data.
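The "takes a parameter list and returns a function" pattern is an ordinary closure. The sketch below shows the shape of it using a hypothetical ToyParam type that stores a value and its gradient; toy_sgd is an illustrative name and not Flux's SGD.

```julia
# Hypothetical parameter type holding a value and its gradient;
# this is not Flux's tracked array.
struct ToyParam
  data::Vector{Float64}
  grad::Vector{Float64}
end

# Takes a parameter list and a learning rate,
# returns a zero-argument function that performs one update.
function toy_sgd(ps, η)
  return function ()
    for p in ps
      p.data .-= η .* p.grad  # apply the update
      p.grad .= 0             # clear the gradient
    end
  end
end

p = ToyParam([1.0, 2.0], [10.0, -10.0])
toy_opt = toy_sgd([p], 0.1)
toy_opt()   # p.data ≈ [0.0, 3.0]; p.grad is now all zeros
```

Because the closure captures the parameter list, the training loop only needs to call it with no arguments, just as it would call update.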
