Optimisers

Consider a simple linear regression. We create some dummy data, compute a loss, and backpropagate to obtain gradients for the parameters W and b.

using Flux, Flux.Tracker

W = param(rand(2, 5))
b = param(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = rand(5), rand(2) # Dummy data
l = loss(x, y) # tracked loss; roughly 3 for this random data

θ = Params([W, b]) # collect the trainable parameters
grads = Tracker.gradient(() -> loss(x, y), θ) # gradients of the loss with respect to θ
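
The returned grads can be indexed by the tracked parameters themselves to inspect each gradient (a quick check, reusing the variables above):

grads[W] # gradient of the loss with respect to W, same shape as W (2×5)
grads[b] # gradient with respect to b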

We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:

using Flux.Tracker: update!

η = 0.1 # Learning Rate
for p in (W, b)
  update!(p, -η * grads[p])
end

Running this will alter the parameters W and b, and our loss should go down. Flux provides a more general way to express optimiser updates like this:

opt = Descent(0.1) # Gradient descent with learning rate 0.1

for p in (W, b)
  update!(opt, p, grads[p])
end

An optimiser's update! accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass opt to our training loop, which will then update all of the model's parameters as it runs. Either way, we can easily swap Descent for a more advanced optimiser such as ADAM.
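
For example, the manual loop above works unchanged with ADAM; only the optimiser object differs (a minimal sketch, reusing W, b and grads from the earlier example):

opt = ADAM(0.001) # adaptive optimiser; keeps per-parameter state internally

for p in (W, b)
  update!(opt, p, grads[p])
end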

Optimiser Reference

All optimisers return an object that, when passed to train!, updates the parameters it is given.
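
For instance, the optimiser built below can be handed straight to train! together with the parameters and data (a minimal sketch, assuming the train!(loss, params, data, opt) signature; it reuses loss, x, y and θ = Params([W, b]) from the example above, and the one-batch data list is made up purely for illustration):

opt = Descent(0.1)
data = [(x, y)] # a single dummy (input, target) batch
Flux.train!(loss, θ, data, opt) # computes gradients of loss for θ and applies opt's update rule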

Descent(η)

Classic gradient descent optimiser with learning rate η. For each parameter p and its gradient δp, this runs p -= η*δp.
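
Schematically, one step of this rule could be written as follows (an illustrative sketch, not the library's implementation; descent_step! is a hypothetical helper name):

# Hypothetical helper, shown only to spell out the rule p -= η*δp for a plain array p.
descent_step!(p, δp, η) = (p .-= η .* δp)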

Momentum(η = 0.01, ρ = 0.9)

Gradient descent with learning rate η and momentum ρ.

Nesterov(η, ρ = 0.9)

Gradient descent with learning rate η and Nesterov momentum ρ.

ADAM(η = 0.001, β = (0.9, 0.999))

ADAM optimiser with learning rate η and decay-rate pair β = (β₁, β₂).
