[{"location":"training/optimisers/#Optimisers-1","page":"Optimisers","title":"Optimisers","text":"","category":"section"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"using Flux\n\nW = rand(2, 5)\nb = rand(2)\n\npredict(x) = (W * x) .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\n\nθ = Params([W, b])\ngrads = gradient(() -> loss(x, y), θ)","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"using Flux.Optimise: update!\n\nη = 0.1 # Learning Rate\nfor p in (W, b)\n update!(p, -η * grads[p])\nend","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"Running this will alter the parameters W and b and our loss should go down. Flux provides a more general way to do optimiser updates like this.","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"opt = Descent(0.1) # Gradient descent with learning rate 0.1\n\nfor p in (W, b)\n update!(opt, p, grads[p])\nend","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"An optimiser update! accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass opt to our training loop, which will update all parameters of the model in a loop. However, we can now easily replace Descent with a more advanced optimiser such as ADAM.","category":"page"},{"location":"training/optimisers/#Optimiser-Reference-1","page":"Optimisers","title":"Optimiser Reference","text":"","category":"section"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"All optimisers return an object that, when passed to train!, will update the parameters passed to it.","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"Flux.Optimise.update!\nDescent\nMomentum\nNesterov\nRMSProp\nADAM\nRADAM\nAdaMax\nADAGrad\nADADelta\nAMSGrad\nNADAM\nADAMW","category":"page"},{"location":"training/optimisers/#Flux.Optimise.update!","page":"Optimisers","title":"Flux.Optimise.update!","text":"update!(x, x̄)\n\nUpdate the array x according to x .-= x̄.\n\n\n\n\n\nupdate!(opt, p, g)\nupdate!(opt, ps::Params, gs)\n\nPerform an update step of the parameters ps (or the single parameter p) according to optimizer opt and the gradients gs (the gradient g).\n\nAs a result, the parameters are mutated and the optimizer's internal state may change.\n\n\n\n\n\n","category":"function"},{"location":"training/optimisers/#Flux.Optimise.Descent","page":"Optimisers","title":"Flux.Optimise.Descent","text":"Descent(η = 0.1)\n\nClassic gradient descent optimiser with learning rate η. 
{"location":"training/optimisers/#Optimiser-Reference-1","page":"Optimisers","title":"Optimiser Reference","text":"","category":"section"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"All optimisers return an object that, when passed to train!, will update the parameters passed to it.","category":"page"},{"location":"training/optimisers/#","page":"Optimisers","title":"Optimisers","text":"Flux.Optimise.update!\nDescent\nMomentum\nNesterov\nRMSProp\nADAM\nRADAM\nAdaMax\nADAGrad\nADADelta\nAMSGrad\nNADAM\nADAMW","category":"page"},{"location":"training/optimisers/#Flux.Optimise.update!","page":"Optimisers","title":"Flux.Optimise.update!","text":"update!(x, x̄)\n\nUpdate the array x according to x .-= x̄.\n\n\n\n\n\nupdate!(opt, p, g)\nupdate!(opt, ps::Params, gs)\n\nPerform an update step of the parameters ps (or the single parameter p) according to optimizer opt and the gradients gs (the gradient g).\n\nAs a result, the parameters are mutated and the optimizer's internal state may change.\n\n\n\n\n\n","category":"function"},{"location":"training/optimisers/#Flux.Optimise.Descent","page":"Optimisers","title":"Flux.Optimise.Descent","text":"Descent(η = 0.1)\n\nClassic gradient descent optimiser with learning rate η. For each parameter p and its gradient δp, this runs p -= η*δp\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\n\nExamples\n\nopt = Descent()\n\nopt = Descent(0.3)\n\nps = params(model)\n\ngs = gradient(ps) do\n loss(x, y)\nend\n\nFlux.Optimise.update!(opt, ps, gs)\n\n\n\n\n\n","category":"type"},{"location":"training/optimisers/#Flux.Optimise.Momentum","page":"Optimisers","title":"Flux.Optimise.Momentum","text":"Momentum(η = 0.01, ρ = 0.9)\n\nGradient descent optimizer with learning rate η and momentum ρ.\n\nParameters\n\nLearning rate (η): Amount by which gradients are discounted before updating the weights.\nMomentum (ρ): Controls the acceleration of gradient des