diff --git a/src/optimise/optimisers.jl b/src/optimise/optimisers.jl
index 4f121edf..611edddb 100644
--- a/src/optimise/optimisers.jl
+++ b/src/optimise/optimisers.jl
@@ -12,8 +12,8 @@ Classic gradient descent optimiser with learning rate `η`.
 For each parameter `p` and its gradient `δp`, this runs `p -= η*δp`
 
 # Parameters
-  - Learning rate (`η`): Amount by which the gradients are discounted before updating
-    the weights.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
 
 # Examples
 ```julia
@@ -24,7 +24,7 @@ opt = Descent(0.3)
 ps = params(model)
 
 gs = gradient(ps) do
-    loss(x, y)
+  loss(x, y)
 end
 
 Flux.Optimise.update!(opt, ps, gs)
@@ -46,10 +46,10 @@ end
 Gradient descent optimizer with learning rate `η` and momentum `ρ`.
 
 # Parameters
-  - Learning rate (`η`): Amount by which gradients are discounted before updating the
-    weights.
-  - Momentum (`ρ`): Controls the acceleration of gradient descent in the relevant direction
-    and therefore the dampening of oscillations.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
@@ -79,9 +79,10 @@ end
 Gradient descent optimizer with learning rate `η` and Nesterov momentum `ρ`.
 
 # Parameters
-  - Learning rate (`η`): Amount by which the gradients are discounted before updating the
-    weights.
-  - Nesterov momentum (`ρ`): The amount of Nesterov momentum to be applied.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Nesterov momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
@@ -115,8 +116,10 @@ algorithm. Often a good choice for recurrent networks. Parameters other than lea
 generally don't need tuning.
 
 # Parameters
-  - Learning rate (`η`)
-  - Momentum (`ρ`)
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
@@ -146,9 +149,10 @@ end
 [ADAM](https://arxiv.org/abs/1412.6980v8) optimiser.
 
 # Parameters
-  - Learning rate (`η`)
-  - Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-    second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -181,9 +185,10 @@ end
 [Rectified ADAM](https://arxiv.org/pdf/1908.03265v1.pdf) optimizer.
 
 # Parameters
-  - Learning rate (`η`)
-  - Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-    second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -223,9 +228,10 @@ end
 [AdaMax](https://arxiv.org/abs/1412.6980v9) is a variant of ADAM based on the ∞-norm.
 
 # Parameters
-  - Learning rate (`η`)
-  - Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-    second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -260,7 +266,8 @@ parameter specific learning rates based on how frequently it is updated.
 Parameters don't need tuning.
 
 # Parameters
-  - Learning rate (`η`)
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
 
 # Examples
 ```julia
@@ -291,7 +298,7 @@ rate based on a window of past gradient updates.
 Parameters don't need tuning.
 
 # Parameters
-  - Rho (`ρ`): Factor by which gradient is decayed at each time step.
+- Rho (`ρ`): Factor by which the gradient is decayed at each time step.
 
 # Examples
 ```julia
@@ -323,9 +330,10 @@ The [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ) version of the ADAM
 optimiser. Parameters don't need tuning.
 
 # Parameters
-  - Learning Rate (`η`)
-  - Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-    second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -358,9 +366,10 @@ end
 Parameters don't need tuning.
 
 # Parameters
-  - Learning rate (`η`)
-  - Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-    second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -394,10 +403,11 @@ end
 weight decay regularization.
 
 # Parameters
-  - Learning rate (`η`)
-  - Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-    second (β2) momentum estimate.
-  - `decay`: Decay applied to weights during optimisation.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
+- `decay`: Decay applied to weights during optimisation.
 
 # Examples
 ```julia
@@ -464,17 +474,18 @@ function apply!(o::InvDecay, x, Δ)
 end
 
 """
-    ExpDecay(eta = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
+    ExpDecay(η = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
 
-Discount the learning rate `eta` by the factor `decay` every `decay_step` steps till
+Discount the learning rate `η` by the factor `decay` every `decay_step` steps till
 a minimum of `clip`.
 
 # Parameters
-  - Learning rate (`eta`)
-  - `decay`: Factor by which the learning rate is discounted.
-  - `decay_step`: Schedule decay operations by setting number of steps between two decay
-    operations.
-  - `clip`: Minimum value of learning rate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- `decay`: Factor by which the learning rate is discounted.
+- `decay_step`: Schedule decay operations by setting the number of steps between
+  two decay operations.
+- `clip`: Minimum value of learning rate.
 
 # Examples
 To apply exponential decay to an optimiser:
@@ -510,7 +521,7 @@ end
 Decay weights by `wd`.
 
 # Parameters
-  - Weight decay (`wd`)
+- Weight decay (`wd`)
 """
 mutable struct WeightDecay
   wd::Real
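
As a reading aid for the docstring wording above, here is a small hand-written sketch of the behaviour being described: the `Descent` rule `p -= η*δp` and the way `ExpDecay` discounts the learning rate every `decay_step` steps down to `clip`. This is not Flux's implementation; the names `descent_step!` and `exp_decayed_eta` are made up for illustration, and real `ExpDecay` tracks its step count in internal state rather than taking it as an argument.

```julia
# Illustrative sketch only — not Flux's code.

# Descent: for each parameter p and gradient δp, apply p -= η*δp.
function descent_step!(p, δp; η = 0.1)
  p .-= η .* δp
  return p
end

# ExpDecay: multiply η by `decay` once per completed `decay_step` interval,
# never letting the effective rate fall below `clip`.
function exp_decayed_eta(η, step; decay = 0.1, decay_step = 1000, clip = 1e-4)
  n = step ÷ decay_step          # number of completed decay intervals
  return max(η * decay^n, clip)
end

p  = [1.0, 2.0]
δp = [0.5, -0.5]
descent_step!(p, δp; η = 0.3)    # p becomes [0.85, 2.15]

exp_decayed_eta(0.001, 2500)     # 0.001 * 0.1^2 = 1.0e-5, clipped up to 1e-4
```

The diff itself only touches docstrings; the actual update logic in the `apply!` methods is unchanged.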