Improve parameter lists in optimisers.jl
commit a614983e0b, parent aaa0a82b74
@@ -12,8 +12,8 @@ Classic gradient descent optimiser with learning rate `η`.
 For each parameter `p` and its gradient `δp`, this runs `p -= η*δp`
 
 # Parameters
-- Learning rate (`η`): Amount by which the gradients are discounted before updating
+- Learning rate (`η`): Amount by which gradients are discounted before updating
   the weights.
 
 # Examples
 ```julia
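
The rule quoted in this hunk, `p -= η*δp`, is the entire Descent update. As a standalone illustration in plain Julia on a toy quadratic loss (not Flux's internal `apply!` code), a single step looks like this:

```julia
# One plain gradient-descent step on a toy quadratic loss.
# Illustration only; Flux applies the same rule via apply!/update!.
η  = 0.1            # learning rate
p  = [2.0, -3.0]    # parameters
δp = 2 .* p         # gradient of the toy loss sum(abs2, p)

p .-= η .* δp       # the update the docstring describes: p -= η*δp
println(p)          # [1.6, -2.4], a small move against the gradient
```
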
@@ -24,7 +24,7 @@ opt = Descent(0.3)
 ps = params(model)
 
 gs = gradient(ps) do
     loss(x, y)
 end
 
 Flux.Optimise.update!(opt, ps, gs)
@@ -46,10 +46,10 @@ end
 Gradient descent optimizer with learning rate `η` and momentum `ρ`.
 
 # Parameters
-- Learning rate (`η`): Amount by which gradients are discounted before updating the
-  weights.
-- Momentum (`ρ`): Controls the acceleration of gradient descent in the relevant direction
-  and therefore the dampening of oscillations.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
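
For the momentum term described above, a rough sketch of the common heavy-ball formulation (`v = ρ*v - η*δp; p += v`) shows how the velocity buffer accumulates past gradients; this is an illustration under that assumption, not Flux's exact `apply!` code:

```julia
# Rough sketch of classical momentum on a toy quadratic loss.
# Assumes the common heavy-ball form; not Flux's exact implementation.
η, ρ = 0.01, 0.9
p = [2.0, -3.0]
v = zero(p)                  # one velocity buffer per parameter array

for step in 1:200
    δp = 2 .* p              # gradient of sum(abs2, p)
    @. v = ρ * v - η * δp    # velocity: a decaying sum of past gradients
    @. p = p + v             # large ρ accelerates and smooths oscillations
end
println(p)                   # close to the minimiser at zero
```
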
@@ -79,9 +79,10 @@ end
 Gradient descent optimizer with learning rate `η` and Nesterov momentum `ρ`.
 
 # Parameters
-- Learning rate (`η`): Amount by which the gradients are discounted before updating the
-  weights.
-- Nesterov momentum (`ρ`): The amount of Nesterov momentum to be applied.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Nesterov momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
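
The Nesterov variant differs from plain momentum mainly in where the gradient is evaluated. A rough sketch using the common look-ahead formulation, assumed here for illustration rather than taken from Flux's implementation:

```julia
# Rough sketch of Nesterov momentum: evaluate the gradient at the
# look-ahead point p + ρ*v rather than at p. Illustration only.
η, ρ = 0.01, 0.9
p = [2.0, -3.0]
v = zero(p)

grad(q) = 2 .* q             # gradient of the toy loss sum(abs2, q)

for step in 1:200
    δp = grad(p .+ ρ .* v)   # gradient after a provisional momentum step
    @. v = ρ * v - η * δp
    @. p = p + v
end
println(p)
```
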
@@ -115,8 +116,10 @@ algorithm. Often a good choice for recurrent networks. Parameters other than learning rate
 generally don't need tuning.
 
 # Parameters
-- Learning rate (`η`)
-- Momentum (`ρ`)
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
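
In RMSProp the `ρ` parameter decays a running average of squared gradients, which then rescales every step. A rough standalone sketch of that idea, not Flux's exact implementation:

```julia
# Rough sketch of RMSProp: divide each step by the root of a running
# average of squared gradients. Illustration only.
η, ρ, ϵ = 0.01, 0.9, 1e-8
p   = [2.0, -3.0]
acc = zero(p)                              # running mean of squared gradients

for step in 1:300
    δp = 2 .* p
    @. acc = ρ * acc + (1 - ρ) * δp^2      # ρ sets the averaging window
    @. p   = p - η * δp / (sqrt(acc) + ϵ)  # per-parameter adaptive step size
end
println(p)
```
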
@@ -146,9 +149,10 @@ end
 [ADAM](https://arxiv.org/abs/1412.6980v8) optimiser.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
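
The `β::Tuple` entry refers to two running estimates: β1 decays a mean of gradients, β2 a mean of squared gradients, and both are bias-corrected before use. A rough standalone sketch of those updates, not Flux's exact code:

```julia
# Rough sketch of ADAM's two moment estimates and their bias correction.
# Illustration only; values follow the usual defaults.
η, ϵ = 0.001, 1e-8
β1, β2 = 0.9, 0.999              # the β::Tuple from the docstring
p = [2.0, -3.0]
m = zero(p)                      # first moment: decayed mean of gradients
v = zero(p)                      # second moment: decayed mean of squared gradients

for t in 1:1000
    δp = 2 .* p
    @. m = β1 * m + (1 - β1) * δp
    @. v = β2 * v + (1 - β2) * δp^2
    m̂ = m ./ (1 - β1^t)          # bias-corrected estimates
    v̂ = v ./ (1 - β2^t)
    @. p = p - η * m̂ / (sqrt(v̂) + ϵ)
end
println(p)
```
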
@@ -181,9 +185,10 @@ end
 [Rectified ADAM](https://arxiv.org/pdf/1908.03265v1.pdf) optimizer.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -223,9 +228,10 @@ end
 [AdaMax](https://arxiv.org/abs/1412.6980v9) is a variant of ADAM based on the ∞-norm.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
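
The ∞-norm mentioned in this hunk means AdaMax replaces ADAM's decayed mean of squared gradients with a decayed maximum of absolute gradients. A rough standalone sketch of that recursion, not Flux's exact code; the small ϵ here only guards against division by zero:

```julia
# Rough sketch of AdaMax: the second moment becomes an exponentially
# weighted infinity norm. Illustration only.
η, ϵ = 0.002, 1e-8
β1, β2 = 0.9, 0.999
p = [2.0, -3.0]
m = zero(p)                          # first moment, as in ADAM
u = zero(p)                          # decayed maximum of |gradient|

for t in 1:1000
    δp = 2 .* p
    @. m = β1 * m + (1 - β1) * δp
    @. u = max(β2 * u, abs(δp))      # the ∞-norm recursion
    @. p = p - (η / (1 - β1^t)) * m / (u + ϵ)
end
println(p)
```
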
@@ -260,7 +266,8 @@ parameter specific learning rates based on how frequently it is updated.
 Parameters don't need tuning.
 
 # Parameters
-- Learning rate (`η`)
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
 
 # Examples
 ```julia
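
The hunk context above mentions parameter-specific learning rates based on how frequently a parameter is updated; the mechanism is an accumulator of squared gradients that only ever grows. A rough standalone sketch, not Flux's exact code:

```julia
# Rough sketch of ADAGrad: accumulate all squared gradients, so parameters
# with large or frequent gradients take smaller steps. Illustration only.
η, ϵ = 0.1, 1e-8
p   = [2.0, -3.0]
acc = zero(p)                    # sum of squared gradients, never decayed

for step in 1:300
    δp = 2 .* p
    @. acc = acc + δp^2
    @. p   = p - η * δp / (sqrt(acc) + ϵ)
end
println(p)
```
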
@@ -291,7 +298,7 @@ rate based on a window of past gradient updates.
 Parameters don't need tuning.
 
 # Parameters
-- Rho (`ρ`): Factor by which gradient is decayed at each time step.
+- Rho (`ρ`): Factor by which the gradient is decayed at each time step.
 
 # Examples
 ```julia
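
For ADADelta, `ρ` decays two windows at once, one over squared gradients and one over squared updates, and their ratio sets the step size, which is why no learning rate appears in the parameter list. A rough standalone sketch, not Flux's exact code:

```julia
# Rough sketch of ADADelta: the ratio of two decayed accumulators replaces
# a hand-tuned learning rate. Illustration only.
ρ, ϵ = 0.9, 1e-8
p    = [2.0, -3.0]
acc  = zero(p)                   # decayed average of squared gradients
Δacc = zero(p)                   # decayed average of squared updates

for step in 1:1000
    δp = 2 .* p
    @. acc = ρ * acc + (1 - ρ) * δp^2
    Δ = @. δp * sqrt(Δacc + ϵ) / sqrt(acc + ϵ)   # window-scaled step
    @. Δacc = ρ * Δacc + (1 - ρ) * Δ^2
    @. p = p - Δ
end
println(p)
```
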
@@ -323,9 +330,10 @@ The [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ) version of the ADAM
 optimiser. Parameters don't need tuning.
 
 # Parameters
-- Learning Rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -358,9 +366,10 @@ end
 Parameters don't need tuning.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -394,10 +403,11 @@ end
 weight decay regularization.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
-- `decay`: Decay applied to weights during optimisation.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
+- `decay`: Decay applied to weights during optimisation.
 
 # Examples
 ```julia
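
The extra `decay` entry is the weight-decay strength applied on top of the ADAM step. A rough standalone sketch of one common decoupled formulation; the exact form here is an assumption for illustration and not Flux's implementation:

```julia
# Rough sketch of ADAMW: an ADAM-style step plus a term that shrinks the
# weights directly. One common formulation, for illustration only.
η, ϵ, decay = 0.001, 1e-8, 0.01
β1, β2 = 0.9, 0.999
p = [2.0, -3.0]
m, v = zero(p), zero(p)

for t in 1:1000
    δp = 2 .* p
    @. m = β1 * m + (1 - β1) * δp
    @. v = β2 * v + (1 - β2) * δp^2
    Δadam = @. η * (m / (1 - β1^t)) / (sqrt(v / (1 - β2^t)) + ϵ)
    @. p = p - Δadam - η * decay * p   # decay acts on the weights themselves
end
println(p)
```
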
@@ -464,17 +474,18 @@ function apply!(o::InvDecay, x, Δ)
 end
 
 """
-    ExpDecay(eta = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
+    ExpDecay(η = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
 
-Discount the learning rate `eta` by the factor `decay` every `decay_step` steps till
+Discount the learning rate `η` by the factor `decay` every `decay_step` steps till
 a minimum of `clip`.
 
 # Parameters
-- Learning rate (`eta`)
-- `decay`: Factor by which the learning rate is discounted.
-- `decay_step`: Schedule decay operations by setting number of steps between two decay
-  operations.
-- `clip`: Minimum value of learning rate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- `decay`: Factor by which the learning rate is discounted.
+- `decay_step`: Schedule decay operations by setting the number of steps between
+  two decay operations.
+- `clip`: Minimum value of learning rate.
 
 # Examples
 To apply exponential decay to an optimiser:
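
The four ExpDecay parameters describe a schedule: every `decay_step` optimiser steps the learning rate is multiplied by `decay`, but it never drops below `clip`. Flux applies this per step inside the optimiser; the sketch below only illustrates the resulting schedule as a function of the step count:

```julia
# Rough sketch of the ExpDecay schedule: multiply η by `decay` every
# `decay_step` steps, floored at `clip`. Illustration only.
η, decay, decay_step, clip = 0.001, 0.1, 1000, 1e-4

effective_eta(t) = max(η * decay^fld(t, decay_step), clip)

for t in (0, 999, 1000, 2000, 5000)
    println("step $t: effective η = ", effective_eta(t))   # clipped at 1e-4
end
```
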
@@ -510,7 +521,7 @@ end
 Decay weights by `wd`.
 
 # Parameters
 - Weight decay (`wd`)
 """
 mutable struct WeightDecay
   wd::Real
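
WeightDecay is typically chained with another optimiser and works as a gradient transform: adding `wd * p` to the gradient has the same effect as an L2 penalty on the weights. A rough standalone sketch of that effect, not Flux's exact `apply!` code:

```julia
# Rough sketch of weight decay as a gradient transform: the chained
# optimiser then sees the gradient of loss + (wd/2) * sum(abs2, p).
wd = 0.01
p  = [2.0, -3.0]
δp = [0.5, -0.25]              # some upstream gradient

δp_decayed = @. δp + wd * p    # penalised gradient handed to the optimiser
println(δp_decayed)            # [0.52, -0.28]
```
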