Improve parameter lists in optimisers.jl

janEbert 2019-10-25 13:23:27 +02:00
parent aaa0a82b74
commit a614983e0b
1 changed file with 52 additions and 41 deletions


@@ -12,7 +12,7 @@ Classic gradient descent optimiser with learning rate `η`.
For each parameter `p` and its gradient `δp`, this runs `p -= η*δp`
# Parameters
- Learning rate (`η`): Amount by which the gradients are discounted before updating
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
# Examples
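To make the quoted rule concrete, a single step can be written out by hand. The values below are arbitrary and this is only an illustrative sketch, not the docstring's own example:

```julia
# One hand-written Descent step with made-up numbers.
η  = 0.1                # learning rate
p  = [1.0, 2.0]         # a parameter array
δp = [0.5, -0.5]        # its gradient
p .-= η .* δp           # the rule from the docstring: p -= η*δp
# p is now [0.95, 2.05]
```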
@@ -46,10 +46,10 @@ end
Gradient descent optimizer with learning rate `η` and momentum `ρ`.
# Parameters
- Learning rate (`η`): Amount by which gradients are discounted before updating the
weights.
- Momentum (`ρ`): Controls the acceleration of gradient descent in the relevant direction
and therefore the dampening of oscillations.
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Momentum (`ρ`): Controls the acceleration of gradient descent in the
prominent direction, in effect dampening oscillations.
# Examples
```julia
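# Not part of this docstring: an illustrative sketch of how η and ρ interact
# in a momentum-style step, with made-up values and a zero-initialised buffer.
η, ρ = 0.01, 0.9
p, δp = [1.0, 2.0], [0.5, -0.5]
v = zero(p)              # velocity buffer
v .= ρ .* v .- η .* δp   # ρ retains part of the previous update direction,
p .+= v                  # which is what dampens oscillations across steps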
@@ -79,9 +79,10 @@ end
Gradient descent optimizer with learning rate `η` and Nesterov momentum `ρ`.
# Parameters
- Learning rate (`η`): Amount by which the gradients are discounted before updating the
weights.
- Nesterov momentum (`ρ`): The amount of Nesterov momentum to be applied.
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Nesterov momentum (`ρ`): Controls the acceleration of gradient descent in the
prominent direction, in effect dampening oscillations.
# Examples
```julia
@@ -115,8 +116,10 @@ algorithm. Often a good choice for recurrent networks. Parameters other than learning rate
generally don't need tuning.
# Parameters
- Learning rate (`η`)
- Momentum (`ρ`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Momentum (`ρ`): Controls the acceleration of gradient descent in the
prominent direction, in effect dampening oscillations.
# Examples
```julia
@@ -146,7 +149,8 @@ end
[ADAM](https://arxiv.org/abs/1412.6980v8) optimiser.
# Parameters
- Learning rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
second (β2) momentum estimate.
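A hedged sketch of how the `η` and `β` arguments documented here are typically passed, assuming the positional constructor `ADAM(η, β::Tuple)` and Flux's `update!` helper:

```julia
using Flux
using Flux.Optimise: update!

opt = ADAM(0.001, (0.9, 0.999))   # learning rate, then (β1, β2) decay rates

# Apply a single optimiser step to a dummy parameter/gradient pair.
p  = rand(3)
δp = rand(3)
update!(opt, p, δp)               # rewrites p in place (δp is also overwritten)
```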
@@ -181,7 +185,8 @@ end
[Rectified ADAM](https://arxiv.org/pdf/1908.03265v1.pdf) optimizer.
# Parameters
- Learning rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
second (β2) momentum estimate.
@@ -223,7 +228,8 @@ end
[AdaMax](https://arxiv.org/abs/1412.6980v9) is a variant of ADAM based on the ∞-norm.
# Parameters
- Learning rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
second (β2) momentum estimate.
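Since the ∞-norm is the distinguishing detail here, a short sketch of the idea from the Adam paper may help: instead of an exponentially decayed second moment, AdaMax tracks a running maximum of gradient magnitudes. This illustrates the formula only; it is not code from this file:

```julia
# uₜ = max(β2·uₜ₋₁, |gₜ|): the ∞-norm style scale AdaMax uses in place of
# ADAM's exponentially decayed second moment.
β2 = 0.999
u  = 0.0
for g in (0.3, -0.7, 0.2)
    global u = max(β2 * u, abs(g))   # keep the (slightly decayed) largest |g| seen so far
end
# u ≈ 0.6993
```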
@@ -260,7 +266,8 @@ parameter specific learning rates based on how frequently it is updated.
Parameters don't need tuning.
# Parameters
- Learning rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
# Examples
```julia
@@ -291,7 +298,7 @@ rate based on a window of past gradient updates.
Parameters don't need tuning.
# Parameters
- Rho (`ρ`): Factor by which gradient is decayed at each time step.
- Rho (`ρ`): Factor by which the gradient is decayed at each time step.
# Examples
```julia
@@ -323,7 +330,8 @@ The [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ) version of the ADAM
optimiser. Parameters don't need tuning.
# Parameters
- Learning Rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
second (β2) momentum estimate.
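For optimisers whose parameters "don't need tuning", the defaults are usually used directly in a training loop. The model, loss, and data below are made up purely for illustration; only the general `train!` pattern is the point:

```julia
using Flux

model = Dense(2, 1)                    # hypothetical tiny model
loss(x, y) = Flux.mse(model(x), y)
data = [(rand(2, 8), rand(1, 8))]      # one fake batch

opt = AMSGrad()                        # default η and β
Flux.train!(loss, Flux.params(model), data, opt)
```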
@@ -358,7 +366,8 @@ end
Parameters don't need tuning.
# Parameters
- Learning rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
second (β2) momentum estimate.
@@ -394,7 +403,8 @@ end
weight decay regularization.
# Parameters
- Learning rate (`η`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
second (β2) momentum estimate.
- `decay`: Decay applied to weights during optimisation.
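A hedged construction example for the extra `decay` argument, assuming the positional order `ADAMW(η, β, decay)` suggested by the parameter list; the decay value itself is an arbitrary choice:

```julia
using Flux

# η = 0.001, (β1, β2) = (0.9, 0.999), weight decay = 5e-4
opt = ADAMW(0.001, (0.9, 0.999), 5e-4)
```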
@@ -464,16 +474,17 @@ function apply!(o::InvDecay, x, Δ)
end
"""
ExpDecay(eta = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
ExpDecay(η = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
Discount the learning rate `eta` by the factor `decay` every `decay_step` steps till
Discount the learning rate `η` by the factor `decay` every `decay_step` steps till
a minimum of `clip`.
# Parameters
- Learning rate (`eta`)
- Learning rate (`η`): Amount by which gradients are discounted before updating
the weights.
- `decay`: Factor by which the learning rate is discounted.
- `decay_step`: Schedule decay operations by setting number of steps between two decay
operations.
- `decay_step`: Schedule decay operations by setting the number of steps between
two decay operations.
- `clip`: Minimum value of learning rate.
# Examples
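Putting the four arguments together: with the defaults shown, the learning rate starts at `η = 0.001`, is discounted by `decay = 0.1` every `decay_step = 1000` updates, and never drops below `clip = 1e-4`. Below is a sketch of the usual composition with another optimiser; the pairing with `ADAM` is just an example choice:

```julia
using Flux
using Flux.Optimise: Optimiser, ExpDecay

# Schedule the step size of an ADAM optimiser: 0.001 is cut to 1e-4 after
# 1000 updates, then held at the `clip` floor of 1e-4.
opt = Optimiser(ExpDecay(0.001, 0.1, 1000, 1e-4), ADAM())
```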