Improve parameter lists in optimisers.jl
commit a614983e0b, parent aaa0a82b74
@@ -12,8 +12,8 @@ Classic gradient descent optimiser with learning rate `η`.
 For each parameter `p` and its gradient `δp`, this runs `p -= η*δp`
 
 # Parameters
-- Learning rate (`η`): Amount by which the gradients are discounted before updating
+- Learning rate (`η`): Amount by which gradients are discounted before updating
   the weights.
 
 # Examples
 ```julia
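
The rule quoted in this hunk, `p -= η*δp`, is the entire Descent update. As a standalone illustration in plain Julia on a toy quadratic loss (not Flux's internal `apply!` code), a single step looks like this:

```julia
# One plain gradient-descent step on a toy quadratic loss.
# Illustration only; Flux applies the same rule via apply!/update!.
η  = 0.1            # learning rate
p  = [2.0, -3.0]    # parameters
δp = 2 .* p         # gradient of the toy loss sum(abs2, p)

p .-= η .* δp       # the update the docstring describes: p -= η*δp
println(p)          # [1.6, -2.4], a small move against the gradient
```
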
@@ -24,7 +24,7 @@ opt = Descent(0.3)
 ps = params(model)
 
 gs = gradient(ps) do
     loss(x, y)
 end
 
 Flux.Optimise.update!(opt, ps, gs)
@@ -46,10 +46,10 @@ end
 Gradient descent optimizer with learning rate `η` and momentum `ρ`.
 
 # Parameters
-- Learning rate (`η`): Amount by which gradients are discounted before updating the
-  weights.
-- Momentum (`ρ`): Controls the acceleration of gradient descent in the relevant direction
-  and therefore the dampening of oscillations.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
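
For the momentum term described above, a rough sketch of the common heavy-ball formulation (`v = ρ*v - η*δp; p += v`) shows how the velocity buffer accumulates past gradients; this is an illustration under that assumption, not Flux's exact `apply!` code:

```julia
# Rough sketch of classical momentum on a toy quadratic loss.
# Assumes the common heavy-ball form; not Flux's exact implementation.
η, ρ = 0.01, 0.9
p = [2.0, -3.0]
v = zero(p)                  # one velocity buffer per parameter array

for step in 1:200
    δp = 2 .* p              # gradient of sum(abs2, p)
    @. v = ρ * v - η * δp    # velocity: a decaying sum of past gradients
    @. p = p + v             # large ρ accelerates and smooths oscillations
end
println(p)                   # close to the minimiser at zero
```
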
@@ -79,9 +79,10 @@ end
 Gradient descent optimizer with learning rate `η` and Nesterov momentum `ρ`.
 
 # Parameters
-- Learning rate (`η`): Amount by which the gradients are discounted before updating the
-  weights.
-- Nesterov momentum (`ρ`): The amount of Nesterov momentum to be applied.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Nesterov momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
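
The Nesterov variant differs from plain momentum mainly in where the gradient is evaluated. A rough sketch using the common look-ahead formulation, assumed here for illustration rather than taken from Flux's implementation:

```julia
# Rough sketch of Nesterov momentum: evaluate the gradient at the
# look-ahead point p + ρ*v rather than at p. Illustration only.
η, ρ = 0.01, 0.9
p = [2.0, -3.0]
v = zero(p)

grad(q) = 2 .* q             # gradient of the toy loss sum(abs2, q)

for step in 1:200
    δp = grad(p .+ ρ .* v)   # gradient after a provisional momentum step
    @. v = ρ * v - η * δp
    @. p = p + v
end
println(p)
```
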
@@ -115,8 +116,10 @@ algorithm. Often a good choice for recurrent networks. Parameters other than learning rate
 generally don't need tuning.
 
 # Parameters
-- Learning rate (`η`)
-- Momentum (`ρ`)
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Momentum (`ρ`): Controls the acceleration of gradient descent in the
+  prominent direction, in effect dampening oscillations.
 
 # Examples
 ```julia
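
In RMSProp the `ρ` parameter decays a running average of squared gradients, which then rescales every step. A rough standalone sketch of that idea, not Flux's exact implementation:

```julia
# Rough sketch of RMSProp: divide each step by the root of a running
# average of squared gradients. Illustration only.
η, ρ, ϵ = 0.01, 0.9, 1e-8
p   = [2.0, -3.0]
acc = zero(p)                              # running mean of squared gradients

for step in 1:300
    δp = 2 .* p
    @. acc = ρ * acc + (1 - ρ) * δp^2      # ρ sets the averaging window
    @. p   = p - η * δp / (sqrt(acc) + ϵ)  # per-parameter adaptive step size
end
println(p)
```
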
@@ -146,9 +149,10 @@ end
 [ADAM](https://arxiv.org/abs/1412.6980v8) optimiser.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
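
The `β::Tuple` entry refers to two running estimates: β1 decays a mean of gradients, β2 a mean of squared gradients, and both are bias-corrected before use. A rough standalone sketch of those updates, not Flux's exact code:

```julia
# Rough sketch of ADAM's two moment estimates and their bias correction.
# Illustration only; values follow the usual defaults.
η, ϵ = 0.001, 1e-8
β1, β2 = 0.9, 0.999              # the β::Tuple from the docstring
p = [2.0, -3.0]
m = zero(p)                      # first moment: decayed mean of gradients
v = zero(p)                      # second moment: decayed mean of squared gradients

for t in 1:1000
    δp = 2 .* p
    @. m = β1 * m + (1 - β1) * δp
    @. v = β2 * v + (1 - β2) * δp^2
    m̂ = m ./ (1 - β1^t)          # bias-corrected estimates
    v̂ = v ./ (1 - β2^t)
    @. p = p - η * m̂ / (sqrt(v̂) + ϵ)
end
println(p)
```
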
@@ -181,9 +185,10 @@ end
 [Rectified ADAM](https://arxiv.org/pdf/1908.03265v1.pdf) optimizer.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -223,9 +228,10 @@ end
 [AdaMax](https://arxiv.org/abs/1412.6980v9) is a variant of ADAM based on the ∞-norm.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
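
The ∞-norm mentioned in this hunk means AdaMax replaces ADAM's decayed mean of squared gradients with a decayed maximum of absolute gradients. A rough standalone sketch of that recursion, not Flux's exact code; the small ϵ here only guards against division by zero:

```julia
# Rough sketch of AdaMax: the second moment becomes an exponentially
# weighted infinity norm. Illustration only.
η, ϵ = 0.002, 1e-8
β1, β2 = 0.9, 0.999
p = [2.0, -3.0]
m = zero(p)                          # first moment, as in ADAM
u = zero(p)                          # decayed maximum of |gradient|

for t in 1:1000
    δp = 2 .* p
    @. m = β1 * m + (1 - β1) * δp
    @. u = max(β2 * u, abs(δp))      # the ∞-norm recursion
    @. p = p - (η / (1 - β1^t)) * m / (u + ϵ)
end
println(p)
```
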
@@ -260,7 +266,8 @@ parameter specific learning rates based on how frequently it is updated.
 Parameters don't need tuning.
 
 # Parameters
-- Learning rate (`η`)
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
 
 # Examples
 ```julia
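
The hunk context above mentions parameter-specific learning rates based on how frequently a parameter is updated; the mechanism is an accumulator of squared gradients that only ever grows. A rough standalone sketch, not Flux's exact code:

```julia
# Rough sketch of ADAGrad: accumulate all squared gradients, so parameters
# with large or frequent gradients take smaller steps. Illustration only.
η, ϵ = 0.1, 1e-8
p   = [2.0, -3.0]
acc = zero(p)                    # sum of squared gradients, never decayed

for step in 1:300
    δp = 2 .* p
    @. acc = acc + δp^2
    @. p   = p - η * δp / (sqrt(acc) + ϵ)
end
println(p)
```
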
@@ -291,7 +298,7 @@ rate based on a window of past gradient updates.
 Parameters don't need tuning.
 
 # Parameters
-- Rho (`ρ`): Factor by which gradient is decayed at each time step.
+- Rho (`ρ`): Factor by which the gradient is decayed at each time step.
 
 # Examples
 ```julia
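
For ADADelta, `ρ` decays two windows at once, one over squared gradients and one over squared updates, and their ratio sets the step size, which is why no learning rate appears in the parameter list. A rough standalone sketch, not Flux's exact code:

```julia
# Rough sketch of ADADelta: the ratio of two decayed accumulators replaces
# a hand-tuned learning rate. Illustration only.
ρ, ϵ = 0.9, 1e-8
p    = [2.0, -3.0]
acc  = zero(p)                   # decayed average of squared gradients
Δacc = zero(p)                   # decayed average of squared updates

for step in 1:1000
    δp = 2 .* p
    @. acc = ρ * acc + (1 - ρ) * δp^2
    Δ = @. δp * sqrt(Δacc + ϵ) / sqrt(acc + ϵ)   # window-scaled step
    @. Δacc = ρ * Δacc + (1 - ρ) * Δ^2
    @. p = p - Δ
end
println(p)
```
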
@@ -323,9 +330,10 @@ The [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ) version of the ADAM
 optimiser. Parameters don't need tuning.
 
 # Parameters
-- Learning Rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -358,9 +366,10 @@ end
 Parameters don't need tuning.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
 
 # Examples
 ```julia
@@ -394,10 +403,11 @@ end
 weight decay regularization.
 
 # Parameters
-- Learning rate (`η`)
-- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
-  second (β2) momentum estimate.
-- `decay`: Decay applied to weights during optimisation.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- Decay of momentums (`β::Tuple`): Exponential decay for the first (β1) and the
+  second (β2) momentum estimate.
+- `decay`: Decay applied to weights during optimisation.
 
 # Examples
 ```julia
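
The extra `decay` entry is the weight-decay strength applied on top of the ADAM step. A rough standalone sketch of one common decoupled formulation; the exact form here is an assumption for illustration and not Flux's implementation:

```julia
# Rough sketch of ADAMW: an ADAM-style step plus a term that shrinks the
# weights directly. One common formulation, for illustration only.
η, ϵ, decay = 0.001, 1e-8, 0.01
β1, β2 = 0.9, 0.999
p = [2.0, -3.0]
m, v = zero(p), zero(p)

for t in 1:1000
    δp = 2 .* p
    @. m = β1 * m + (1 - β1) * δp
    @. v = β2 * v + (1 - β2) * δp^2
    Δadam = @. η * (m / (1 - β1^t)) / (sqrt(v / (1 - β2^t)) + ϵ)
    @. p = p - Δadam - η * decay * p   # decay acts on the weights themselves
end
println(p)
```
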
@@ -464,17 +474,18 @@ function apply!(o::InvDecay, x, Δ)
 end
 
 """
-    ExpDecay(eta = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
+    ExpDecay(η = 0.001, decay = 0.1, decay_step = 1000, clip = 1e-4)
 
-Discount the learning rate `eta` by the factor `decay` every `decay_step` steps till
+Discount the learning rate `η` by the factor `decay` every `decay_step` steps till
 a minimum of `clip`.
 
 # Parameters
-- Learning rate (`eta`)
-- `decay`: Factor by which the learning rate is discounted.
-- `decay_step`: Schedule decay operations by setting number of steps between two decay
-  operations.
-- `clip`: Minimum value of learning rate.
+- Learning rate (`η`): Amount by which gradients are discounted before updating
+  the weights.
+- `decay`: Factor by which the learning rate is discounted.
+- `decay_step`: Schedule decay operations by setting the number of steps between
+  two decay operations.
+- `clip`: Minimum value of learning rate.
 
 # Examples
 To apply exponential decay to an optimiser:
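
The four ExpDecay parameters describe a schedule: every `decay_step` optimiser steps the learning rate is multiplied by `decay`, but it never drops below `clip`. Flux applies this per step inside the optimiser; the sketch below only illustrates the resulting schedule as a function of the step count:

```julia
# Rough sketch of the ExpDecay schedule: multiply η by `decay` every
# `decay_step` steps, floored at `clip`. Illustration only.
η, decay, decay_step, clip = 0.001, 0.1, 1000, 1e-4

effective_eta(t) = max(η * decay^fld(t, decay_step), clip)

for t in (0, 999, 1000, 2000, 5000)
    println("step $t: effective η = ", effective_eta(t))   # clipped at 1e-4
end
```
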
@@ -510,7 +521,7 @@ end
 Decay weights by `wd`.
 
 # Parameters
 - Weight decay (`wd`)
 """
 mutable struct WeightDecay
   wd::Real
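
WeightDecay is typically chained with another optimiser and works as a gradient transform: adding `wd * p` to the gradient has the same effect as an L2 penalty on the weights. A rough standalone sketch of that effect, not Flux's exact `apply!` code:

```julia
# Rough sketch of weight decay as a gradient transform: the chained
# optimiser then sees the gradient of loss + (wd/2) * sum(abs2, p).
wd = 0.01
p  = [2.0, -3.0]
δp = [0.5, -0.25]              # some upstream gradient

δp_decayed = @. δp + wd * p    # penalised gradient handed to the optimiser
println(δp_decayed)            # [0.52, -0.28]
```
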