From 296ab58c7e671837c8cb2e2b94ef2b6d1adaa7e9 Mon Sep 17 00:00:00 2001 From: autodocs Date: Thu, 10 Jan 2019 11:20:10 +0000 Subject: [PATCH] build based on f0d5624 --- latest/models/layers.html | 14 +++++++------- latest/search_index.js | 10 +++++----- latest/training/optimisers.html | 16 +++++++--------- latest/training/training.html | 9 +++++---- 4 files changed, 24 insertions(+), 25 deletions(-) diff --git a/latest/models/layers.html b/latest/models/layers.html index 92fcf785..28ce5a2c 100644 --- a/latest/models/layers.html +++ b/latest/models/layers.html @@ -11,28 +11,28 @@ m(5) == 26 m = Chain(Dense(10, 5), Dense(5, 2)) x = rand(10) -m(x) == m[2](m[1](x))

Chain also supports indexing and slicing, e.g. m[2] or m[1:end-1]. m[1:3](x) will calculate the output of the first three layers.

source
Flux.DenseType.
Dense(in::Integer, out::Integer, σ = identity)

Creates a traditional Dense layer with parameters W and b.

y = σ.(W * x .+ b)

The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The out y will be a vector or batch of length out.

julia> d = Dense(5, 2)
+m(x) == m[2](m[1](x))

Chain also supports indexing and slicing, e.g. m[2] or m[1:end-1]. m[1:3](x) will calculate the output of the first three layers.

source
Flux.DenseType.
Dense(in::Integer, out::Integer, σ = identity)

Creates a traditional Dense layer with parameters W and b.

y = σ.(W * x .+ b)

The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The output y will be a vector of length out, or a batch represented as an out × N matrix.

julia> d = Dense(5, 2)
 Dense(5, 2)
 
 julia> d(rand(5))
 Tracked 2-element Array{Float64,1}:
   0.00257447
-  -0.00449443
source
Flux.ConvType.
Conv(size, in=>out)
-Conv(size, in=>out, relu)

Standard convolutional layer. size should be a tuple like (2, 2). in and out specify the number of input and output channels respectively.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad, stride and dilation.

source
Flux.MaxPoolType.
MaxPool(k)

Max pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source
Flux.MeanPoolType.
MeanPool(k)

Mean pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source

Additional Convolution Layers

DepthwiseConv(size, in)
+  -0.00449443
source
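
The Dense formula y = σ.(W * x .+ b) can be sketched in plain Julia without Flux. This is a minimal illustration, not Flux's implementation: the name SketchDense and the random initialisation are assumptions made for the example.

```julia
# Hypothetical stand-in for a dense layer; not Flux's own type.
struct SketchDense{M,V,F}
    W::M
    b::V
    σ::F
end

# Small random weights and a zero bias: an arbitrary choice for this sketch.
SketchDense(in::Integer, out::Integer, σ = identity) =
    SketchDense(0.1 .* randn(out, in), zeros(out), σ)

# y = σ.(W * x .+ b); x may be a vector of length `in` or an in × N matrix.
(d::SketchDense)(x) = d.σ.(d.W * x .+ d.b)

d = SketchDense(5, 2)
size(d(rand(5)))      # (2,)   a single input vector
size(d(rand(5, 8)))   # (2, 8) a batch of 8 inputs
```

Broadcasting the bias with .+ is what lets the same definition handle both a single vector and a batch.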
Flux.ConvType.
Conv(size, in=>out)
+Conv(size, in=>out, relu)

Standard convolutional layer. size should be a tuple like (2, 2). in and out specify the number of input and output channels respectively.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad, stride and dilation.

source
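
The WHCN layout described above can be checked with plain arrays. The sketch below only builds arrays of the right shape; it does not call Conv, and the variable names are illustrative.

```julia
# One 100×100 RGB image: width × height × channels.
image = rand(Float32, 100, 100, 3)

# A batch of 50 such images: width × height × channels × batch.
batch = rand(Float32, 100, 100, 3, 50)

# A single image can be given a trailing batch dimension of 1.
single = reshape(image, size(image)..., 1)

size(batch)    # (100, 100, 3, 50)
size(single)   # (100, 100, 3, 1)

# Concatenating images along the fourth dimension builds a batch.
images = [rand(Float32, 100, 100, 3) for _ in 1:4]
stacked = cat(images...; dims = 4)
size(stacked)  # (100, 100, 3, 4)
```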
Flux.MaxPoolType.
MaxPool(k)

Max pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source
Flux.MeanPoolType.
MeanPool(k)

Mean pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source
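
To illustrate what max and mean pooling compute, here is a sketch on a single matrix in plain Julia. The helper pool2d is hypothetical; it assumes non-overlapping k×k windows (stride equal to the window size) and no padding, which may differ from the defaults of MaxPool and MeanPool.

```julia
using Statistics: mean

# Hypothetical helper: reduce each non-overlapping k×k window of `x` with `f`.
# Trailing rows or columns that do not fill a whole window are dropped.
function pool2d(f, x::AbstractMatrix, k::Integer)
    h, w = size(x) .÷ k
    return [f(view(x, (i-1)*k+1:i*k, (j-1)*k+1:j*k)) for i in 1:h, j in 1:w]
end

x = [ 1  2  5  6;
      3  4  7  8;
      9 10 13 14;
     11 12 15 16]

pool2d(maximum, x, 2)  # [4 8; 12 16]
pool2d(mean, x, 2)     # [2.5 6.5; 10.5 14.5]
```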

Additional Convolution Layers

DepthwiseConv(size, in)
 DepthwiseConv(size, in=>mul)
-DepthwiseConv(size, in=>mul, relu)

Depthwise convolutional layer. size should be a tuple like (2, 2). in and mul specify the number of input channels and channel multiplier respectively. In case the mul is not specified it is taken as 1.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad and stride.

source

Recurrent Layers

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction.
RNN(in::Integer, out::Integer, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

source
Flux.LSTMFunction.
LSTM(in::Integer, out::Integer)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.GRUFunction.
GRU(in::Integer, out::Integer)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.RecurType.
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs.

accum(h, x) = (h+x, x)
+DepthwiseConv(size, in=>mul, relu)

Depthwise convolutional layer. size should be a tuple like (2, 2). in and mul specify the number of input channels and channel multiplier respectively. In case the mul is not specified it is taken as 1.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad and stride.

source

Recurrent Layers

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction.
RNN(in::Integer, out::Integer, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

source
Flux.LSTMFunction.
LSTM(in::Integer, out::Integer)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.GRUFunction.
GRU(in::Integer, out::Integer)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.RecurType.
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs.

accum(h, x) = (h+x, x)
 rnn = Flux.Recur(accum, 0)
 rnn(2) # 2
 rnn(3) # 3
 rnn.state # 5
 rnn.(1:10) # apply to a sequence
-rnn.state # 60
source

Activation Functions

Non-linearities that go between layers of your model. Most of these functions are defined in NNlib but are available by default in Flux.

Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on.

NNlib.σFunction.
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function.

NNlib.reluFunction.
relu(x) = max(0, x)

Rectified Linear Unit activation function.

NNlib.leakyreluFunction.
leakyrelu(x) = max(0.01x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

NNlib.eluFunction.
elu(x, α = 1) =
+rnn.state # 60
source
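
The way Recur manages hidden state can be sketched with a small callable struct. SketchRecur is a hypothetical name and this is not Flux's implementation; it only follows the h, y = cell(h, x...) contract described above.

```julia
# Hypothetical stateful wrapper around a cell of the form h, y = cell(h, x...).
mutable struct SketchRecur{F,S}
    cell::F
    state::S
end

# Each call feeds the stored state to the cell, keeps the new state,
# and returns the cell's output.
function (r::SketchRecur)(xs...)
    r.state, y = r.cell(r.state, xs...)
    return y
end

accum(h, x) = (h + x, x)

rnn = SketchRecur(accum, 0)
rnn(2)       # 2
rnn(3)       # 3
rnn.state    # 5
rnn.(1:10)   # apply to a sequence
rnn.state    # 60
```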

Activation Functions

Non-linearities that go between layers of your model. Most of these functions are defined in NNlib but are available by default in Flux.

Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on.

NNlib.σFunction.
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function.

NNlib.reluFunction.
relu(x) = max(0, x)

Rectified Linear Unit activation function.

NNlib.leakyreluFunction.
leakyrelu(x) = max(0.01x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

NNlib.eluFunction.
elu(x, α = 1) =
   x > 0 ? x : α * (exp(x) - 1)

Exponential Linear Unit activation function. See Fast and Accurate Deep Network Learning by Exponential Linear Units. You can also specify the coefficient explicitly, e.g. elu(x, 1).

NNlib.swishFunction.
swish(x) = x * σ(x)

Self-gated activation function. See Swish: a Self-Gated Activation Function.
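
The definitions above can be tried out directly in plain Julia. The functions below carry a _sketch suffix to make clear they are illustrative copies of the formulas, not the NNlib functions themselves.

```julia
# Scalar activation functions, written out from the formulas above.
sigmoid_sketch(x) = 1 / (1 + exp(-x))
relu_sketch(x) = max(0, x)
leakyrelu_sketch(x, a = 0.01) = max(a * x, x)
elu_sketch(x, α = 1) = x > 0 ? x : α * (exp(x) - 1)
swish_sketch(x) = x * sigmoid_sketch(x)

# Activations act on scalars, so broadcast with a dot to apply them to arrays.
xs = [-2.0, 0.0, 3.0]
relu_sketch.(xs)       # [0.0, 0.0, 3.0]
leakyrelu_sketch.(xs)  # [-0.02, 0.0, 3.0]
sigmoid_sketch(0.0)    # 0.5
```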

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting.

Flux.testmode!Function.
testmode!(m)
-testmode!(m, false)

Put layers like Dropout and BatchNorm into testing mode (or back to training mode with false).

source
Flux.BatchNormType.
BatchNorm(channels::Integer, σ = identity;
+testmode!(m, false)

Put layers like Dropout and BatchNorm into testing mode (or back to training mode with false).

source
Flux.BatchNormType.
BatchNorm(channels::Integer, σ = identity;
           initβ = zeros, initγ = ones,
           ϵ = 1e-8, momentum = .1)

Batch Normalization layer. The channels input should be the size of the channel dimension in your data (see below).

Given an array with N dimensions, call the N-1th the channel dimension. (For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.)

BatchNorm computes the mean and variance for each W×H×1×N slice and shifts them to have a new mean and variance (corresponding to the learnable, per-channel bias and scale parameters).

See Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

Example:

m = Chain(
   Dense(28^2, 64),
   BatchNorm(64, relu),
   Dense(64, 10),
   BatchNorm(10),
-  softmax)
source
Flux.DropoutType.
Dropout(p)

A Dropout layer. For each input, either sets that input to 0 (with probability p) or scales it by 1/(1-p). This is used as a regularisation, i.e. it reduces overfitting during training.

Does nothing to the input once in testmode!.

source
Flux.LayerNormType.
LayerNorm(h::Integer)

A normalisation layer designed to be used with recurrent hidden states of size h. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.

source
+  softmax)
source
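
The normalisation step BatchNorm performs can be sketched for a batch of feature vectors, where the channel dimension is the first dimension of a features × N matrix. This is a simplified illustration under stated assumptions: batchnorm_sketch is a hypothetical helper, it uses batch statistics only, and it ignores the running averages controlled by momentum.

```julia
using Statistics: mean, var

# Hypothetical helper: normalise each row (channel) of a features × N batch,
# then apply a per-channel scale γ and bias β.
function batchnorm_sketch(x::AbstractMatrix, γ, β; ϵ = 1e-8)
    μ = mean(x, dims = 2)                       # per-channel mean over the batch
    σ² = var(x, dims = 2, corrected = false)    # per-channel variance over the batch
    x̂ = (x .- μ) ./ sqrt.(σ² .+ ϵ)
    return γ .* x̂ .+ β
end

x = [1.0 2.0 3.0; 10.0 20.0 30.0]
y = batchnorm_sketch(x, ones(2), zeros(2))
mean(y, dims = 2)   # approximately zero for each channel
```

With γ = 1 and β = 0 each channel ends up with roughly zero mean and unit variance, regardless of its original scale.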
Flux.DropoutType.
Dropout(p)

A Dropout layer. For each input, either sets that input to 0 (with probability p) or scales it by 1/(1-p). This is used as a regularisation, i.e. it reduces overfitting during training.

Does nothing to the input once in testmode!.

source
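
The Dropout rule above (zero with probability p, otherwise scale by 1/(1-p)) can be sketched as follows. The helper dropout_sketch and its active keyword are assumptions made for this example; Flux switches behaviour through testmode! instead.

```julia
# Hypothetical helper: zero each entry with probability p and scale the
# survivors by 1/(1-p). With active = false the input is returned unchanged.
function dropout_sketch(x, p; active = true)
    active || return x
    mask = rand(size(x)...) .>= p   # true with probability 1 - p
    return x .* mask ./ (1 - p)
end

x = ones(8)
dropout_sketch(x, 0.5)                  # each entry is either 0.0 or 2.0
dropout_sketch(x, 0.5; active = false)  # returns x unchanged
```

Scaling the surviving entries keeps the expected value of each input the same during training and testing.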
Flux.LayerNormType.
LayerNorm(h::Integer)

A normalisation layer designed to be used with recurrent hidden states of size h. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.

source
diff --git a/latest/search_index.js b/latest/search_index.js index ed8546e2..5ac89835 100644 --- a/latest/search_index.js +++ b/latest/search_index.js @@ -365,7 +365,7 @@ var documenterSearchIndex = {"docs": [ "page": "Optimisers", "title": "Optimisers", "category": "section", - "text": "Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.using Flux.Tracker\n\nW = param(rand(2, 5))\nb = param(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\n\nparams = Params([W, b])\ngrads = Tracker.gradient(() -> loss(x, y), params)We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here\'s one way to do that:using Flux.Tracker: grad, update!\n\nfunction sgd()\n η = 0.1 # Learning Rate\n for p in (W, b)\n update!(p, -η * grads[p])\n end\nendIf we call sgd, the parameters W and b will change and our loss should go down.There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. 
In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.In this case, getting the variables is trivial, but you can imagine it\'d be more of a pain with some complex stack of layers.m = Chain(\n Dense(10, 5, σ),\n Dense(5, 2), softmax)Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.For the update step, there\'s nothing whatsoever wrong with writing the loop above – it\'ll work just fine – but Flux provides various optimisers that make it more convenient.opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1\n\nopt() # Carry out the update, modifying `W` and `b`.An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data." + "text": "Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.using Flux.Tracker\n\nW = param(rand(2, 5))\nb = param(rand(2))\n\npredict(x) = W*x .+ b\nloss(x, y) = sum((predict(x) .- y).^2)\n\nx, y = rand(5), rand(2) # Dummy data\nl = loss(x, y) # ~ 3\n\nparams = Params([W, b])\ngrads = Tracker.gradient(() -> loss(x, y), params)We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here\'s one way to do that:using Flux.Tracker: grad, update!\n\nη = 0.1 # Learning Rate\nfor p in (W, b)\n update!(p, -η * grads[p])\nendRunning this will alter the parameters W and b and our loss should go down. Flux provides a more general way to do optimiser updates like this.opt = Descent(0.1) # Gradient descent with learning rate 0.1\n\nfor p in (W, b)\n update!(opt, p, -η * grads[p])\nendAn optimiser update! 
accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass opt to our training loop, which will update all parameters of the model in a loop. However, we can now easily replace Descent with a more advanced optimiser such as ADAM." }, { @@ -373,7 +373,7 @@ var documenterSearchIndex = {"docs": [ "page": "Optimisers", "title": "Optimiser Reference", "category": "section", - "text": "All optimisers return a function that, when called, will update the parameters passed to it.SGD\nMomentum\nNesterov\nADAM" + "text": "All optimisers return an object that, when passed to train!, will update the parameters passed to it.SGD\nMomentum\nNesterov\nADAM" }, { @@ -389,7 +389,7 @@ var documenterSearchIndex = {"docs": [ "page": "Training", "title": "Training", "category": "section", - "text": "To actually train a model we need three things:A objective function, that evaluates how well a model is doing given some input data.\nA collection of data points that will be provided to the objective function.\nAn optimiser that will update the model parameters appropriately.With these we can call Flux.train!:Flux.train!(objective, data, opt)There are plenty of examples in the model zoo." + "text": "To actually train a model we need three things:A objective function, that evaluates how well a model is doing given some input data.\nA collection of data points that will be provided to the objective function.\nAn optimiser that will update the model parameters appropriately.With these we can call Flux.train!:Flux.train!(objective, params, data, opt)There are plenty of examples in the model zoo." }, { @@ -397,7 +397,7 @@ var documenterSearchIndex = {"docs": [ "page": "Training", "title": "Loss Functions", "category": "section", - "text": "The objective function must return a number representing how far the model is from its target – the loss of the model. The loss function that we defined in basics will work as an objective. 
We can also define an objective in terms of some model:m = Chain(\n Dense(784, 32, σ),\n Dense(32, 10), softmax)\n\nloss(x, y) = Flux.mse(m(x), y)\n\n# later\nFlux.train!(loss, data, opt)The objective will almost always be defined in terms of some cost function that measures the distance of the prediction m(x) from the target y. Flux has several of these built in, like mse for mean squared error or crossentropy for cross entropy loss, but you can calculate it however you want." + "text": "The objective function must return a number representing how far the model is from its target – the loss of the model. The loss function that we defined in basics will work as an objective. We can also define an objective in terms of some model:m = Chain(\n Dense(784, 32, σ),\n Dense(32, 10), softmax)\n\nloss(x, y) = Flux.mse(m(x), y)\nps = Flux.params(m)\n\n# later\nFlux.train!(loss, ps, data, opt)The objective will almost always be defined in terms of some cost function that measures the distance of the prediction m(x) from the target y. Flux has several of these built in, like mse for mean squared error or crossentropy for cross entropy loss, but you can calculate it however you want." }, { @@ -413,7 +413,7 @@ var documenterSearchIndex = {"docs": [ "page": "Training", "title": "Callbacks", "category": "section", - "text": "train! takes an additional argument, cb, that\'s used for callbacks so that you can observe the training process. For example:train!(objective, data, opt, cb = () -> println(\"training\"))Callbacks are called for every batch of training data. You can slow this down using Flux.throttle(f, timeout) which prevents f from being called more than once every timeout seconds.A more typical callback might look like this:test_x, test_y = # ... create single batch of test data ...\nevalcb() = @show(loss(test_x, test_y))\n\nFlux.train!(objective, data, opt,\n cb = throttle(evalcb, 5))" + "text": "train! 
takes an additional argument, cb, that\'s used for callbacks so that you can observe the training process. For example:train!(objective, ps, data, opt, cb = () -> println(\"training\"))Callbacks are called for every batch of training data. You can slow this down using Flux.throttle(f, timeout) which prevents f from being called more than once every timeout seconds.A more typical callback might look like this:test_x, test_y = # ... create single batch of test data ...\nevalcb() = @show(loss(test_x, test_y))\n\nFlux.train!(objective, ps, data, opt,\n cb = throttle(evalcb, 5))" }, { diff --git a/latest/training/optimisers.html b/latest/training/optimisers.html index c08db2d7..72372146 100644 --- a/latest/training/optimisers.html +++ b/latest/training/optimisers.html @@ -20,16 +20,14 @@ l = loss(x, y) # ~ 3 params = Params([W, b]) grads = Tracker.gradient(() -> loss(x, y), params)

We want to update each parameter, using the gradient, in order to improve (reduce) the loss. Here's one way to do that:

using Flux.Tracker: grad, update!
 
-function sgd()
-  η = 0.1 # Learning Rate
-  for p in (W, b)
-    update!(p, -η * grads[p])
-  end
-end

If we call sgd, the parameters W and b will change and our loss should go down.

There are two pieces here: one is that we need a list of trainable parameters for the model ([W, b] in this case), and the other is the update step. In this case the update is simply gradient descent (x .-= η .* Δ), but we might choose to do something more advanced, like adding momentum.

In this case, getting the variables is trivial, but you can imagine it'd be more of a pain with some complex stack of layers.

m = Chain(
-  Dense(10, 5, σ),
-  Dense(5, 2), softmax)

Instead of having to write [m[1].W, m[1].b, ...], Flux provides a params function params(m) that returns a list of all parameters in the model for you.

For the update step, there's nothing whatsoever wrong with writing the loop above – it'll work just fine – but Flux provides various optimisers that make it more convenient.

opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1
+η = 0.1 # Learning Rate
+for p in (W, b)
+  update!(p, -η * grads[p])
+end

Running this will alter the parameters W and b and our loss should go down. Flux provides a more general way to do optimiser updates like this.

opt = Descent(0.1) # Gradient descent with learning rate 0.1
 
-opt() # Carry out the update, modifying `W` and `b`.

An optimiser takes a parameter list and returns a function that does the same thing as update above. We can pass either opt or update to our training loop, which will then run the optimiser after every mini-batch of data.

Optimiser Reference

All optimisers return a function that, when called, will update the parameters passed to it.

SGD
+for p in (W, b)
+  update!(opt, p, grads[p])
+end

An optimiser's update! accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass opt to our training loop, which will update every parameter of the model in turn. Because the update rule now lives in opt, we can easily replace Descent with a more advanced optimiser such as ADAM.
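
The idea of an optimiser object that carries the update rule can be sketched in plain Julia. The types SketchDescent and SketchMomentum and the function step! are hypothetical names for this illustration; they are not Flux's Descent, Momentum or update!.

```julia
# Hypothetical optimiser holding only a learning rate.
struct SketchDescent
    eta::Float64
end

# Plain gradient descent: p ← p - η * g, applied in place.
step!(opt::SketchDescent, p, g) = (p .-= opt.eta .* g; p)

# Hypothetical momentum optimiser; keeps one velocity array per parameter.
struct SketchMomentum
    eta::Float64
    rho::Float64
    velocity::IdDict{Any,Any}
end
SketchMomentum(eta, rho = 0.9) = SketchMomentum(eta, rho, IdDict{Any,Any}())

function step!(opt::SketchMomentum, p, g)
    v = get!(() -> zero(p), opt.velocity, p)  # velocity starts at zero
    v .= opt.rho .* v .- opt.eta .* g
    p .+= v
    return p
end

# Minimise sum(w .^ 2), whose gradient is 2w.
w = [1.0, -2.0]
opt = SketchDescent(0.1)
for _ in 1:100
    step!(opt, w, 2 .* w)
end
w   # close to [0.0, 0.0]
```

Swapping SketchDescent(0.1) for SketchMomentum(0.1) changes the rule without touching the loop, which is the point of separating the optimiser from the update step.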

Optimiser Reference

All optimisers return an object that, when passed to train!, will update the parameters passed to it.

SGD
 Momentum
 Nesterov
 ADAM
diff --git a/latest/training/training.html b/latest/training/training.html index 3e64dad5..6283a7c8 100644 --- a/latest/training/training.html +++ b/latest/training/training.html @@ -6,14 +6,15 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Training

Training

To actually train a model we need three things:

With these we can call Flux.train!:

Flux.train!(objective, data, opt)

There are plenty of examples in the model zoo.

Loss Functions

The objective function must return a number representing how far the model is from its target – the loss of the model. The loss function that we defined in basics will work as an objective. We can also define an objective in terms of some model:

m = Chain(
+

Training

Training

To actually train a model we need three things:

  • An objective function that evaluates how well a model is doing given some input data.
  • A collection of data points that will be provided to the objective function.
  • An optimiser that will update the model parameters appropriately.

With these we can call Flux.train!:

Flux.train!(objective, params, data, opt)

There are plenty of examples in the model zoo.
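
To make the roles of the four arguments concrete, here is a rough sketch of a training loop in plain Julia. It is not Flux.train!: the function train_sketch! is hypothetical, and it takes a caller-supplied gradient function and update function instead of relying on automatic differentiation and an optimiser object.

```julia
# Hypothetical training loop. For each data point it asks `grad_fn` for one
# gradient per parameter, applies `update_fn!`, then runs the callback.
function train_sketch!(grad_fn, ps, data, update_fn!; cb = () -> nothing)
    for d in data
        gs = grad_fn(ps, d...)
        for (p, g) in zip(ps, gs)
            update_fn!(p, g)
        end
        cb()
    end
end

# Fit w in y ≈ w * x with squared error (w*x - y)^2; the gradient is written by hand.
ps = [[0.0]]                                          # a single parameter array
grad_fn(ps, x, y) = [[2 * (ps[1][1] * x - y) * x]]
update_fn!(p, g) = (p .-= 0.1 .* g)
data = [(1.0, 3.0) for _ in 1:50]

train_sketch!(grad_fn, ps, data, update_fn!)
ps[1][1]   # close to 3.0
```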

Loss Functions

The objective function must return a number representing how far the model is from its target – the loss of the model. The loss function that we defined in basics will work as an objective. We can also define an objective in terms of some model:

m = Chain(
   Dense(784, 32, σ),
   Dense(32, 10), softmax)
 
 loss(x, y) = Flux.mse(m(x), y)
+ps = Flux.params(m)
 
 # later
-Flux.train!(loss, data, opt)

The objective will almost always be defined in terms of some cost function that measures the distance of the prediction m(x) from the target y. Flux has several of these built in, like mse for mean squared error or crossentropy for cross entropy loss, but you can calculate it however you want.

Datasets

The data argument provides a collection of data to train with (usually a set of inputs x and target outputs y). For example, here's a dummy data set with only one data point:

x = rand(784)
+Flux.train!(loss, ps, data, opt)

The objective will almost always be defined in terms of some cost function that measures the distance of the prediction m(x) from the target y. Flux has several of these built in, like mse for mean squared error or crossentropy for cross entropy loss, but you can calculate it however you want.
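
As an illustration of what such cost functions compute, here are plain-Julia sketches of mean squared error and cross entropy for a single prediction vector. The _sketch names are hypothetical, and the exact normalisation used by Flux's mse and crossentropy may differ.

```julia
# Mean squared error: average squared difference between prediction and target.
mse_sketch(ŷ, y) = sum((ŷ .- y) .^ 2) / length(y)

# Cross entropy between a predicted distribution ŷ and a target distribution y.
crossentropy_sketch(ŷ, y) = -sum(y .* log.(ŷ))

ŷ = [0.7, 0.2, 0.1]   # predicted probabilities
y = [1.0, 0.0, 0.0]   # one-hot target

mse_sketch(ŷ, y)           # about 0.0467
crossentropy_sketch(ŷ, y)  # equals -log(0.7)
```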

Datasets

The data argument provides a collection of data to train with (usually a set of inputs x and target outputs y). For example, here's a dummy data set with only one data point:

x = rand(784)
 y = rand(10)
 data = [(x, y)]

Flux.train! will call loss(x, y), calculate gradients, update the weights and then move on to the next data point if there is one. We can train the model on the same data three times:

data = [(x, y), (x, y), (x, y)]
 # Or equivalently
@@ -28,8 +29,8 @@ INFO: Epoch 2
 hello
 
 julia> @epochs 2 Flux.train!(...)
-# Train for two epochs

Callbacks

train! takes an additional argument, cb, that's used for callbacks so that you can observe the training process. For example:

train!(objective, data, opt, cb = () -> println("training"))

Callbacks are called for every batch of training data. You can slow this down using Flux.throttle(f, timeout) which prevents f from being called more than once every timeout seconds.

A more typical callback might look like this:

test_x, test_y = # ... create single batch of test data ...
+# Train for two epochs

Callbacks

train! takes an additional argument, cb, that's used for callbacks so that you can observe the training process. For example:

train!(objective, ps, data, opt, cb = () -> println("training"))

Callbacks are called for every batch of training data. You can slow this down using Flux.throttle(f, timeout) which prevents f from being called more than once every timeout seconds.

A more typical callback might look like this:

test_x, test_y = # ... create single batch of test data ...
 evalcb() = @show(loss(test_x, test_y))
 
-Flux.train!(objective, data, opt,
+Flux.train!(objective, ps, data, opt,
             cb = throttle(evalcb, 5))
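
The effect of throttling a callback can be sketched with a closure that remembers when it last ran. throttle_sketch is a hypothetical helper that follows the behaviour described above (at most one call per timeout seconds); it is not Flux.throttle and ignores any further options that function may accept.

```julia
# Hypothetical helper: wrap `f` so it runs at most once every `timeout` seconds.
# Calls that arrive too early are skipped and return nothing.
function throttle_sketch(f, timeout)
    last_call = -Inf
    return function (args...)
        t = time()
        if t - last_call >= timeout
            last_call = t
            return f(args...)
        end
        return nothing
    end
end

counter = Ref(0)
tick = throttle_sketch(() -> counter[] += 1, 5)
tick(); tick(); tick()
counter[]   # 1, since the later calls fall inside the 5-second window
```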