From 9a75fcd05657a4407bde66b2410134427d5916a7 Mon Sep 17 00:00:00 2001 From: zeptodoctor <44736852+zeptodoctor@users.noreply.github.com> Date: Tue, 19 Feb 2019 15:20:47 +0000 Subject: [PATCH] build based on ebf50f4 --- dev/community/index.html | 2 +- dev/data/onehot/index.html | 2 +- dev/gpu/index.html | 2 +- dev/index.html | 2 +- dev/internals/tracker/index.html | 4 +-- dev/models/basics/index.html | 2 +- dev/models/layers/index.html | 18 ++++++------- dev/models/recurrence/index.html | 2 +- dev/models/regularisation/index.html | 2 +- dev/performance/index.html | 20 ++++++++++++++ dev/saving/index.html | 4 +-- dev/search/index.html | 2 +- dev/search_index.js | 40 ++++++++++++++++++++++++++++ dev/training/optimisers/index.html | 4 +-- dev/training/training/index.html | 2 +- 15 files changed, 84 insertions(+), 24 deletions(-) create mode 100644 dev/performance/index.html diff --git a/dev/community/index.html b/dev/community/index.html index be6a32be..6c6e0709 100644 --- a/dev/community/index.html +++ b/dev/community/index.html @@ -6,4 +6,4 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Community

Community

All Flux users are welcome to join our community on the Julia forum, Slack (channel #machine-learning), or Flux's Gitter. If you have questions or issues, we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started.

+

Community

Community

All Flux users are welcome to join our community on the Julia forum, Slack (channel #machine-learning), or Flux's Gitter. If you have questions or issues, we'll try to help you out.

If you're interested in hacking on Flux, the source code is open and easy to understand – it's all just the same Julia code you work with normally. You might be interested in our intro issues to get started.

diff --git a/dev/data/onehot/index.html b/dev/data/onehot/index.html index c8a24999..aac2ab23 100644 --- a/dev/data/onehot/index.html +++ b/dev/data/onehot/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

One-Hot Encoding

One-Hot Encoding

It's common to encode categorical variables (like true, false or cat, dog) in "one-of-k" or "one-hot" form. Flux provides the onehot function to make this easy.

julia> using Flux: onehot, onecold
+

One-Hot Encoding

One-Hot Encoding

It's common to encode categorical variables (like true, false or cat, dog) in "one-of-k" or "one-hot" form. Flux provides the onehot function to make this easy.

julia> using Flux: onehot, onecold
 
 julia> onehot(:b, [:a, :b, :c])
 3-element Flux.OneHotVector:
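(The diff truncates the output of onehot here; it is a sparse vector with true only in the position of :b. As a hedged sketch of the inverse operation, onecold picks the label of the largest entry; exact printing may differ between versions.)

julia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])
:c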
diff --git a/dev/gpu/index.html b/dev/gpu/index.html
index bcde7038..59d8376b 100644
--- a/dev/gpu/index.html
+++ b/dev/gpu/index.html
@@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 
 ga('create', 'UA-36890222-9', 'auto');
 ga('send', 'pageview');
-

GPU Support

GPU Support

Installation

To get GPU support for NVIDIA graphics cards, you need to install CuArrays.jl.

Steps needed

  1. Install the NVIDIA CUDA toolkit
  2. Install the NVIDIA cuDNN library
  3. In the Julia REPL, run ] add CuArrays

GPU Usage

Support for array operations on other hardware backends, like GPUs, is provided by external packages like CuArrays. Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.

For example, we can use CuArrays (with the cu converter) to run our basic example on an NVIDIA GPU.

(Note that you need to have CUDA available to use CuArrays – please see the CuArrays.jl instructions for more details.)

using CuArrays
+

GPU Support

GPU Support

Installation

To get GPU support for NVIDIA graphics cards, you need to install CuArrays.jl.

Steps needed

  1. Install the NVIDIA CUDA toolkit
  2. Install the NVIDIA cuDNN library
  3. In the Julia REPL, run ] add CuArrays

GPU Usage

Support for array operations on other hardware backends, like GPUs, is provided by external packages like CuArrays. Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.

For example, we can use CuArrays (with the cu converter) to run our basic example on an NVIDIA GPU.

(Note that you need to have CUDA available to use CuArrays – please see the CuArrays.jl instructions for more details.)

using CuArrays
 
 W = cu(rand(2, 5)) # a 2×5 CuArray
 b = cu(rand(2))
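Whole models can be moved in the same way. The sketch below is a hedged illustration using Flux's gpu helper (the assumption here is that gpu falls back to the identity when no GPU is available, so the same code also runs on a CPU-only machine):

using Flux, CuArrays

m = Chain(Dense(10, 5, relu), Dense(5, 2), softmax)

gm = gpu(m)          # move all of the model's parameters to the GPU
gx = gpu(rand(10))   # move a data point too

gm(gx)               # the forward pass now runs on the GPU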
diff --git a/dev/index.html b/dev/index.html
index 7b6afbeb..4c1aba7f 100644
--- a/dev/index.html
+++ b/dev/index.html
@@ -6,4 +6,4 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 
 ga('create', 'UA-36890222-9', 'auto');
 ga('send', 'pageview');
-

Home

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs for features like regularisation or embeddings. Instead, writing down the mathematical form will work – and be fast.
  • You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own.
  • Play nicely with others. Flux works well with Julia libraries from data frames and images to differential equation solvers, so you can easily build complex data processing pipelines that integrate Flux models.

Installation

Download Julia 1.0 or later, if you haven't already. You can add Flux using Julia's package manager, by typing ] add Flux at the Julia prompt.

If you have CUDA you can also run ] add CuArrays to get GPU support; see here for more details.

Learning Flux

There are several different ways to learn Flux. If you just want to get started writing models, the model zoo gives good starting points for many common ones. This documentation provides a reference to all of Flux's APIs, as well as a from-scratch introduction to Flux's take on models and how they work. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

+

Home

Flux: The Julia Machine Learning Library

Flux is a library for machine learning. It comes "batteries-included" with many useful tools built in, but also lets you use the full power of the Julia language where you need it. We follow a few key principles:

  • Doing the obvious thing. Flux has relatively few explicit APIs for features like regularisation or embeddings. Instead, writing down the mathematical form will work – and be fast.
  • You could have written Flux. All of it, from LSTMs to GPU kernels, is straightforward Julia code. When in doubt, it’s well worth looking at the source. If you need something different, you can easily roll your own.
  • Play nicely with others. Flux works well with Julia libraries from data frames and images to differential equation solvers, so you can easily build complex data processing pipelines that integrate Flux models.

Installation

Download Julia 1.0 or later, if you haven't already. You can add Flux using Julia's package manager, by typing ] add Flux at the Julia prompt.

If you have CUDA you can also run ] add CuArrays to get GPU support; see here for more details.

Learning Flux

There are several different ways to learn Flux. If you just want to get started writing models, the model zoo gives good starting points for many common ones. This documentation provides a reference to all of Flux's APIs, as well as a from-scratch introduction to Flux's take on models and how they work. Once you understand these docs, congratulations, you also understand Flux's source code, which is intended to be concise, legible and a good reference for more advanced concepts.

diff --git a/dev/internals/tracker/index.html b/dev/internals/tracker/index.html index a4bbd412..04a64e31 100644 --- a/dev/internals/tracker/index.html +++ b/dev/internals/tracker/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Backpropagation

Flux.Tracker

Backpropagation, or reverse-mode automatic differentiation, is handled by the Flux.Tracker module.

julia> using Flux.Tracker

Here we discuss some more advanced uses of this module, as well as covering its internals.

Taking Gradients

In the basics section we covered basic usage of the gradient function.

using Flux.Tracker
+

Backpropagation

Flux.Tracker

Backpropagation, or reverse-mode automatic differentiation, is handled by the Flux.Tracker module.

julia> using Flux.Tracker

Here we discuss some more advanced uses of this module, as well as covering its internals.

Taking Gradients

In the basics section we covered basic usage of the gradient function.

using Flux.Tracker
 
 Tracker.gradient((a, b) -> a*b, 2, 3) # (3.0 (tracked), 2.0 (tracked))

gradient is actually just a thin wrapper around the backpropagator-based interface, forward.

using Flux.Tracker: forward
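The diff cuts this example short. A hedged sketch of how forward is typically used, based on the API described above: it returns the result together with a backpropagator that maps output sensitivities to input gradients (exact printing of tracked values may vary).

y, back = forward((a, b) -> a*b, 2, 3)   # y == 6 (tracked)
back(1)                                  # (3.0 (tracked), 2.0 (tracked))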
 
@@ -63,4 +63,4 @@ Flux.Tracker.Tracked{Array{Float64,1}}(0x00000000, Flux.Tracker.Call{Nothing,Tup
  -2.0
  -2.0

The tracker also contains a Call object, which simply represents a function call that was made at some point during the forward pass. For example, the + call would look like this:

julia> Tracker.Call(+, 1, 2)
 Flux.Tracker.Call{Base.#+,Tuple{Int64,Int64}}(+, (1, 2))

In the case of the y we produced above, we can see that it stores the call that produced it – that is, W*x.

julia> y.tracker.f
-Flux.Tracker.Call{...}(*, (param([1.0 2.0; 3.0 4.0]), param([5.0, 6.0])))

Notice that because the arguments to the call may also be tracked arrays, storing their own calls, this means that Tracker ends up forming a data structure that records everything that happened during the forward pass (often known as a tape).

When we call back!(y, [1, -1]), the sensitivities [1, -1] simply get forwarded to y's call (*), effectively calling

Tracker.back(*, [1, -1], W, x)

which in turn calculates the sensitivities of the arguments (W and x) and back-propagates through their calls. This is recursive, so it will walk the entire program graph and propagate gradients to the original model parameters.

+Flux.Tracker.Call{...}(*, (param([1.0 2.0; 3.0 4.0]), param([5.0, 6.0])))

Notice that because the arguments to the call may also be tracked arrays, storing their own calls, this means that Tracker ends up forming a data structure that records everything that happened during the forward pass (often known as a tape).

When we call back!(y, [1, -1]), the sensitivities [1, -1] simply get forwarded to y's call (*), effectively calling

Tracker.back(*, [1, -1], W, x)

which in turn calculates the sensitivities of the arguments (W and x) and back-propagates through their calls. This is recursive, so it will walk the entire program graph and propagate gradients to the original model parameters.

diff --git a/dev/models/basics/index.html b/dev/models/basics/index.html index c51f4177..99089dff 100644 --- a/dev/models/basics/index.html +++ b/dev/models/basics/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Basics

Model-Building Basics

Taking Gradients

Flux's core feature is taking gradients of Julia code. The gradient function takes another Julia function f and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.)

julia> using Flux.Tracker
+

Basics

Model-Building Basics

Taking Gradients

Flux's core feature is taking gradients of Julia code. The gradient function takes another Julia function f and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.)

julia> using Flux.Tracker
 
 julia> f(x) = 3x^2 + 2x + 1;
 
diff --git a/dev/models/layers/index.html b/dev/models/layers/index.html
index 77219f85..8ae89f32 100644
--- a/dev/models/layers/index.html
+++ b/dev/models/layers/index.html
@@ -6,34 +6,34 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 
 ga('create', 'UA-36890222-9', 'auto');
 ga('send', 'pageview');
-

Model Reference

Basic Layers

These core layers form the foundation of almost all neural networks.

Flux.ChainType.
Chain(layers...)

Chain multiple layers / functions together, so that they are called in sequence on a given input.

m = Chain(x -> x^2, x -> x+1)
+

Model Reference

Basic Layers

These core layers form the foundation of almost all neural networks.

Flux.ChainType.
Chain(layers...)

Chain multiple layers / functions together, so that they are called in sequence on a given input.

m = Chain(x -> x^2, x -> x+1)
 m(5) == 26
 
 m = Chain(Dense(10, 5), Dense(5, 2))
 x = rand(10)
-m(x) == m[2](m[1](x))

Chain also supports indexing and slicing, e.g. m[2] or m[1:end-1]. m[1:3](x) will calculate the output of the first three layers.

source
Flux.DenseType.
Dense(in::Integer, out::Integer, σ = identity)

Creates a traditional Dense layer with parameters W and b.

y = σ.(W * x .+ b)

The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The output y will be a vector or batch of length out.

julia> d = Dense(5, 2)
+m(x) == m[2](m[1](x))

Chain also supports indexing and slicing, e.g. m[2] or m[1:end-1]. m[1:3](x) will calculate the output of the first three layers.

source
Flux.DenseType.
Dense(in::Integer, out::Integer, σ = identity)

Creates a traditional Dense layer with parameters W and b.

y = σ.(W * x .+ b)

The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The output y will be a vector or batch of length out.

julia> d = Dense(5, 2)
 Dense(5, 2)
 
 julia> d(rand(5))
 Tracked 2-element Array{Float64,1}:
   0.00257447
-  -0.00449443
source
Flux.ConvType.
Conv(size, in=>out)
-Conv(size, in=>out, relu)

Standard convolutional layer. size should be a tuple like (2, 2). in and out specify the number of input and output channels respectively.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad, stride and dilation.

source
Flux.MaxPoolType.
MaxPool(k)

Max pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source
Flux.MeanPoolType.
MeanPool(k)

Mean pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source

Additional Convolution Layers

DepthwiseConv(size, in)
+  -0.00449443
source
Flux.ConvType.
Conv(size, in=>out)
+Conv(size, in=>out, relu)

Standard convolutional layer. size should be a tuple like (2, 2). in and out specify the number of input and output channels respectively.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad, stride and dilation.

source
Flux.MaxPoolType.
MaxPool(k)

Max pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source
Flux.MeanPoolType.
MeanPool(k)

Mean pooling layer. k stands for the size of the window for each dimension of the input.

Takes the keyword arguments pad and stride.

source

Additional Convolution Layers

DepthwiseConv(size, in)
 DepthwiseConv(size, in=>mul)
-DepthwiseConv(size, in=>mul, relu)

Depthwise convolutional layer. size should be a tuple like (2, 2). in and mul specify the number of input channels and the channel multiplier respectively. If mul is not specified, it defaults to 1.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad and stride.

source
ConvTranspose(size, in=>out)
-ConvTranspose(size, in=>out, relu)

Standard convolutional transpose layer. size should be a tuple like (2, 2). in and out specify the number of input and output channels respectively. Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. Takes the keyword arguments pad, stride and dilation.

source

Recurrent Layers

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction.
RNN(in::Integer, out::Integer, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

source
Flux.LSTMFunction.
LSTM(in::Integer, out::Integer)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.GRUFunction.
GRU(in::Integer, out::Integer)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.RecurType.
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs.

accum(h, x) = (h+x, x)
+DepthwiseConv(size, in=>mul, relu)

Depthwise convolutional layer. size should be a tuple like (2, 2). in and mul specify the number of input channels and the channel multiplier respectively. If mul is not specified, it defaults to 1.

Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array.

Takes the keyword arguments pad and stride.

source
ConvTranspose(size, in=>out)
+ConvTranspose(size, in=>out, relu)

Standard convolutional transpose layer. size should be a tuple like (2, 2). in and out specify the number of input and output channels respectively. Data should be stored in WHCN order. In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array. Takes the keyword arguments pad, stride and dilation.

source

Recurrent Layers

Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

Flux.RNNFunction.
RNN(in::Integer, out::Integer, σ = tanh)

The most basic recurrent layer; essentially acts as a Dense layer, but with the output fed back into the input each time step.

source
Flux.LSTMFunction.
LSTM(in::Integer, out::Integer)

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.GRUFunction.
GRU(in::Integer, out::Integer)

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

See this article for a good overview of the internals.

source
Flux.RecurType.
Recur(cell)

Recur takes a recurrent cell and makes it stateful, managing the hidden state in the background. cell should be a model of the form:

h, y = cell(h, x...)

For example, here's a recurrent network that keeps a running total of its inputs.

accum(h, x) = (h+x, x)
 rnn = Flux.Recur(accum, 0)
 rnn(2) # 2
 rnn(3) # 3
 rnn.state # 5
 rnn.(1:10) # apply to a sequence
-rnn.state # 60
source

Activation Functions

Non-linearities that go between layers of your model. Most of these functions are defined in NNlib but are available by default in Flux.

Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on.

NNlib.σFunction.
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function.

NNlib.reluFunction.
relu(x) = max(0, x)

Rectified Linear Unit activation function.

NNlib.leakyreluFunction.
leakyrelu(x) = max(0.01x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

NNlib.eluFunction.
elu(x, α = 1) =
+rnn.state # 60
source

Activation Functions

Non-linearities that go between layers of your model. Most of these functions are defined in NNlib but are available by default in Flux.

Note that, unless otherwise stated, activation functions operate on scalars. To apply them to an array you can call σ.(xs), relu.(xs) and so on.

NNlib.σFunction.
σ(x) = 1 / (1 + exp(-x))

Classic sigmoid activation function.

NNlib.reluFunction.
relu(x) = max(0, x)

Rectified Linear Unit activation function.

NNlib.leakyreluFunction.
leakyrelu(x) = max(0.01x, x)

Leaky Rectified Linear Unit activation function. You can also specify the coefficient explicitly, e.g. leakyrelu(x, 0.01).

NNlib.eluFunction.
elu(x, α = 1) =
   x > 0 ? x : α * (exp(x) - 1)

Exponential Linear Unit activation function. See Fast and Accurate Deep Network Learning by Exponential Linear Units. You can also specify the coefficient explicitly, e.g. elu(x, 1).

NNlib.swishFunction.
swish(x) = x * σ(x)

Self-gated activation function. See Swish: a Self-Gated Activation Function.

Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting.

Flux.testmode!Function.
testmode!(m)
-testmode!(m, false)

Put layers like Dropout and BatchNorm into testing mode (or back to training mode with false).

source
Flux.BatchNormType.
BatchNorm(channels::Integer, σ = identity;
+testmode!(m, false)

Put layers like Dropout and BatchNorm into testing mode (or back to training mode with false).

source
Flux.BatchNormType.
BatchNorm(channels::Integer, σ = identity;
           initβ = zeros, initγ = ones,
           ϵ = 1e-8, momentum = .1)

Batch Normalization layer. The channels input should be the size of the channel dimension in your data (see below).

Given an array with N dimensions, call the N-1th the channel dimension. (For a batch of feature vectors this is just the data dimension, for WHCN images it's the usual channel dimension.)

BatchNorm computes the mean and variance for each W×H×1×N slice and shifts them to have a new mean and variance (corresponding to the learnable, per-channel bias and scale parameters).

See Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

Example:

m = Chain(
   Dense(28^2, 64),
   BatchNorm(64, relu),
   Dense(64, 10),
   BatchNorm(10),
-  softmax)
source
Flux.DropoutType.
Dropout(p)

A Dropout layer. For each input, either sets that input to 0 (with probability p) or scales it by 1/(1-p). This is used as a regularisation, i.e. it reduces overfitting during training.

Does nothing to the input once in testmode!.

source
Flux.LayerNormType.
LayerNorm(h::Integer)

A normalisation layer designed to be used with recurrent hidden states of size h. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.

source
+ softmax)
source
Flux.DropoutType.
Dropout(p)

A Dropout layer. For each input, either sets that input to 0 (with probability p) or scales it by 1/(1-p). This is used as a regularisation, i.e. it reduces overfitting during training.

Does nothing to the input once in testmode!.

source
Flux.LayerNormType.
LayerNorm(h::Integer)

A normalisation layer designed to be used with recurrent hidden states of size h. Normalises the mean/stddev of each input before applying a per-neuron gain/bias.

source
diff --git a/dev/models/recurrence/index.html b/dev/models/recurrence/index.html index 27f7d901..9f2de587 100644 --- a/dev/models/recurrence/index.html +++ b/dev/models/recurrence/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Recurrence

Recurrent Models

Recurrent Cells

In the simple feedforward case, our model m is a simple function from various inputs xᵢ to predictions yᵢ. (For example, each x might be an MNIST digit and each y a digit label.) Each prediction is completely independent of any others, and using the same x will always produce the same y.

y₁ = f(x₁)
+

Recurrence

Recurrent Models

Recurrent Cells

In the simple feedforward case, our model m is a simple function from various inputs xᵢ to predictions yᵢ. (For example, each x might be an MNIST digit and each y a digit label.) Each prediction is completely independent of any others, and using the same x will always produce the same y.

y₁ = f(x₁)
 y₂ = f(x₂)
 y₃ = f(x₃)
 # ...

Recurrent networks introduce a hidden state that gets carried over each time we run the model. The model now takes the old h as an input, and produces a new h as output, each time we run it.

h = # ... initial state ...
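The diff cuts this block off after the initial state. As a hedged sketch, the calls that follow thread h through each step, each one consuming the previous hidden state and producing the next:

h, y₁ = f(h, x₁)
h, y₂ = f(h, x₂)
h, y₃ = f(h, x₃)
# ...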
diff --git a/dev/models/regularisation/index.html b/dev/models/regularisation/index.html
index 6646d937..9dd23051 100644
--- a/dev/models/regularisation/index.html
+++ b/dev/models/regularisation/index.html
@@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
 
 ga('create', 'UA-36890222-9', 'auto');
 ga('send', 'pageview');
-

Regularisation

Regularisation

Applying regularisation to model parameters is straightforward. We just need to apply an appropriate regulariser, such as norm, to each model parameter and add the result to the overall loss.

For example, say we have a simple regression.

using Flux: crossentropy
+

Regularisation

Regularisation

Applying regularisation to model parameters is straightforward. We just need to apply an appropriate regulariser, such as norm, to each model parameter and add the result to the overall loss.

For example, say we have a simple regression.

using Flux: crossentropy
 m = Dense(10, 5)
 loss(x, y) = crossentropy(softmax(m(x)), y)

We can regularise this by taking the (L2) norm of the parameters, m.W and m.b.

penalty() = norm(m.W) + norm(m.b)
 loss(x, y) = crossentropy(softmax(m(x)), y) + penalty()

When working with layers, Flux provides the params function to grab all parameters at once. We can easily penalise everything with sum(norm, params).

julia> params(m)
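The params(m) output is truncated by the diff. A hedged sketch of the sum(norm, params) pattern described above (note that norm comes from the LinearAlgebra standard library on Julia 1.0):

using LinearAlgebra: norm

penalty() = sum(norm, params(m))
loss(x, y) = crossentropy(softmax(m(x)), y) + penalty()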
diff --git a/dev/performance/index.html b/dev/performance/index.html
new file mode 100644
index 00000000..9a981139
--- /dev/null
+++ b/dev/performance/index.html
@@ -0,0 +1,20 @@
+
+Performance Tips · Flux

Performance Tips

Performance Tips

All the usual Julia performance tips apply. As always, profiling your code is a useful way of finding bottlenecks. Below are some Flux-specific tips and reminders.

Don't use more precision than you need.

Flux works great with all kinds of number types. But often you do not need to work with, say, Float64 (let alone BigFloat). Switching to Float32 can give you a significant speedup, not because the operations themselves are faster, but because memory usage is halved, so allocations are cheaper and you use less memory overall.
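As a hedged illustration, halving precision is usually just a matter of converting the data you feed in; nothing Flux-specific is involved:

x64 = rand(10)        # Float64 by default
x32 = Float32.(x64)   # same values, half the memory per element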

Make sure your custom activation functions preserve the type of their inputs

Not only should your activation functions be type-stable, they should also preserve the type of their inputs.

A very artificial example: using an activation function like

    my_tanh(x) = Float64(tanh(x))

will make performance on Float32 input orders of magnitude slower than the normal tanh would be, because it forces slow mixed-type multiplication in the dense layers.

This means that if you change your data from, say, Float64 to Float32 (which should give a speedup: see above), you will instead see a large slowdown.

This can occur sneakily, because type promotion can be triggered by interaction with numeric literals. For example, the following runs into the same problem as above:

    leaky_tanh(x) = 0.01x + tanh(x)

While you could change the activation function (e.g. to use 0.01f0x) to avoid this whenever your input type changes, the idiomatic (and safe) way is to use oftype:

    leaky_tanh(x) = oftype(x/1, 0.01)*x + tanh(x)
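A quick, hedged way to check the property, assuming the corrected definition above:

leaky_tanh(0.5f0) isa Float32   # true: the 0.01 literal no longer promotes Float32 input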

Evaluate batches as Matrices of features, rather than sequences of Vector features

While it can sometimes be tempting to process your observations (feature vectors) one at a time, e.g.

function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})
+    sum(zip(xs, ys)) do (x, y_target)
+        y_pred = model(x) #  evaluate the model
+        return loss(y_pred, y_target)
+    end
+end

It is much faster to concatenate them into a matrix, since this hits BLAS matrix-matrix multiplication, which outperforms the equivalent sequence of matrix-vector multiplications, even though it means allocating new memory to store them contiguously.

x_batch = reduce(hcat, xs)
+y_batch = reduce(hcat, ys)
+...
+function loss_total(x_batch::Matrix, y_batch::Matrix)
+    y_preds = model(x_batch)
+    sum(loss.(y_preds, y_batch))
+end

When doing this kind of concatenation, use reduce(hcat, xs) rather than hcat(xs...). This avoids the splatting penalty and hits the optimised reduce method.

diff --git a/dev/saving/index.html b/dev/saving/index.html index 0b3c93f4..6f21561b 100644 --- a/dev/saving/index.html +++ b/dev/saving/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Saving & Loading

Saving and Loading Models

You may wish to save models so that they can be loaded and run in a later session. The easiest way to do this is via BSON.jl.

Save a model:

julia> using Flux
+

Saving & Loading

Saving and Loading Models

You may wish to save models so that they can be loaded and run in a later session. The easiest way to do this is via BSON.jl.

Save a model:

julia> using Flux
 
 julia> model = Chain(Dense(10,5,relu),Dense(5,2),softmax)
 Chain(Dense(10, 5, NNlib.relu), Dense(5, 2), NNlib.softmax)
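The diff truncates the save example here. A hedged sketch of the BSON round trip (the file name is purely illustrative):

using BSON: @save, @load

@save "mymodel.bson" model    # write the model to disk

# later, possibly in a fresh Julia session:
@load "mymodel.bson" model    # recreates the variable `model`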
@@ -47,4 +47,4 @@ evalcb = throttle(30) do
   # Show loss
   @save "model-checkpoint.bson" model
 end

This will update the "model-checkpoint.bson" file every thirty seconds.

You can get more advanced by saving a series of models throughout training, for example

@save "model-$(now()).bson" model

will produce a series of models like "model-2018-03-06T02:57:10.41.bson". You could also store the current test set loss, so that it's easy to (for example) revert to an older copy of the model if it starts to overfit.

@save "model-$(now()).bson" model loss = testloss()

You can even store optimiser state alongside the model, to resume training exactly where you left off.

opt = ADAM(params(model))
-@save "model-$(now()).bson" model opt
+@save "model-$(now()).bson" model opt
diff --git a/dev/search/index.html b/dev/search/index.html index 8105e38b..fa2bff6a 100644 --- a/dev/search/index.html +++ b/dev/search/index.html @@ -6,4 +6,4 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

Search

Search

Number of results: loading...

    +

    Search

    Search

    Number of results: loading...

      diff --git a/dev/search_index.js b/dev/search_index.js index 1dff79ae..04ab508f 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -544,6 +544,46 @@ var documenterSearchIndex = {"docs": [ "text": "In longer training runs it\'s a good idea to periodically save your model, so that you can resume if training is interrupted (for example, if there\'s a power cut). You can do this by saving the model in the callback provided to train!.using Flux: throttle\nusing BSON: @save\n\nm = Chain(Dense(10,5,relu),Dense(5,2),softmax)\n\nevalcb = throttle(30) do\n # Show loss\n @save \"model-checkpoint.bson\" model\nendThis will update the \"model-checkpoint.bson\" file every thirty seconds.You can get more advanced by saving a series of models throughout training, for example@save \"model-$(now()).bson\" modelwill produce a series of models like \"model-2018-03-06T02:57:10.41.bson\". You could also store the current test set loss, so that it\'s easy to (for example) revert to an older copy of the model if it starts to overfit.@save \"model-$(now()).bson\" model loss = testloss()You can even store optimiser state alongside the model, to resume training exactly where you left off.opt = ADAM(params(model))\n@save \"model-$(now()).bson\" model opt" }, +{ + "location": "performance/#", + "page": "Performance Tips", + "title": "Performance Tips", + "category": "page", + "text": "" +}, + +{ + "location": "performance/#Performance-Tips-1", + "page": "Performance Tips", + "title": "Performance Tips", + "category": "section", + "text": "All the usual Julia performance tips apply. As always profiling your code is generally a useful way of finding bottlenecks. Below follow some Flux specific tips/reminders." +}, + +{ + "location": "performance/#Don\'t-use-more-precision-than-you-need.-1", + "page": "Performance Tips", + "title": "Don\'t use more precision than you need.", + "category": "section", + "text": "Flux works great with all kinds of number types. But often you do not need to be working with say Float64 (let alone BigFloat). Switching to Float32 can give you a significant speed up, not because the operations are faster, but because the memory usage is halved. Which means allocations occur much faster. And you use less memory." +}, + +{ + "location": "performance/#Make-sure-your-custom-activation-functions-preserve-the-type-of-their-inputs-1", + "page": "Performance Tips", + "title": "Make sure your custom activation functions preserve the type of their inputs", + "category": "section", + "text": "Not only should your activation functions be type-stable, they should also preserve the type of their inputs.A very artificial example using an activatioon function like my_tanh(x) = Float64(tanh(x))will result in performance on Float32 input orders of magnitude slower than the normal tanh would, because it results in having to use slow mixed type multiplication in the dense layers.Which means if you change your data say from Float64 to Float32 (which should give a speedup: see above), you will see a large slow-downThis can occur sneakily, because you can cause type-promotion by interacting with a numeric literals. E.g. the following will have run into the same problem as above: leaky_tanh(x) = 0.01x + tanh(x)While one could change your activation function (e.g. to use 0.01f0x) to avoid this when ever your inputs change, the idiomatic (and safe way) is to use oftype. 
leaky_tanh(x) = oftype(x/1, 0.01) + tanh(x)" +}, + +{ + "location": "performance/#Evaluate-batches-as-Matrices-of-features,-rather-than-sequences-of-Vector-features-1", + "page": "Performance Tips", + "title": "Evaluate batches as Matrices of features, rather than sequences of Vector features", + "category": "section", + "text": "While it can sometimes be tempting to process your observations (feature vectors) one at a time e.g.function loss_total(xs::AbstractVector{<:Vector}, ys::AbstractVector{<:Vector})\n sum(zip(xs, ys)) do (x, y_target)\n y_pred = model(x) # evaluate the model\n return loss(y_pred, y_target)\n end\nendIt is much faster to concatenate them into a matrix, as this will hit BLAS matrix-matrix multiplication, which is much faster than the equivalent sequence of matrix-vector multiplications. Even though this means allocating new memory to store them contiguously.x_batch = reduce(hcat, xs)\ny_batch = reduce(hcat, ys)\n...\nfunction loss_total(x_batch::Matrix, y_batch::Matrix)\n y_preds = model(x_batch)\n sum(loss.(y_preds, y_batch))\nendWhen doing this kind of concatenation use reduce(hcat, xs) rather than hcat(xs...). This will avoid the splatting penality, and will hit the optimised reduce method." +}, + { "location": "internals/tracker/#", "page": "Backpropagation", diff --git a/dev/training/optimisers/index.html b/dev/training/optimisers/index.html index 6f9d12fe..513db317 100644 --- a/dev/training/optimisers/index.html +++ b/dev/training/optimisers/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

      Optimisers

      Optimisers

      Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.

      using Flux, Flux.Tracker
      +

      Optimisers

      Optimisers

      Consider a simple linear regression. We create some dummy data, calculate a loss, and backpropagate to calculate gradients for the parameters W and b.

      using Flux, Flux.Tracker
       
       W = param(rand(2, 5))
       b = param(rand(2))
      @@ -27,4 +27,4 @@ end

      Running this will alter the parameters W and

      An optimiser update! accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass opt to our training loop, which will update all parameters of the model in a loop. However, we can now easily replace Descent with a more advanced optimiser such as ADAM.
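A hedged sketch of that pattern, assuming the update!(opt, parameter, gradient) form this paragraph describes; loss, x, y, W and b are the quantities defined earlier on this page (truncated by the diff), and the exact module that exports update! has shifted between Flux releases:

opt = Descent(0.1)   # plain gradient descent with learning rate 0.1

θ = Flux.params(W, b)
grads = Tracker.gradient(() -> loss(x, y), θ)

for p in θ
  Flux.Optimise.update!(opt, p, grads[p])   # adjusts p in place according to the rule
end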

      Optimiser Reference

      All optimisers return an object that, when passed to train!, will update the parameters passed to it.

      Descent(η)

      Classic gradient descent optimiser with learning rate η. For each parameter p and its gradient δp, this runs p -= η*δp.

      source
      Momentum(params, η = 0.01; ρ = 0.9)

      Gradient descent with learning rate η and momentum ρ.

      source
      Nesterov(eta, ρ = 0.9)

      Gradient descent with learning rate η and Nesterov momentum ρ.

      source
      ADAM(η = 0.001, β = (0.9, 0.999))

      ADAM optimiser.

      source
      +end

      An optimiser update! accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass opt to our training loop, which will update all parameters of the model in a loop. However, we can now easily replace Descent with a more advanced optimiser such as ADAM.

      Optimiser Reference

      All optimisers return an object that, when passed to train!, will update the parameters passed to it.

      Descent(η)

      Classic gradient descent optimiser with learning rate η. For each parameter p and its gradient δp, this runs p -= η*δp.

      source
      Momentum(params, η = 0.01; ρ = 0.9)

      Gradient descent with learning rate η and momentum ρ.

      source
      Nesterov(eta, ρ = 0.9)

      Gradient descent with learning rate η and Nesterov momentum ρ.

      source
      ADAM(η = 0.001, β = (0.9, 0.999))

      ADAM optimiser.

      source
      diff --git a/dev/training/training/index.html b/dev/training/training/index.html index 726436b1..7946d5e2 100644 --- a/dev/training/training/index.html +++ b/dev/training/training/index.html @@ -6,7 +6,7 @@ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) ga('create', 'UA-36890222-9', 'auto'); ga('send', 'pageview'); -

      Training

      Training

      To actually train a model we need three things:

      • An objective function that evaluates how well a model is doing given some input data.
      • A collection of data points that will be provided to the objective function.
      • An optimiser that will update the model parameters appropriately.

      With these we can call Flux.train!:

      Flux.train!(objective, params, data, opt)

      There are plenty of examples in the model zoo.

      Loss Functions

      The objective function must return a number representing how far the model is from its target – the loss of the model. The loss function that we defined in basics will work as an objective. We can also define an objective in terms of some model:

      m = Chain(
      +

      Training

      Training

      To actually train a model we need three things:

      • An objective function that evaluates how well a model is doing given some input data.
      • A collection of data points that will be provided to the objective function.
      • An optimiser that will update the model parameters appropriately.

      With these we can call Flux.train!:

      Flux.train!(objective, params, data, opt)

      There are plenty of examples in the model zoo.

      Loss Functions

      The objective function must return a number representing how far the model is from its target – the loss of the model. The loss function that we defined in basics will work as an objective. We can also define an objective in terms of some model:

      m = Chain(
         Dense(784, 32, σ),
         Dense(32, 10), softmax)
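A hedged sketch of wiring this model into train!; the data iterable and optimiser below are placeholders (any iterable of (x, y) pairs and any optimiser such as ADAM() would do):

loss(x, y) = Flux.crossentropy(m(x), y)

opt = ADAM()
data = [(rand(784), Flux.onehot(3, 0:9))]   # a single dummy (x, y) pair, for illustration only

Flux.train!(loss, Flux.params(m), data, opt)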