# Model-Building Basics

## Taking Gradients

Flux's core feature is taking gradients of Julia code. The `gradient` function takes another Julia function `f` and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.)
```jldoctest basics
julia> using Flux

julia> f(x) = 3x^2 + 2x + 1;

julia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2

julia> df(2)
14

julia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6

julia> d2f(2)
6
```
When a function has many parameters, we can get gradients of each one at the same time:

```jldoctest basics
julia> f(x, y) = sum((x .- y).^2);

julia> gradient(f, [2, 1], [2, 0])
([0, 2], [0, -2])
```
But machine learning models can have *hundreds* of parameters! To handle this, Flux lets you work with collections of parameters, via `params`. You can get the gradient of all parameters used in a program without explicitly passing them in.

```jldoctest basics
julia> using Flux

julia> x = [2, 1];

julia> y = [2, 0];

julia> gs = gradient(params(x, y)) do
         f(x, y)
       end
Grads(...)

julia> gs[x]
2-element Array{Int64,1}:
 0
 2

julia> gs[y]
2-element Array{Int64,1}:
  0
 -2
```
Here, `gradient` takes a zero-argument function; no arguments are necessary because the `params` tell it what to differentiate.
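
Written out with an explicit zero-argument closure (the form used later on this page), the same gradient call looks like this; a short sketch using the `f`, `x` and `y` defined above:

```julia
# equivalent to the do-block form above
gs = gradient(() -> f(x, y), params(x, y))

gs[x] # => [0, 2]
gs[y] # => [0, -2]
```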
2018-07-11 14:31:22 +00:00
This will come in really handy when dealing with big, complicated models. For now, though, let's start with something simple.
## Simple Models
Consider a simple linear regression, which tries to predict an output array `y` from an input `x`.
```julia
W = rand(2, 5)
b = rand(2)

predict(x) = W*x .+ b

function loss(x, y)
  ŷ = predict(x)
  sum((y .- ŷ).^2)
end

x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3
```
To improve the prediction we can take the gradients of the loss with respect to `W` and `b` and perform gradient descent.

```julia
using Flux

gs = gradient(() -> loss(x, y), params(W, b))
```
Now that we have gradients, we can pull them out and update `W` to train the model.

```julia
W̄ = gs[W]

W .-= 0.1 .* W̄

loss(x, y) # ~ 2.5
```
The loss has decreased a little, meaning that our prediction is closer to the target `y`. If we have some data we can already try [training the model](../training/training.md).
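
Rather than writing the update rule out by hand for every parameter, you can also let one of Flux's optimisers apply it for you. A minimal sketch using `Descent` and the gradients `gs` computed above (the `0.1` is the same step size as before):

```julia
using Flux.Optimise: update!, Descent

opt = Descent(0.1)  # plain gradient descent with learning rate 0.1

# update each parameter in place using its gradient
for p in (W, b)
  update!(opt, p, gs[p])
end
```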
All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can *look* very different: they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.
## Building Layers

It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) (`σ`) in between them. In the above style we could write this as:

```julia
using Flux

W1 = rand(3, 5)
b1 = rand(3)
layer1(x) = W1 * x .+ b1

W2 = rand(2, 3)
b2 = rand(2)
layer2(x) = W2 * x .+ b2

model(x) = layer2(σ.(layer1(x)))

model(rand(5)) # => 2-element vector
```
This works but is fairly unwieldy, with a lot of repetition, especially as we add more layers. One way to factor this out is to create a function that returns linear layers.

```julia
function linear(in, out)
  W = randn(out, in)
  b = randn(out)
  x -> W * x .+ b
end

linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)

model(x) = linear2(σ.(linear1(x)))

model(rand(5)) # => 2-element vector
```
Another (equivalent) way is to create a struct that explicitly represents the affine layer.

```julia
struct Affine
  W
  b
end

Affine(in::Integer, out::Integer) =
  Affine(randn(out, in), randn(out))

# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b

a = Affine(10, 5)

a(rand(10)) # => 5-element vector
```
Congratulations! You just built the `Dense` layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.

(There is one small difference with `Dense`: for convenience it also takes an activation function, like `Dense(10, 5, σ)`.)
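
To see the correspondence, here is a quick sketch that builds a `Dense` layer with the same shape as the `Affine(10, 5)` above:

```julia
using Flux

d = Dense(10, 5, σ)  # like Affine(10, 5), with a sigmoid activation built in

d(rand(10)) # => 5-element vector
```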
## Stacking It Up
It's pretty common to write models that look something like:
```julia
layer1 = Dense(10, 5, σ)
# ...
model(x) = layer3(layer2(layer1(x)))
```
For long chains, it might be a bit more intuitive to have a list of layers, like this:
```julia
using Flux
layers = [Dense(10, 5, σ), Dense(5, 2), softmax]
model(x) = foldl((x, m) -> m(x), layers, init = x)
model(rand(10)) # => 2-element vector
```
Handily, this is also provided for in Flux:
```julia
model2 = Chain(
  Dense(10, 5, σ),
  Dense(5, 2),
  softmax)
model2(rand(10)) # => 2-element vector
```
This quickly starts to look like a high-level deep learning library; yet you can see how it falls out of simple abstractions, and we lose none of the power of Julia code.
A nice property of this approach is that because "models" are just functions (possibly with trainable parameters), you can also see this as simple function composition.
```julia
m = Dense(5, 2) ∘ Dense(10, 5, σ)
m(rand(10))
```
Likewise, `Chain` will happily work with any Julia function.
```julia
m = Chain(x -> x^2, x -> x+1)
m(5) # => 26
```
## Layer helpers

Flux provides a set of helpers for custom layers, which you can enable by calling

```julia
Flux.@functor Affine
```
This enables a useful extra set of functionality for our `Affine` layer, such as [collecting its parameters](../training/optimisers.md) or [moving it to the GPU](../gpu.md).
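
For example, once `Affine` has been marked with `Flux.@functor`, `params` can collect its fields; a minimal sketch using the `Affine` layer defined above:

```julia
using Flux

Flux.@functor Affine

a = Affine(10, 5)

params(a) # => Params holding a.W and a.b
```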
For some more helpful tricks, including parameter freezing, please check out the [advanced usage guide](advanced.md).
## Utility functions
Flux provides some utility functions to help you generate models in an automated fashion.
`outdims` enables you to calculate the spatial output dimensions of layers like `Conv` when applied to input images of a given size.
Currently `outdims` is limited to the following layers:
- `Chain`
- `Dense`
- `Conv`
- `Diagonal`
- `Maxout`
- `ConvTranspose`
- `DepthwiseConv`
- `CrossCor`
- `MaxPool`
- `MeanPool`
```@docs
outdims
```
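
For example, a brief sketch for a small convolutional chain applied to a 10×10 input (each unpadded 3×3 convolution shrinks the spatial size by 2):

```julia
using Flux

m = Chain(Conv((3, 3), 3 => 16), Conv((3, 3), 16 => 32))

outdims(m, (10, 10)) # => (6, 6)
```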