docs updates

Mike J Innes 2017-09-09 19:58:32 -04:00
parent 366efa92ab
commit fedee95b14
5 changed files with 124 additions and 6 deletions


@@ -10,12 +10,11 @@ makedocs(modules=[Flux],
"Models" =>
["Basics" => "models/basics.md",
"Recurrence" => "models/recurrence.md",
"Layers" => "models/layers.md"],
"Layer Reference" => "models/layers.md"],
"Contributing & Help" => "contributing.md"])
deploydocs(
repo = "github.com/FluxML/Flux.jl.git",
# modules = [Flux],
target = "build",
osname = "linux",
julia = "0.6",


@@ -1,4 +1,4 @@
# Contributing
# Contributing & Help
If you need help, please ask on the [Julia forum](https://discourse.julialang.org/), the [Slack](https://discourse.julialang.org/t/announcing-a-julia-slack/4866) (channel #machine-learning), or Flux's [Gitter](https://gitter.im/FluxML/Lobby).


@@ -1,3 +1,5 @@
# Model-Building Basics
## Taking Gradients
Consider a simple linear regression, which tries to predict an output array `y` from an input `x`. (It's a good idea to follow this example in the Julia REPL.)
@@ -31,14 +33,14 @@ back!(l)
```julia
grad(W)
W.data .-= grad(W)
W.data .-= 0.1grad(W)
loss(x, y) # ~ 2.5
```
The loss has decreased a little, meaning that our prediction for `x` is closer to the target `y`. If we have some data, we can already try [training the model](../training/training.html).
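To get a feel for what training will do, we can simply repeat that update step a few times. This is only a rough sketch reusing the `loss`, `W`, `x` and `y` from this section, not Flux's built-in training utilities:
```julia
# repeat the manual gradient-descent step a few times
# (a sketch; assumes the `loss`, `W`, `x` and `y` defined earlier in this section)
for i = 1:5
  back!(loss(x, y))
  W.data .-= 0.1grad(W)
end
loss(x, y) # should be lower still
```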
All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, not all models look like this; they might have millions of parameters or complex control flow, and Flux provides ways to manage this complexity. Let's see what that looks like.
All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can *look* very different; they might have millions of parameters or complex control flow, and there are ways to manage this complexity. Let's see what that looks like.
## Building Layers


@@ -0,0 +1,114 @@
## Recurrent Cells
In the simple feedforward case, our model is a simple function `f` from various inputs `xᵢ` to predictions `yᵢ`. (For example, each `x` might be an MNIST digit and each `y` a digit label.) Each prediction is completely independent of any others, and using the same `x` will always produce the same `y`.
```julia
y₁ = f(x₁)
y₂ = f(x₂)
y₃ = f(x₃)
# ...
```
Recurrent networks introduce a *hidden state* that gets carried over each time we run the model. The model now takes the old `h` as an input, and produces a new `h` as output, each time we run it.
```julia
h = # ... initial state ...
y₁, h = f(x₁, h)
y₂, h = f(x₂, h)
y₃, h = f(x₃, h)
# ...
```
Information stored in `h` is preserved for the next prediction, allowing it to function as a kind of memory. This also means that the prediction made for a given `x` depends on all the inputs previously fed into the model.
(This might be important if, for example, each `x` represents one word of a sentence; the model's interpretation of the word "bank" should change if the previous input was "river" rather than "investment".)
Flux's RNN support closely follows this mathematical perspective. The most basic RNN is as close as possible to a standard `Dense` layer, and the output and hidden state are the same. By convention, the hidden state is the first input and output.
```julia
Wxh = randn(5, 10)
Whh = randn(5, 5)
b = randn(5)
function rnn(h, x)
  h = tanh.(Wxh * x .+ Whh * h .+ b)
  return h, h
end
x = rand(10) # dummy data
h = rand(5) # initial hidden state
h, y = rnn(h, x)
```
If you run the last line a few times, you'll notice the output `y` changing slightly even though the input `x` is the same.
We sometimes refer to functions like `rnn` above, which explicitly manage state, as recurrent *cells*. There are various recurrent cells available, which are documented in the [layer reference](layers.html). The hand-written example above can be replaced with:
```julia
using Flux
m = Flux.RNNCell(10, 5)
x = rand(10) # dummy data
h = rand(5) # initial hidden state
h, y = m(h, x)
```
## Stateful Models
For the most part, we don't want to manage hidden states ourselves, but to treat our models as being stateful. Flux provides the `Recur` wrapper to do this.
```julia
x = rand(10)
h = rand(5)
m = Flux.Recur(rnn, h)
y = m(x)
```
The `Recur` wrapper stores the state between runs in the `m.state` field.
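As a quick sanity check, we can peek at that field before and after a call (a small sketch continuing the example above):
```julia
m.state         # the current hidden state; starts out as the `h` we passed in
y = m(rand(10)) # running the model...
m.state         # ...leaves the updated state stored for the next call
```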
If you use the `RNN(10, 5)` constructor, as opposed to `RNNCell`, you'll see that it's simply a wrapped cell.
```julia
julia> RNN(10, 5)
Recur(RNNCell(Dense(15, 5)))
```
## Sequences
Often we want to work with sequences of inputs, rather than individual `x`s.
```julia
seq = [rand(10) for i = 1:10]
```
With `Recur`, applying our model to each element of a sequence is trivial:
```julia
map(m, seq) # returns a list of 5-element vectors
```
To make this a bit more convenient, Flux has the `Seq` type. This is just a list, but tagged so that we know it's meant to be used as a sequence of data points.
```julia
seq = Seq([rand(10) for i = 1:10])
m(seq) # returns a new Seq of length 10
```
When we apply the model `m` to a `Seq`, it gets mapped over every item in the sequence in order. This is just like the code above, but often more convenient.
## Truncating Gradients
By default, calculating the gradients in a recurrent layer involves the entire history. For example, if we call the model on 100 inputs, calling `back!` will calculate the gradient for those 100 calls. If we then calculate another 10 inputs, we have to calculate 110 gradients; this accumulates and quickly becomes expensive.
To avoid this we can *truncate* the gradient calculation, forgetting the history.
```julia
truncate!(m)
```
Calling `truncate!` wipes the slate clean, so we can call the model with more inputs without building up an expensive gradient computation.
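In practice this often looks something like the sketch below, which runs the stateful model over a stream of inputs in chunks and truncates between them; the chunk size and the loss computation are placeholders rather than part of Flux's API.
```julia
xs = [rand(10) for i = 1:10]      # a longer stream of inputs
for chunk in (xs[1:5], xs[6:10])  # process it in two chunks
  ys = map(m, chunk)              # run the stateful model over this chunk
  # ... compute a loss from `ys` and call back! on it here ...
  truncate!(m)                    # forget the history before the next chunk
end
```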


@@ -9,7 +9,7 @@ on a given input.
m = Chain(Dense(10, 5), Dense(5, 2))
x = rand(10)
m(x) = m[2](m[1](x))
m(x) == m[2](m[1](x))
`Chain` also supports indexing and slicing, e.g. `m[2]` or `m[1:end-1]`.
"""
@@ -42,6 +42,9 @@ end
Creates a traditional `Dense` layer with parameters `W` and `b`.
y = σ.(W * x .+ b)
The input `x` must be a vector of length `in`, or a batch of vectors represented
as an `in × N` matrix. The output `y` will be a vector or batch of length `out`.
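For example (a rough usage sketch, mirroring the `Chain` example above):
d = Dense(10, 5)
d(rand(10)) # a 5-element output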
"""
struct Dense{F,S,T}
σ::F