docs updates

Mike J Innes 2017-05-04 17:16:22 +01:00
parent c66195f47b
commit 70615ff7f2
3 changed files with 77 additions and 140 deletions


@ -9,7 +9,7 @@ Flux's core feature is the `@net` macro, which adds some superpowers to regular
f([1,2,3]) == [1,4,9]
```
This behaves as expected, but we have some extra features. For example, we can convert the function to run on [TensorFlow](https://www.tensorflow.org/) or [MXNet](https://github.com/dmlc/MXNet.jl):
```julia
f_mxnet = mxnet(f)
@ -21,13 +21,11 @@ Simples! Flux took care of a lot of boilerplate for us and just ran the multipli
Using MXNet, we can get the gradient of the function, too:
```julia
back!(f_mxnet, [1,1,1], [1,2,3]) == ([2.0, 4.0, 6.0],)
```
`f` is effectively `x^2`, so the gradient is `2x` as expected.
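To sanity-check that, the chain rule for an elementwise square just multiplies the output sensitivity by `2x`; this is plain Julia arithmetic, nothing Flux-specific:
```julia
x = [1, 2, 3]
Δ = [1, 1, 1]   # the output sensitivity we passed to back!
Δ .* 2 .* x     # [2, 4, 6] — the same gradient back! reported
```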
For TensorFlow users this may seem similar to building a graph as usual. The difference is that Julia code still behaves like Julia code. Error messages give you helpful stacktraces that pinpoint mistakes. You can step through the code in the debugger. The code runs when it's called, as usual, rather than running once to build the graph and then again to execute it.
## The Model
The core concept in Flux is the *model*. This corresponds to what might be called a "layer" or "module" in other frameworks. A model is simply a differentiable function with parameters. Given a model `m` we can do things like:
@ -38,22 +36,72 @@ back!(m, Δ, x) # backpropagate the gradient `Δ` through `m`
update!(m, η) # update the parameters of `m` using the gradient
```
We can implement a model however we like as long as it fits this interface. But as hinted above, `@net` is a particularly easy way to do it, because it gives you these functions for free.
## Parameters
Consider how we'd write a logistic regression. We just take the Julia code and add `@net`.
```julia
@net logistic(W, b, x) = softmax(x * W .+ b)
W = randn(10, 2)
b = randn(1, 2)
x = rand(1, 10) # [0.563 0.346 0.780 …] fake data
y = [1 0] # our desired classification of `x`
ŷ = logistic(W, b, x) # [0.46 0.54]
```
The network takes a set of 10 features (`x`, a row vector) and produces a classification `ŷ`, equivalent to a probability of true vs false. `softmax` scales the output to sum to one, so that we can interpret it as a probability distribution.
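For reference, `softmax` on a row vector can be written in one line. This is just a plain-Julia sketch of the idea (the name `mysoftmax` is ours, and real implementations are more careful about numerical stability):
```julia
mysoftmax(x) = exp.(x) ./ sum(exp.(x))
mysoftmax([1.0 2.0]) # [0.269 0.731] — positive and sums to one
```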
We can use MXNet and get gradients:
```julia
logisticm = mxnet(logistic)
logisticm(W, b, x) # [0.46 0.54]
back!(logisticm, [0.1 -0.1], W, b, x) # (dW, db, dx)
```
The gradient `[0.1 -0.1]` says that we want to increase `ŷ[1]` and decrease `ŷ[2]` to get closer to `y`. `back!` gives us the tweaks we need to make to each input (`W`, `b`, `x`) in order to do this. If we add these tweaks to `W` and `b`, the model will predict `ŷ` more accurately.
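To make that concrete, here's one update step done by hand. This is only a sketch: the step size `0.1` is an arbitrary choice, and `update!` below automates all of it:
```julia
dW, db, dx = back!(logisticm, [0.1 -0.1], W, b, x)
W, b = W + 0.1*dW, b + 0.1*db # add a small multiple of each tweak
logisticm(W, b, x)            # ŷ should now be a little closer to y
```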
Treating parameters like `W` and `b` as inputs can get unwieldy in larger networks. Since they are both global we can use them directly:
```julia
@net logistic(x) = softmax(x * W .+ b)
```
However, this gives us a problem: how do we get their gradients?
Flux solves this with the `Param` wrapper:
```julia
W = param(randn(10, 2))
b = param(randn(1, 2))
@net logistic(x) = softmax(x * W .+ b)
```
This works as before, but now `W.x` stores the real value and `W.Δx` stores its gradient, so we don't have to manage it by hand. We can even use `update!` to apply the gradients automatically.
```julia
logisticm = mxnet(logistic)
logisticm(x) # [0.46 0.54]
back!(logisticm, [-1 1], x)
update!(logisticm, 0.1)
logisticm(x) # [0.51 0.49]
```
Our network got a little closer to the target `y`. Now we just need to repeat this millions of times.
*Side note:* We obviously need a way to calculate the "tweak" `[0.1 -0.1]` automatically. We can use a loss function like *mean squared error* for this:
```julia
# How wrong is ŷ?
mse([0.46 0.54], [1 0]) == 0.292
# What change to `ŷ` will reduce the wrongness?
back!(mse, -1, [0.46 0.54], [1 0]) == [0.54 -0.54]
```
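Putting these pieces together, a training loop might look something like the sketch below. Here `data` (a collection of `(x, y)` pairs), the epoch count and the learning rate are all placeholders, and the sign convention follows the `[-1 1]` update example above:
```julia
η = 0.1
for epoch in 1:10
  for (x, y) in data
    ŷ = logisticm(x)
    Δ = back!(mse, 1, ŷ, y) # gradient of the loss w.r.t. ŷ (cf. the `[-1 1]` above)
    back!(logisticm, Δ, x)  # store the parameter gradients in W.Δx and b.Δx
    update!(logisticm, η)   # nudge the parameters, as in the example above
  end
end
```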
## Layers
@ -61,8 +109,8 @@ Bigger networks contain many affine transformations like `W * x + b`. We don't w
```julia
function create_affine(in, out)
  W = param(randn(out,in))
  b = param(randn(out))
  @net x -> W * x + b
end
@ -76,8 +124,8 @@ Flux has a [more powerful syntax](templates.html) for this pattern, but also pro
affine1 = Affine(10, 5)
affine2 = Affine(10, 5)
softmax(affine1(x)) # [0.167952 0.186325 0.176683 0.238571 0.23047]
softmax(affine2(x)) # [0.125361 0.246448 0.21966 0.124596 0.283935]
```
## Combining Layers
@ -104,8 +152,6 @@ mymodel3 = Chain(
Affine(5, 5), softmax)
```
You now know enough to take a look at the [logistic regression](../examples/logreg.md) example, if you haven't already.
## Dressed like a model
We noted above that a model is a function with trainable parameters. Normal functions like `exp` are actually models too; they just happen to have 0 parameters. Flux doesn't care, and anywhere that you use one, you can use the other. For example, `Chain` will happily work with regular functions:
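For instance (a sketch: `double` stands in for any ordinary function, and the layer sizes are chosen arbitrarily):
```julia
double(x) = 2 .* x # no parameters, but still a perfectly good layer
m = Chain(Affine(10, 20), double, Affine(20, 5), softmax)
m(rand(1, 10))     # a 1×5 row of probabilities
```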


@ -74,55 +74,3 @@ end
```
The only unfamiliar part is that we have to define all of the parameters of the LSTM upfront, which adds a few lines at the beginning.
Flux's very mathematical notation generalises well to more complex models. For example, [this neural translation model with alignment](https://arxiv.org/abs/1409.0473) can be fairly straightforwardly, and recognisably, translated from the paper into Flux code:
```julia
# A recurrent model which takes a token and returns a context-dependent
# annotation.
@net type Encoder
  forward
  backward
  token -> hcat(forward(token), backward(token))
end
Encoder(in::Integer, out::Integer) =
  Encoder(LSTM(in, out÷2), flip(LSTM(in, out÷2)))
# A recurrent model which takes a sequence of annotations, attends, and returns
# a predicted output token.
@net type Decoder
  attend
  recur
  state; y; N
  function (anns)
    energies = map(ann -> exp(attend(hcat(state{-1}, ann))[1]), seq(anns, N))
    weights = energies ./ sum(energies)
    ctx = sum(map((α, ann) -> α .* ann, weights, anns))
    (_, state), y = recur((state{-1}, y{-1}), ctx)
    y
  end
end
Decoder(in::Integer, out::Integer; N = 1) =
  Decoder(Affine(in+out, 1),
          unroll1(LSTM(in, out)),
          param(zeros(1, out)), param(zeros(1, out)), N)
# The model
Nalpha = 5 # The size of the input token vector
Nphrase = 7 # The length of (padded) phrases
Nhidden = 12 # The size of the hidden state
encode = Encoder(Nalpha, Nhidden)
decode = Chain(Decoder(Nhidden, Nhidden, N = Nphrase), Affine(Nhidden, Nalpha), softmax)
model = Chain(
  unroll(encode, Nphrase, stateful = false),
  unroll(decode, Nphrase, stateful = false, seq = false))
```
Note that this model exercises some of the more advanced parts of the compiler and isn't stable for general use yet.


@ -1,28 +1,24 @@
# Model Templates
*... Calculating Tax Expenses ...*
We mentioned that we could factor out the repetition of defining affine layers with something like:
```julia
function create_affine(in, out)
  W = param(randn(out,in))
  b = param(randn(out))
  @net x -> W * x + b
end
```
`@net type` syntax provides a shortcut for this:
```julia
@net type MyAffine
  W
  b
  x -> x * W + b
end
# Convenience constructor
MyAffine(in::Integer, out::Integer) =
  MyAffine(randn(in, out), randn(1, out))
@ -32,21 +28,12 @@ model = Chain(MyAffine(5, 5), MyAffine(5, 5))
model(x1) # [-1.54458,0.492025,0.88687,1.93834,-4.70062]
```
This is almost exactly how `Affine` is defined in Flux itself. Using `@net type` gives us some extra conveniences:
* It creates default constructor `MyAffine(::AbstractArray, ::AbstractArray)` which initialises `param`s for us;
* It subtypes `Flux.Model` to explicitly mark this as a model;
* We can easily define custom constructors or instantiate `Affine` with arbitrary weights of our choosing;
* We can dispatch on the `Affine` type, for example to override how it gets converted to MXNet, or to hook into shape inference.
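For example, all of the following work together (a sketch; the names `layer1`/`layer2` and the sizes are ours):
```julia
layer1 = MyAffine(10, 5)                     # the convenience constructor above
layer2 = MyAffine(randn(10, 5), randn(1, 5)) # or arbitrary weights of our choosing
layer1(rand(1, 10))                          # a 1×5 output row
```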
## Models in templates
@ -63,20 +50,6 @@ The above code is almost exactly how `Affine` is defined in Flux itself! There's
end
```
Just as above, this is roughly equivalent to writing:
```julia
type TLP
  first
  second
end
function (self::TLP)(x)
  l1 = σ(self.first(x))
  l2 = softmax(self.second(l1))
end
```
Clearly, the `first` and `second` parameters are not arrays here, but should be models themselves, and produce a result when called with an input array `x`. The `Affine` layer fits the bill, so we can instantiate `TLP` with two of them:
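A sketch of that instantiation, with sizes chosen to line up with the `Chain` comparison just below:
```julia
model = TLP(Affine(10, 20), Affine(20, 15))
x1 = rand(1, 10)
model(x1) # a 1×15 row of probabilities
```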
```julia
@ -94,36 +67,6 @@ Chain(
Affine(20, 15), softmax)
```
given that it's just a sequence of calls. For simple networks `Chain` is completely fine, although the `@net` version is more powerful as we can (for example) reuse the output `l1` more than once.
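For example, here is a sketch of a model that uses `l1` twice, once as the input to the second layer and once in the output, which a plain `Chain` can't express (the name `TwoHead` and the sizes are made up):
```julia
@net type TwoHead
  first
  second
  function (x)
    l1 = σ(first(x))
    hcat(l1, softmax(second(l1))) # `l1` feeds both the second layer and the output
  end
end
model = TwoHead(Affine(10, 20), Affine(20, 15))
model(rand(1, 10)) # a 1×35 row: the hidden layer next to the prediction
```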
## Constructors
`Affine` has two array parameters, `W` and `b`. Just like any other Julia type, it's easy to instantiate an `Affine` layer with parameters of our choosing:
```julia
a = Affine(rand(10, 20), rand(1, 20))
```
However, for convenience and to avoid errors, we'd probably rather specify the input and output dimension instead:
```julia
a = Affine(10, 20)
```
This is easy to implement using the usual Julia syntax for constructors:
```julia
Affine(in::Integer, out::Integer) =
  Affine(randn(in, out), randn(1, out))
```
In practice, these constructors tend to take the parameter initialisation function as an argument so that it's more easily customisable, and use `Flux.initn` by default (which is equivalent to `randn(...)/100`). So `Affine`'s constructor really looks like this:
```julia
Affine(in::Integer, out::Integer; init = initn) =
  Affine(init(in, out), init(1, out))
```
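That keyword makes it easy to plug in a different initialisation when constructing a layer; for example (a sketch):
```julia
Affine(10, 20)               # weights drawn from Flux.initn
Affine(10, 20, init = zeros) # start from all-zero weights instead
```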
## Supported syntax
The syntax used to define a forward pass like `x -> x*W + b` behaves exactly like Julia code for the most part. However, it's important to remember that it's defining a dataflow graph, not a general Julia expression. In practice this means that anything side-effectful, or things like control flow and `println`s, won't work as expected. In future we'll continue to expand support for Julia syntax and features.