diff --git a/latest/apis/backends.html b/latest/apis/backends.html
index 8f76d9d5..c665c96b 100644
--- a/latest/apis/backends.html
+++ b/latest/apis/backends.html
@@ -150,7 +150,7 @@ Backends
-
+
diff --git a/latest/apis/batching.html b/latest/apis/batching.html
index 960ba30c..3ec4f4f6 100644
--- a/latest/apis/batching.html
+++ b/latest/apis/batching.html
@@ -155,7 +155,7 @@ Batching
-
+
diff --git a/latest/apis/storage.html b/latest/apis/storage.html
index 71db3cf0..eca52ecc 100644
--- a/latest/apis/storage.html
+++ b/latest/apis/storage.html
@@ -139,7 +139,7 @@ Storing Models
-
+
diff --git a/latest/contributing.html b/latest/contributing.html
index c00b8b86..5ace262f 100644
--- a/latest/contributing.html
+++ b/latest/contributing.html
@@ -136,7 +136,7 @@ Contributing & Help
-
+
diff --git a/latest/examples/char-rnn.html b/latest/examples/char-rnn.html
index 74a24473..41af6884 100644
--- a/latest/examples/char-rnn.html
+++ b/latest/examples/char-rnn.html
@@ -139,7 +139,7 @@ Char RNN
-
+
diff --git a/latest/examples/logreg.html b/latest/examples/logreg.html
index 621dfa68..ef7c6e7c 100644
--- a/latest/examples/logreg.html
+++ b/latest/examples/logreg.html
@@ -139,7 +139,7 @@ Simple MNIST
-
+
diff --git a/latest/index.html b/latest/index.html
index f89c2c9f..a28871ad 100644
--- a/latest/index.html
+++ b/latest/index.html
@@ -147,7 +147,7 @@ Home
-
+
@@ -204,10 +204,10 @@
The examples
- give a feel for high-level usage. This a great way to start if you're a relative newbie to machine learning or neural networks; you can get up and running running easily.
+ give a feel for high-level usage.
-If you have more experience with ML, or you just don't want to see
+If you want to know why Flux is unique, or just don't want to see
those digits
@@ -215,7 +215,14 @@ those digits
model building guide
- instead. The guide attempts to show how Flux's abstractions are built up and why it's powerful, but it's not all necessary to get started.
+ instead.
+Flux is meant to be played with. These docs have lots of code snippets; try them out in
+Juno
+!
Using MXNet, we can get the gradient of the function, too:
-back!(f_mxnet, [1,1,1], [1,2,3]) == ([2.0, 4.0, 6.0])
+back!(f_mxnet, [1,1,1], [1,2,3]) == ([2.0, 4.0, 6.0],)
f
 is effectively x², so the gradient is 2x as expected.
@@ -225,9 +225,6 @@ Using MXNet, we can get the gradient of the function, too:
-
-For TensorFlow users this may seem similar to building a graph as usual. The difference is that Julia code still behaves like Julia code. Error messages give you helpful stacktraces that pinpoint mistakes. You can step through the code in the debugger. The code runs when it's called, as usual, rather than running once to build the graph and then again to execute it.
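To see the whole round trip in one place, here is a minimal sketch; it assumes f is the element-wise squaring function (consistent with the 2x gradient above), so the @net definition shown is illustrative rather than taken from the diff.
@net f(x) = x .* x            # the model whose gradient we took above

f_mxnet = mxnet(f)            # run it on the MXNet backend
f_mxnet([1, 2, 3])            # [1.0, 4.0, 9.0]

# back! takes the model, the output sensitivity and the original input,
# and returns a tuple with one gradient per input:
back!(f_mxnet, [1, 1, 1], [1, 2, 3])  # ([2.0, 4.0, 6.0],)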
We can implement a model however we like as long as it fits this interface. But as hinted above,
@net
- is a particularly easy way to do it, as
-@net
- functions are models already.
+ is a particularly easy way to do it, because it gives you these functions for free.
@net
.
-W = randn(3,5)
-b = randn(3)
-@net logistic(x) = softmax(W * x + b)
+@net logistic(W, b, x) = softmax(x * W .+ b)
-x1 = rand(5) # [0.581466,0.606507,0.981732,0.488618,0.415414]
-y1 = logistic(x1) # [0.32676,0.0974173,0.575823]
+W = randn(10, 2)
+b = randn(1, 2)
+x = rand(1, 10) # [0.563 0.346 0.780 …] – fake data
+y = [1 0] # our desired classification of `x`
+
+ŷ = logistic(W, b, x) # [0.46 0.54]
-<!-- TODO -->
+The network takes a set of 10 features (
+x
+, a row vector) and produces a classification
+ŷ
+, equivalent to a probability of true vs false.
+softmax
+ scales the output to sum to one, so that we can interpret it as a probability distribution.
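For a quick numeric illustration (nothing Flux-specific, just what softmax does to a row vector):
softmax([1.0 2.0])   # ≈ [0.269 0.731] – positive entries that sum to one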
+We can use MXNet and get gradients:
+logisticm = mxnet(logistic)
+logisticm(W, b, x) # [0.46 0.54]
+back!(logisticm, [0.1 -0.1], W, b, x) # (dW, db, dx)
+
+The gradient
+[0.1 -0.1]
+ says that we want to increase
+ŷ[1]
+ and decrease
+ŷ[2]
+ to get closer to
+y
+.
+back!
+ gives us the tweaks we need to make to each input (
+W
+,
+b
+,
+x
+) in order to do this. If we add these tweaks to
+W
+ and
+b
+ it will predict
+ŷ
+ more accurately.
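Concretely, a rough sketch of applying those tweaks by hand, reusing logisticm, W, b and x from above (taking the full step returned by back! is arbitrary; normally you would scale it down):
dW, db, dx = back!(logisticm, [0.1 -0.1], W, b, x)

W += dW               # nudge the weights in the suggested direction
b += db

logisticm(W, b, x)    # ŷ[1] should now be a little higher than before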
+
+Treating parameters like
+W
+ and
+b
+ as inputs can get unwieldy in larger networks. Since they are both global, we can use them directly:
+
@net logistic(x) = softmax(x * W .+ b)
+However, this gives us a problem: how do we get their gradients?
+
+Flux solves this with the
+Param
+ wrapper:
+
W = param(randn(10, 2))
+b = param(randn(1, 2))
+@net logistic(x) = softmax(x * W .+ b)
+
+This works as before, but now
+W.x
+ stores the real value and
+W.Δx
+ stores its gradient, so we don't have to manage it by hand. We can even use
+update!
+ to apply the gradients automatically.
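For instance, a quick sketch of poking at the wrapper directly (using the W defined just above):
W.x    # the current value of the weights
W.Δx   # their gradient, filled in by back! and consumed by update!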
+
logisticm(x) # [0.46, 0.54]
+
+back!(logisticm, [-1 1], x)
+update!(logisticm, 0.1)
+
+logisticm(x) # [0.51, 0.49]
+
+Our network got a little closer to the target
+y
+. Now we just need to repeat this millions of times.
+
+
+Side note:
+
+ We obviously need a way to calculate the "tweak"
+[0.1, -0.1]
+ automatically. We can use a loss function like
+
+mean squared error
+
+ for this:
+
# How wrong is ŷ?
+mse([0.46, 0.54], [1, 0]) == 0.292
+# What change to `ŷ` will reduce the wrongness?
+back!(mse, -1, [0.46, 0.54], [1, 0]) == [0.54 -0.54]
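Putting those pieces together, a hand-rolled training loop might look roughly like the sketch below. For clarity it reuses the explicit-parameter logistic(W, b, x) from earlier (rather than the Param version), so the update is just ordinary array arithmetic.
@net logistic(W, b, x) = softmax(x * W .+ b)
logisticm = mxnet(logistic)

W, b = randn(10, 2), randn(1, 2)
x, y = rand(1, 10), [1 0]

for i in 1:1000
  ŷ = logisticm(W, b, x)
  Δ = back!(mse, -1, ŷ, y)                   # which way should ŷ move?
  dW, db, _ = back!(logisticm, Δ, W, b, x)   # which way should W and b move?
  W += 0.1 * dW                              # take a small step in that direction
  b += 0.1 * db
end

mse(logisticm(W, b, x), y)   # should now be much smaller than at the start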
function create_affine(in, out)
- W = randn(out,in)
- b = randn(out)
+ W = param(randn(out,in))
+ b = param(randn(out))
@net x -> W * x + b
end
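For example, a quick usage sketch (the sizes are arbitrary; the anonymous @net function returned here is itself a model we can call):
affine = create_affine(5, 3)
affine(rand(5))   # a 3-element output, since W is 3×5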
@@ -304,8 +389,8 @@ more powerful syntax
affine1 = Affine(5, 5)
affine2 = Affine(5, 5)
-softmax(affine1(x1)) # [0.167952, 0.186325, 0.176683, 0.238571, 0.23047]
-softmax(affine2(x1)) # [0.125361, 0.246448, 0.21966, 0.124596, 0.283935]
+softmax(affine1(x)) # [0.167952 0.186325 0.176683 0.238571 0.23047]
+softmax(affine2(x)) # [0.125361 0.246448 0.21966 0.124596 0.283935]
mymodel3 = Chain(
Affine(5, 5), σ,
Affine(5, 5), softmax)
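As a quick sanity check (a sketch; the layers above expect 5 inputs, so we feed a 1×5 row vector):
xs = rand(1, 5)   # hypothetical input row
mymodel3(xs)      # a 1×5 row that sums to one, thanks to the final softmax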
-You now know enough to take a look at the
-logistic regression
- example, if you haven't already.
The only unfamiliar part is that we have to define all of the parameters of the LSTM upfront, which adds a few lines at the beginning.
-Flux's very mathematical notation generalises well to handling more complex models. For example,
-this neural translation model with alignment
- can be fairly straightforwardly, and recognisably, translated from the paper into Flux code:
-# A recurrent model which takes a token and returns a context-dependent
-# annotation.
-
-@net type Encoder
- forward
- backward
- token -> hcat(forward(token), backward(token))
-end
-
-Encoder(in::Integer, out::Integer) =
- Encoder(LSTM(in, out÷2), flip(LSTM(in, out÷2)))
-
-# A recurrent model which takes a sequence of annotations, attends, and returns
-# a predicted output token.
-
-@net type Decoder
- attend
- recur
- state; y; N
- function (anns)
- energies = map(ann -> exp(attend(hcat(state{-1}, ann))[1]), seq(anns, N))
- weights = energies./sum(energies)
- ctx = sum(map((α, ann) -> α .* ann, weights, anns))
- (_, state), y = recur((state{-1},y{-1}), ctx)
- y
- end
-end
-
-Decoder(in::Integer, out::Integer; N = 1) =
- Decoder(Affine(in+out, 1),
- unroll1(LSTM(in, out)),
- param(zeros(1, out)), param(zeros(1, out)), N)
-
-# The model
-
-Nalpha = 5 # The size of the input token vector
-Nphrase = 7 # The length of (padded) phrases
-Nhidden = 12 # The size of the hidden state
-
-encode = Encoder(Nalpha, Nhidden)
-decode = Chain(Decoder(Nhidden, Nhidden, N = Nphrase), Affine(Nhidden, Nalpha), softmax)
-
-model = Chain(
- unroll(encode, Nphrase, stateful = false),
- unroll(decode, Nphrase, stateful = false, seq = false))
-Note that this model exercises some of the more advanced parts of the compiler and isn't stable for general use yet.
-... Calculating Tax Expenses ...
+We mentioned that we could factor out the repetition of defining affine layers with something like:
+function create_affine(in, out)
+ W = param(randn(out,in))
+ b = param(randn(out))
+ @net x -> W * x + b
+end
-So how does the
-Affine
- template work? We don't want to duplicate the code above whenever we need more than one affine layer:
+The
+@net type
+ syntax provides a shortcut for this:
W₁, b₁ = randn(...)
-affine₁(x) = W₁*x + b₁
-W₂, b₂ = randn(...)
-affine₂(x) = W₂*x + b₂
-model = Chain(affine₁, affine₂)
-Here's one way we could solve this: just keep the parameters in a Julia type, and define how that type acts as a function:
-type MyAffine
+@net type MyAffine
W
b
+ x -> x * W + b
end
-# Use the `MyAffine` layer as a model
-(l::MyAffine)(x) = l.W * x + l.b
-
# Convenience constructor
MyAffine(in::Integer, out::Integer) =
MyAffine(randn(out, in), randn(out))
@@ -203,40 +190,44 @@ model = Chain(MyAffine(5, 5), MyAffine(5, 5))
model(x1) # [-1.54458,0.492025,0.88687,1.93834,-4.70062]
-This is much better: we can now make as many affine layers as we want. This is a very common pattern, so to make it more convenient we can use the
-@net
- macro:
-
-@net type MyAffine
- W
- b
- x -> x * W + b
-end
-
-The function provided,
-x -> x * W + b
-, will be used when
-MyAffine
- is used as a model; it's just a shorter way of defining the
-(::MyAffine)(x)
- method above. (You may notice that
-W
- and
-x
- have swapped order in the model; this is due to the way batching works, which will be covered in more detail later on.)
-
-
-However,
-@net
- does not simply save us some keystrokes; it's the secret sauce that makes everything else in Flux go. For example, it analyses the code for the forward function so that it can differentiate it or convert it to a TensorFlow graph.
-
-
-The above code is almost exactly how
+This is almost exactly how
Affine
- is defined in Flux itself! There's no difference between "library-level" and "user-level" models, so making your code reusable doesn't involve a lot of extra complexity. Moreover, much more complex models than
-Affine
- are equally simple to define.
+ is defined in Flux itself. Using
+@net type
+ gives us some extra conveniences:
+
+ -
+
+It creates a default constructor
+MyAffine(::AbstractArray, ::AbstractArray)
+ which initialises
+param
+s for us (see the sketch after this list);
+
+
+ -
+
+It subtypes
+Flux.Model
+ to explicitly mark this as a model;
+
+
+ -
+
+We can easily define custom constructors or instantiate
+Affine
+ with arbitrary weights of our choosing;
+
+
+ -
+
+We can dispatch on the
+Affine
+ type, for example to override how it gets converted to MXNet, or to hook into shape inference.
+
+
+
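For example, a small sketch of the first and third points above (the sizes are arbitrary):
# The generated constructor takes the two fields directly and wraps them
# in `param` for us:
a = MyAffine(randn(5, 10), randn(1, 10))   # weights for 5 inputs and 10 outputs
a(rand(1, 5))                              # returns a 1×10 row vector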
Models in templates
@@ -255,18 +246,6 @@ Models in templates
end
end
-Just as above, this is roughly equivalent to writing:
-type TLP
- first
- second
-end
-
-function (self::TLP)(x)
- l1 = σ(self.first(x))
- l2 = softmax(self.second(l1))
-end
-
Clearly, the
first
and
@@ -289,51 +268,6 @@ You may recognise this as being equivalent to
Chain(
Affine(10, 20), σ,
Affine(20, 15), softmax)
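A quick usage sketch, assuming the TLP template from this section (a 1×10 row vector goes in, a 1×15 row of probabilities comes out):
mlp = TLP(Affine(10, 20), Affine(20, 15))
mlp(rand(1, 10))   # 1×15 row vector summing to one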
-
-given that it's just a sequence of calls. For simple networks
-Chain
- is completely fine, although the
-@net
- version is more powerful as we can (for example) reuse the output
-l1
- more than once.
-
-Affine
- has two array parameters,
-W
- and
-b
-. Just like any other Julia type, it's easy to instantiate an
-Affine
- layer with parameters of our choosing:
-
a = Affine(rand(10, 20), rand(20))
-However, for convenience and to avoid errors, we'd probably rather specify the input and output dimension instead:
-a = Affine(10, 20)
-This is easy to implement using the usual Julia syntax for constructors:
-Affine(in::Integer, out::Integer) =
- Affine(randn(in, out), randn(1, out))
-
-In practice, these constructors tend to take the parameter initialisation function as an argument so that it's more easily customisable, and use
-Flux.initn
- by default (which is equivalent to
-randn(...)/100
-). So
-Affine
-'s constructor really looks like this:
-
Affine(in::Integer, out::Integer; init = initn) =
- Affine(init(in, out), init(1, out))