diff --git a/latest/apis/batching.html b/latest/apis/batching.html
index 82657715..eb013471 100644
--- a/latest/apis/batching.html
+++ b/latest/apis/batching.html
@@ -197,7 +197,7 @@ Batches are represented the way we think
-about them; as an list of data points. We can do all the usual array operations with them, including getting the first with xs[1], iterating over them and so on. The trick is that under the hood, the data is batched into a single array:
+about them; as a list of data points. We can do all the usual array operations with them, including getting the first with xs[1], iterating over them and so on. The trick is that under the hood, the data is batched into a single array:
nunroll = 50
nbatch = 50
-getseqs(chars, alphabet) = sequences((onehot(Float32, char, alphabet) for char in chars), nunroll)
-getbatches(chars, alphabet) = batches((getseqs(part, alphabet) for part in chunk(chars, nbatch))...)
+getseqs(chars, alphabet) =
+ sequences((onehot(Float32, char, alphabet) for char in chars), nunroll)
+getbatches(chars, alphabet) =
+ batches((getseqs(part, alphabet) for part in chunk(chars, nbatch))...)
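Concretely, getseqs one-hot encodes each character and cuts the stream into length-nunroll sequences, while getbatches lays nbatch of those sequences side by side. As a rough plain-Julia picture of how a batch of one-hot vectors can live in a single array (the helper names below are made up for illustration, not Flux's API):
# Illustration only: plain Julia, not Flux's Batch/Seq types.
chars    = collect("the quick brown fox")
alphabet = unique(chars)

onehotvec(c) = Float32.(alphabet .== c)   # one data point: a one-hot vector
batchmat(cs) = hcat(onehotvec.(cs)...)    # a batch: one column per data point

size(batchmat(chars[1:4]))                # (length(alphabet), 4)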
Because we want the RNN to predict the next letter at each iteration, our target data is simply our input data offset by one. For example, if the input is "The quick brown fox", the target will be "he quick brown fox ". Each letter is one-hot encoded and sequences are batched together to create the training data.
-input = readstring("shakespeare_input.txt")
+input = readstring("shakespeare_input.txt");
alphabet = unique(input)
N = length(alphabet)
-Xs, Ys = getbatches(input, alphabet), getbatches(input[2:end], alphabet)
+# An iterator of (input, output) pairs
+train = zip(getbatches(input, alphabet), getbatches(input[2:end], alphabet))
+# We will evaluate the loss on a particular batch to monitor the training.
+eval = tobatch.(first(drop(train, 5)))
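The offset-by-one trick is easy to check in isolation: zipping the text with itself shifted by one character pairs every letter with the letter that follows it (plain Julia, no Flux required):
input    = "The quick brown fox"
examples = collect(zip(input, input[2:end]))  # (letter, next letter) pairs
first(examples)   # ('T', 'h')
last(examples)    # ('o', 'x')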
Creating the model and training it is straightforward:
@@ -196,7 +201,11 @@ Creating the model and training it is straightforward:
m = tf(unroll(model, nunroll))
-@time Flux.train!(m, Xs, Ys, η = 0.1, epoch = 1)
+# Call this to see how the model is doing
+evalcb = () -> @show logloss(m(eval[1]), eval[2])
+
+@time Flux.train!(m, train, η = 0.1, loss = logloss, cb = [evalcb])
+
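The cb keyword hands train! a list of callbacks to run periodically during training, which is what lets evalcb print the loss on the held-out batch as the model learns. A hand-rolled sketch of that pattern (not Flux's actual train!; the gradient step is elided and the names are illustrative):
function toy_train!(step!, data; cb = [])
    for (i, (x, y)) in enumerate(data)
        step!(x, y)                            # one parameter update
        i % 100 == 0 && foreach(f -> f(), cb)  # run the callbacks occasionally
    end
end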
Finally, we can sample the model. For sampling we remove the softmax
@@ -204,9 +213,9 @@ Finally, we can sample the model. For sampling we remove the
function sample(model, n, temp = 1)
s = [rand(alphabet)]
- m = tf(unroll(model, 1))
- for i = 1:n
- push!(s, wsample(alphabet, softmax(m(Seq((onehot(Float32, s[end], alphabet),)))[1]./temp)))
+ m = unroll1(model)
+ for i = 1:n-1
+ push!(s, wsample(alphabet, softmax(m(unsqueeze(onehot(s[end], alphabet)))./temp)[1,:]))
end
return string(s...)
end
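The ./temp inside the softmax is a temperature: values below 1 sharpen the distribution towards the most likely character, values above 1 flatten it. A standalone sketch of that sampling step, using wsample from StatsBase and made-up logits:
using StatsBase: wsample          # weighted sampling from a finite set

softmax(xs) = exp.(xs .- maximum(xs)) ./ sum(exp.(xs .- maximum(xs)))

alphabet = ['a', 'b', 'c']
logits   = [2.0, 1.0, 0.1]        # pretend these came from the model
temp     = 0.5                    # < 1 sharpens, > 1 flattens
wsample(alphabet, softmax(logits ./ temp))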
diff --git a/latest/examples/logreg.html b/latest/examples/logreg.html
index d9b1ceaf..5fd4efc3 100644
--- a/latest/examples/logreg.html
+++ b/latest/examples/logreg.html
@@ -160,6 +160,7 @@ This walkthrough example will take you through writing a multi-layer perceptron
First, we load the data using the MNIST package:
using Flux, MNIST
+using Flux: accuracy
data = [(trainfeatures(i), onehot(trainlabel(i), 0:9)) for i = 1:60_000]
train = data[1:50_000]
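Here onehot(trainlabel(i), 0:9) turns a digit label into a length-10 indicator vector. Conceptually (this is a sketch, not Flux's implementation):
# e.g. the label 5 becomes [0,0,0,0,0,1,0,0,0,0]
myonehot(label, labels = 0:9) = Float64.(labels .== label)
myonehot(5)                   # 1.0 in the sixth position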
@@ -190,7 +191,7 @@ Otherwise, the format of the data is simple enough, it's just a list of tupl
Now we define our model, which will simply be a function from one to the other.
-m = Chain(
+m = @Chain(
Input(784),
Affine(128), relu,
Affine( 64), relu,
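The hunk only shows the first layers of the chain, but the structure is just alternating affine (dense) maps and relu nonlinearities. Written out by hand in plain Julia, the part shown above amounts to something like this (illustrative only):
relu(x)      = max.(0, x)
affine(W, b) = x -> W*x .+ b

layer1 = affine(randn(128, 784), zeros(128))
layer2 = affine(randn(64, 128),  zeros(64))

hidden = relu(layer2(relu(layer1(rand(784)))))
size(hidden)   # (64,)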
@@ -200,7 +201,7 @@ model = mxnet(m) # Convert to MXNet
We can try this out on our data already:
-julia> model(data[1][1])
+julia> model(tobatch(data[1][1]))
10-element Array{Float64,1}:
0.10614
0.0850447
@@ -209,7 +210,8 @@ We can try this out on our data already:
The model gives a probability of about 0.1 to each class – which is a way of saying, "I have no idea". This isn't too surprising as we haven't shown it any data yet. This is easy to fix:
-Flux.train!(model, train, test, η = 1e-4)
+Flux.train!(model, train, η = 1e-3,
+ cb = [()->@show accuracy(m, test)])
The training step takes about 5 minutes (to make it faster we can do smarter things like batching). If you run this code in Juno, you'll see a progress meter, which you can hover over to see the remaining computation time.
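The accuracy(m, test) callback just measures how often the most probable class matches the true label. Roughly, in current Julia syntax (not necessarily Flux's own definition):
using Statistics: mean    # `mean` lives in Base on Julia 0.6

# Fraction of (prediction, one-hot target) pairs whose top class agrees.
myaccuracy(preds, targets) =
    mean(argmax(p) == argmax(t) for (p, t) in zip(preds, targets))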
@@ -231,7 +233,7 @@ Notice the class at 93%, suggesting our model is very confident about this image
julia> onecold(data[1][2], 0:9)
5
-julia> onecold(model(data[1][1]), 0:9)
+julia> onecold(model(tobatch(data[1][1])), 0:9)
5
Success!
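onecold is the inverse of onehot: it takes a probability (or indicator) vector and returns the label of the largest entry, which is why both the target and the prediction above decode to 5. A one-line sketch (myonecold is a stand-in name, not the library function):
myonecold(y, labels = 0:9) = labels[argmax(y)]   # largest entry at position 6 gives label 5
myonecold([0, 0, 0, 0, 0, 1, 0, 0, 0, 0])        # 5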
diff --git a/latest/index.html b/latest/index.html
index ea560455..709b7ce7 100644
--- a/latest/index.html
+++ b/latest/index.html
@@ -169,6 +169,16 @@ Flux aims to be an intuitive and powerful notation, close to the mathematics, th
So what's the catch? Flux is at an early "working prototype" stage; many things work but the API is still in a state of... well, it might change. If you're interested to find out what works, read on!
+
+Note: If you're using Julia v0.5 please see this version of the docs instead.
Pkg.add("MXNet") # or "TensorFlow"
Pkg.test("Flux") # Make sure everything installed properly
+
+Note: TensorFlow integration may not work properly on Julia v0.6 yet.