Published as a conference paper at ICLR 2020
MOGRIFIER LSTM
Gábor Melis†, Tomáš Kočiský†, Phil Blunsom†‡
{melisgl,tkocisky,pblunsom}@google.com
†DeepMind, London, UK
‡University of Oxford
ABSTRACT
Many advances in Natural Language Processing have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modelling language. In this work, we propose an extension to the venerable Long Short-Term Memory in the form of mutual gating of the current input and the previous output. This mechanism affords the modelling of a richer space of interactions between inputs and their context. Equivalently, our model can be viewed as making the transition function given by the LSTM context-dependent. Experiments demonstrate markedly improved generalization on language modelling in the range of 3–4 perplexity points on Penn Treebank and Wikitext-2, and 0.01–0.05 bpc on four character-based datasets. We establish a new state of the art on all datasets with the exception of Enwik8, where we close a large gap between the LSTM and Transformer models.
1 INTRODUCTION
The domination of Natural Language Processing by neural models is hampered only by their limited ability to generalize and questionable sample complexity (Belinkov and Bisk 2017; Jia and Liang 2017; Iyyer et al. 2018; Moosavi and Strube 2017; Agrawal et al. 2016), their poor grasp of grammar (Linzen et al. 2016; Kuncoro et al. 2018), and their inability to chunk input sequences into meaningful units (Wang et al. 2017). While direct attacks on the latter are possible, in this paper we take a language-agnostic approach to improving Recurrent Neural Networks (RNNs, Rumelhart et al. (1988)), which brought about many advances in tasks such as language modelling, semantic parsing, and machine translation, with no shortage of non-NLP applications either (Bakker 2002; Mayer et al. 2008). Many neural models are built from RNNs, including the sequence-to-sequence family (Sutskever et al. 2014) and its attention-based branch (Bahdanau et al. 2014). Thus, innovations in RNN architecture tend to have a trickle-down effect from language modelling, where evaluation is often the easiest and data the most readily available, to many other tasks, a trend greatly strengthened by ULMFiT (Howard and Ruder 2018), ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018), which promote language models from architectural blueprints to pretrained building blocks.
To improve the generalization ability of language models, we propose an extension to the LSTM (Hochreiter and Schmidhuber 1997), where the LSTM's input x is gated conditioned on the output of the previous step hprev. Next, the gated input is used in a similar manner to gate the output of the previous time step. After a couple of rounds of this mutual gating, the last updated x and hprev are fed to an LSTM. By introducing these additional gating operations, in one sense, our model joins the long list of recurrent architectures with gating structures of varying complexity which followed the invention of Elman Networks (Elman 1990). Examples include the LSTM, the GRU (Chung et al. 2015), and even designs by Neural Architecture Search (Zoph and Le 2016).
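
To make the mutual gating concrete, the following is a minimal NumPy sketch of the pre-LSTM update, not the authors' implementation: the weight names Q and R, the doubling of the sigmoid output (to keep the gated vectors from shrinking), and the five-round configuration are assumptions made here for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h_prev, Q, R):
    # Q gates x conditioned on h_prev (odd rounds); R gates h_prev
    # conditioned on the updated x (even rounds). len(Q) + len(R) is the
    # total number of gating rounds.
    rounds = len(Q) + len(R)
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            x = 2 * sigmoid(Q[i // 2] @ h_prev) * x
        else:
            h_prev = 2 * sigmoid(R[i // 2 - 1] @ x) * h_prev
    return x, h_prev  # fed to an ordinary LSTM cell afterwards

# Toy example with 5 rounds as in Figure 1: x is rescaled in rounds 1, 3, 5
# (producing x1, x3, x5) and h_prev in rounds 2, 4 (producing h2, h4).
d_x, d_h = 4, 3
rng = np.random.default_rng(0)
Q = [rng.normal(size=(d_x, d_h)) for _ in range(3)]
R = [rng.normal(size=(d_h, d_x)) for _ in range(2)]
x = rng.normal(size=d_x)
h_prev = rng.normal(size=d_h)
x_new, h_prev_new = mogrify(x, h_prev, Q, R)

With zero rounds the inputs pass through unchanged and the model reduces to a plain LSTM; the number of rounds is a hyperparameter of the sketch above.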
Intuitively, in the lowermost layer, the first gating step scales the input embedding (itself a representation of the average context in which the token occurs) depending on the actual context, resulting in a contextualized representation of the input. While intuitive, as Section 4 shows, this interpretation cannot account for all the observed phenomena.
In a more encompassing view, our model can be seen as enriching the mostly additive dynamics of recurrent transitions, placing it in the company of the Input Switched Affine Network (Foerster et al.
[Figure 1: diagram of the Mogrifier, showing an LSTM cell with the alternately gated previous states h0, h2, h4 and inputs x-1, x1, x3, x5.]

Figure 1: Mogrifier with 5 rounds of updates. The previous state h0 = hprev is transformed linearly (dashed arrows), fed through a sigmoid and gates x