Published as a conference paper at ICLR 2020
MOGRIFIER LSTM
Gábor Melis†, Tomáš Kočiský†, Phil Blunsom†‡
{melisgl,tkocisky,pblunsom}@google.com
† DeepMind, London, UK
‡ University of Oxford
ABSTRACT
Many advances in Natural Language Processing have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modelling language. In this work, we propose an extension to the venerable Long Short-Term Memory in the form of mutual gating of the current input and the previous output. This mechanism affords the modelling of a richer space of interactions between inputs and their context. Equivalently, our model can be viewed as making the transition function given by the LSTM context-dependent. Experiments demonstrate markedly improved generalization on language modelling in the range of 3–4 perplexity points on Penn Treebank and Wikitext-2, and 0.01–0.05 bpc on four character-based datasets. We establish a new state of the art on all datasets with the exception of Enwik8, where we close a large gap between the LSTM and Transformer models.
1 INTRODUCTION
The domination of Natural Language Processing by neural models is hampered only by their limited ability to generalize and questionable sample complexity (Belinkov and Bisk 2017; Jia and Liang 2017; Iyyer et al. 2018; Moosavi and Strube 2017; Agrawal et al. 2016), their poor grasp of grammar (Linzen et al. 2016; Kuncoro et al. 2018), and their inability to chunk input sequences into meaningful units (Wang et al. 2017). While direct attacks on the latter are possible, in this paper, we take a language-agnostic approach to improving Recurrent Neural Networks (RNN, Rumelhart et al. (1988)), which brought about many advances in tasks such as language modelling, semantic parsing and machine translation, with no shortage of non-NLP applications either (Bakker 2002; Mayer et al. 2008). Many neural models are built from RNNs, including the sequence-to-sequence family (Sutskever et al. 2014) and its attention-based branch (Bahdanau et al. 2014). Thus, innovations in RNN architecture tend to have a trickle-down effect from language modelling, where evaluation is often the easiest and data the most readily available, to many other tasks, a trend greatly strengthened by ULMFiT (Howard and Ruder 2018), ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018), which promote language models from architectural blueprints to pretrained building blocks.
To improve the generalization ability of language models, we propose an extension to the LSTM (Hochreiter and Schmidhuber 1997), where the LSTM’s input x is gated conditioned on the output of the previous step hprev. Next, the gated input is used in a similar manner to gate the output of the previous time step. After a couple of rounds of this mutual gating, the last updated x and hprev are fed to an LSTM. By introducing these additional gating operations, in one sense, our model joins the long list of recurrent architectures with gating structures of varying complexity which followed the invention of Elman Networks (Elman 1990). Examples include the LSTM, the GRU (Chung et al. 2015), and even designs by Neural Architecture Search (Zoph and Le 2016).
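To make the mutual gating concrete, the following is a minimal sketch in PyTorch of the update just described: before the usual LSTM computation, x and hprev repeatedly gate each other through sigmoids of linear projections of the other. The class and attribute names (MogrifierLSTMCell, mogrify, Q, R, rounds), the use of torch.nn.LSTMCell, the bias-free projections, and the plain (unscaled) sigmoid gates are illustrative choices made here; the paper's exact parameterization may differ.

import torch
import torch.nn as nn

class MogrifierLSTMCell(nn.Module):
    """Illustrative sketch, not a reference implementation: x and h_prev
    mutually gate each other for `rounds` steps, then a standard LSTM cell
    consumes the updated pair."""

    def __init__(self, input_size, hidden_size, rounds=5):
        super().__init__()
        self.rounds = rounds
        # Odd-numbered rounds gate x from h; even-numbered rounds gate h from x.
        self.Q = nn.ModuleList(nn.Linear(hidden_size, input_size, bias=False)
                               for _ in range((rounds + 1) // 2))
        self.R = nn.ModuleList(nn.Linear(input_size, hidden_size, bias=False)
                               for _ in range(rounds // 2))
        self.lstm = nn.LSTMCell(input_size, hidden_size)

    def mogrify(self, x, h):
        for i in range(1, self.rounds + 1):
            if i % 2 == 1:
                # Gate the input conditioned on the previous output.
                x = torch.sigmoid(self.Q[i // 2](h)) * x
            else:
                # Gate the previous output conditioned on the updated input.
                h = torch.sigmoid(self.R[i // 2 - 1](x)) * h
        return x, h

    def forward(self, x, state):
        h_prev, c_prev = state
        x, h_prev = self.mogrify(x, h_prev)
        # The mutually gated x and h_prev are fed to an ordinary LSTM cell.
        return self.lstm(x, (h_prev, c_prev))

With rounds set to 0 the cell reduces to a plain LSTM, which makes the extension straightforward to ablate.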
Intuitively, in the lowermost layer, the first gating step scales the input embedding (itself a representation of the average context in which the token occurs) depending on the actual context, resulting in a contextualized representation of the input. While intuitive, as Section 4 shows, this interpretation cannot account for all the observed phenomena.
In a more encompassing view, our model can be seen as enriching the mostly additive dynamics of recurrent transitions, placing it in the company of the Input Switched Affine Network (Foerster et al.
[Figure 1 diagram: alternating gating between the states h0, h2, h4 (fed to the LSTM) and the inputs x-1, x1, x3, x5.]
Figure 1: Mogrifier with 5 rounds of updates. The previous state h0 = hprev is transformed linearly (dashed arrows), fed through a sigmoid and gates x