Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures
Julien Launay 1,2   Iacopo Poli 1   François Boniface 1   Florent Krzakala 1,2
1 LightOn 2 École Normale Supérieure
{julien, iacopo, francois, florent}@lighton.ai

arXiv:2006.12878v1 [stat.ML] 23 Jun 2020
Abstract
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.
1 Introduction
While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements, it is not without pitfalls. For one, its weight updates are non-local and rely on upstream layers. Thus, they cannot be easily parallelized [3], incurring important memory and compute costs. Moreover, its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward path: this is implausible in biological brains, and known as the weight transport problem [6].

Consequently, alternative training algorithms have been developed. Some of these algorithms are explicitly biologically inspired [7–13], while others focus on making better use of available compute resources [3,14–19]. Despite these enticing characteristics, none has been widely adopted, as they are often demonstrated on a limited set of tasks. Moreover, as assessed in [20], their performance on challenging datasets under the constraint of synaptic asymmetry is disappointing.

We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment (DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural view synthesis and recommender systems, to geometric learning with graph convolutions, and natural language processing with Transformers. Our results define new standards for learning without weight transport and show that challenging tasks can indeed be tackled under synaptic asymmetry.

All code needed to reproduce our experiments is available at https://github.com/lightonai/dfa-scales-to-modern-deep-learning.
1.1 Related work
Training a neural network is a credit assignment problem: an update is derived for each parameter from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21].

Biologically motivated methods   Finding a training method applicable under the constraints of biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur [22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10–12] train distinct feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the transpose of the forward weights used in the backward pass by a random matrix. Throughout training, the forward weights learn to align with the arbitrary backward weights, eventually approximating BP.
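As a concrete illustration, a minimal NumPy sketch of one FA training step on a single-hidden-layer network with a squared loss is given below; it is not taken from the paper's code, and the layer sizes, tanh non-linearity, and learning rate are arbitrary illustrative choices. The only departure from BP is that a fixed random matrix B stands in for the transposed forward weights in the backward pass.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 32, 64, 10                   # illustrative sizes, not from the paper
W1 = 0.1 * rng.standard_normal((d_hid, d_in))     # forward weights, trained
W2 = 0.1 * rng.standard_normal((d_out, d_hid))
B  = 0.1 * rng.standard_normal((d_hid, d_out))    # fixed random feedback weights, never trained

x = rng.standard_normal(d_in)                     # toy input and target
y = rng.standard_normal(d_out)
lr = 1e-2

# Forward pass
a1 = W1 @ x
h1 = np.tanh(a1)
y_hat = W2 @ h1
e = y_hat - y                                     # output error (gradient of the squared loss)

# Backward pass: B replaces W2.T, so no weight transport is required
dh1 = (B @ e) * (1.0 - h1 ** 2)                   # tanh'(a1) = 1 - tanh(a1)^2

# Updates
W2 -= lr * np.outer(e, h1)
W1 -= lr * np.outer(dh1, x)

With BP, dh1 would instead be computed as (W2.T @ e) * (1.0 - h1 ** 2); everything else is unchanged.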
Beyond biological considerations   As deep learning models grow bigger, large-scale distributed training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass, updates must only depend on local quantities. Unsupervised learning is naturally suited for this, as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly, synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES) [16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA and directly projects a global error to each layer. A shared feedback path is still needed, but it only depends on a simple random projection operation.
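To make this explicit, below is a minimal NumPy sketch (not the paper's implementation) of one DFA step on a small multi-layer perceptron: the global output error is projected to every hidden layer through its own fixed random matrix, so each update depends only on the error and on locally available activations, and all layers can be updated in parallel once the error is known. Layer sizes, the tanh non-linearity, and the learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sizes = [32, 64, 64, 10]                               # illustrative layer sizes, not from the paper
W = [0.1 * rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
# One fixed random feedback matrix per hidden layer, projecting the global error back to it
B = [0.1 * rng.standard_normal((o, sizes[-1])) for o in sizes[1:-1]]

x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])
lr = 1e-2

# Forward pass: a_i = W_i h_{i-1}, h_i = tanh(a_i); the output layer is kept linear
h = [x]
for i, W_i in enumerate(W):
    a = W_i @ h[-1]
    h.append(a if i == len(W) - 1 else np.tanh(a))
e = h[-1] - y                                          # global error (gradient of the squared loss)

# DFA updates: hidden layers receive B_i @ e instead of a gradient backpropagated layer by layer
for i, W_i in enumerate(W):
    if i == len(W) - 1:
        delta = e                                      # output layer: identical to BP
    else:
        delta = (B[i] @ e) * (1.0 - h[i + 1] ** 2)     # tanh' at layer i
    W_i -= lr * np.outer(delta, h[i])                  # in-place update of W[i]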
Performance of alternative methods   Local training methods are successful in unsupervised learning [18]. Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet [14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment techniques to perform well on challenging datasets, some form of weight transport is necessary: either by explicitly sharing sign information [24–26], or by introducing dedicated phases of alignment for the forward and backward weights where some information is shared [27]. To the best of our knowledge, no method that avoids weight transport altogether has ever been demonstrated on challenging tasks.
1.2 Motivations and contributions
We focus on DFA, a compromise between biological and computational considerations. Notably, DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly preventing learning in demanding settings. Moreover, it allows for asynchronous weight updates, and puts a single operation at the center of the training stage. This enables new classes of training co-processors [28,29], leveraging dedicated hardware to perform the random projection.
Extensive survey   We apply DFA in a large variety of settings matching current trends in machine learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly different domains, across eight tasks, and with eleven different architectures. This constitutes a survey of unprecedented scale for an alternative training method, and makes a strong case for the possibility of learning without weight transport in demanding scenarios.
Challenging settings   We demonstrate the ability of DFA to tackle challenging tasks. We successfully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale (section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, which have only recently been successfully tackled with deep learning.
Modern architectures   We prove that the previously established failure of DFA to train convolutions does not generalize. By evaluating performance metrics, comparing against a shallow baseline, measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in layers involving graph convolutions and attention. This significantly broadens the applicability of DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.
2 Methods
Forward pass   In a fully connected network, at layer i out of N, neglecting its biases, with W_i its weight matrix, f_i its non-linearity, and h_i its activations, the forward pass is:

∀i ∈ [1, …, N]:  a_i = W_i h_{i−1},  h_i = f_i(a_i)
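As a sanity check of the notation, a minimal NumPy sketch of this forward pass is given below; tanh is an assumed choice for every f_i, and biases are omitted as in the text.

import numpy as np

def forward(weights, h0, f=np.tanh):
    # weights: [W_1, ..., W_N]; h0: input activations h_0; f: assumed shared non-linearity
    h = h0
    activations = []
    for W_i in weights:
        a_i = W_i @ h       # a_i = W_i h_{i-1}
        h = f(a_i)          # h_i = f_i(a_i)
        activations.append((a_i, h))
    return activations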