Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures

Julien Launay 1,2   Iacopo Poli 1   François Boniface 1   Florent Krzakala 1,2

1 LightOn   2 École Normale Supérieure
{julien, iacopo, francois, florent}@lighton.ai

Abstract
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.

1 Introduction
While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements, it is not without pitfalls. For one, its weight updates are non-local and rely on upstream layers. Thus, they cannot be easily parallelized [3], incurring significant memory and compute costs. Moreover, its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward paths: this is implausible in biological brains, and known as the weight transport problem [6].

Consequently, alternative training algorithms have been developed. Some of these algorithms are explicitly biologically inspired [7–13], while others focus on making better use of available compute resources [3,14–19]. Despite these enticing characteristics, none has been widely adopted, as they are often demonstrated only on a limited set of tasks. Moreover, as assessed in [20], their performance on challenging datasets under the constraint of synaptic asymmetry is disappointing.

We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment (DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural view synthesis and recommender systems, to geometric learning with graph convolutions, and natural language processing with Transformers. Our results define new standards for learning without weight transport and show that challenging tasks can indeed be tackled under synaptic asymmetry.

All code needed to reproduce our experiments is available at https://github.com/lightonai/dfa-scales-to-modern-deep-learning.


1.1 Related work

Training a neural network is a credit assignment problem: an update is derived for each parameter from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21].

Biologically motivated methods Finding a training method applicable under the constraints of biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur [22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10–12] train distinct feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the transpose of the forward weights used in the backward pass by a random matrix. Throughout training, the forward weights learn to align with the arbitrary backward weights, eventually approximating BP.

Beyond biological considerations As deep learning models grow bigger, large-scale distributed training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass, updates must only depend on local quantities. Unsupervised learning is naturally suited for this, as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly, synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES) [16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA and directly projects a global error to each layer. A shared feedback path is still needed, but it only depends on a simple random projection operation.
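
To make the contrast between BP, FA, and DFA concrete, the sketch below computes the error signal delivered to each hidden layer of a toy fully connected network under the three schemes. This is a minimal illustration under assumed layer sizes, a ReLU non-linearity, and made-up variable names; it is not the implementation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 784 -> 256 -> 128 -> 10, biases neglected.
sizes = [784, 256, 128, 10]
Ws = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])

# Forward pass through the two hidden layers, keeping pre-activations.
h, pre_acts = x, []
for W in Ws[:-1]:
    a = W @ h
    pre_acts.append(a)
    h = np.maximum(a, 0.0)          # ReLU
y_hat = Ws[-1] @ h                   # output logits

# Global error at the output (e.g. softmax(y_hat) - one_hot(target)).
e = rng.standard_normal(sizes[-1])

def d_relu(a):
    return (a > 0).astype(float)

# BP: the error is chained backward through the transposes of the forward weights.
delta2_bp = (Ws[2].T @ e) * d_relu(pre_acts[1])
delta1_bp = (Ws[1].T @ delta2_bp) * d_relu(pre_acts[0])

# FA: each transpose is replaced by a fixed random matrix, but the signal is
# still propagated layer by layer through the backward chain.
B2 = rng.standard_normal((128, 10)) * 0.01
B1 = rng.standard_normal((256, 128)) * 0.01
delta2_fa = (B2 @ e) * d_relu(pre_acts[1])
delta1_fa = (B1 @ delta2_fa) * d_relu(pre_acts[0])

# DFA: the global error e is projected directly to every hidden layer through
# fixed random matrices, so the layer updates no longer depend on one another.
B2_dfa = rng.standard_normal((128, 10)) * 0.01
B1_dfa = rng.standard_normal((256, 10)) * 0.01
delta2_dfa = (B2_dfa @ e) * d_relu(pre_acts[1])
delta1_dfa = (B1_dfa @ e) * d_relu(pre_acts[0])
```

In all three schemes the weight update keeps the usual outer-product form (e.g. delta1 x^T for the first layer and delta2 h_1^T for the second); only the recipe producing each delta differs, which is why DFA decouples the layers while BP and FA do not.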

Performance of alternative methods Local training methods are successful in unsupervised learning [18]. Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet [14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment techniques to perform well on challenging datasets, some form of weight transport is necessary: either by explicitly sharing sign information [24–26], or by introducing dedicated phases of alignment for the forward and backward weights where some information is shared [27]. To the best of our knowledge, no method avoiding weight transport has ever been demonstrated on challenging tasks.

1.2 Motivations and contributions

We focus on DFA, a compromise between biological and computational considerations. Notably, DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly preventing learning in demanding settings. Moreover, it allows for asynchronous weight updates, and puts a single operation at the center of the training stage. This enables new classes of training co-processors [28,29], leveraging dedicated hardware to perform the random projection.

Extensive survey We apply DFA in a large variety of settings matching current trends in machine learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly different domains, across eight tasks, and with eleven different architectures. This constitutes a survey of unprecedented scale for an alternative training method, and makes a strong case for the possibility of learning without weight transport in demanding scenarios.

Challenging settings We demonstrate the ability of DFA to tackle challenging tasks. We successfully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale (section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, which have only recently been tackled successfully with deep learning.

Modern architectures We prove that the previously established failure of DFA to train convolutions does not generalize. By evaluating performance metrics, comparing against a shallow baseline, measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in layers involving graph convolutions and attention. This significantly broadens the applicability of DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.

2 Methods

Forward pass In a fully connected network, at layer i out of N, neglecting its biases, with W_i its weight matrix, f_i its non-linearity, and h_i its activations, the forward pass is:

\forall i \in [1, \dots, N]: \quad a_i = W_i h_{i-1}, \quad h_i = f_i(a_i),

with h_0 = x the input to the network.
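
As a concrete reading of this notation, here is a minimal NumPy sketch of the forward pass; the layer sizes, the choice of ReLU for f_i, and the variable names are illustrative assumptions, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fully connected network with N = 3 layers, biases neglected.
sizes = [784, 512, 256, 10]
Ws = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]

def f(a):
    # Non-linearity f_i; ReLU is used here purely for illustration.
    return np.maximum(a, 0.0)

x = rng.standard_normal(sizes[0])

# a_i = W_i h_{i-1}, h_i = f_i(a_i), with h_0 = x.
h = x
for W in Ws:
    a = W @ h
    h = f(a)
print(h.shape)  # (10,): activations of the final layer
```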