Direct Feedback Alignment Scales to
Modern Deep Learning Tasks and Architectures
Julien Launay 1,2, Iacopo Poli 1, François Boniface 1, Florent Krzakala 1,2
1 LightOn, 2 École Normale Supérieure
{julien, iacopo, francois, florent}@lighton.ai
arXiv:2006.12878v1 [stat.ML] 23 Jun 2020
Abstract
Despite being the workhorse of deep learning, the backpropagation algorithm is
no panacea. It enforces sequential layer updates, thus preventing efficient paral-
lelization of the training process. Furthermore, its biological plausibility is being
challenged. Alternative schemes have been devised; yet, under the constraint of
synaptic asymmetry, none have scaled to modern deep learning tasks and architec-
tures. Here, we challenge this perspective, and study the applicability of Direct
Feedback Alignment to neural view synthesis, recommender systems, geometric
learning, and natural language processing. In contrast with previous studies lim-
ited to computer vision tasks, our findings show that it successfully trains a large
range of state-of-the-art deep learning architectures, with performance close to
fine-tuned backpropagation. At variance with common beliefs, our work supports
that challenging tasks can be tackled in the absence of weight transport.
1 Introduction
While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements,
it is not without pitfalls. For one, its weight updates are non-local and rely on upstream layers. Thus,
they cannot be easily parallelized [3], incurring important memory and compute costs. Moreover,
its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the
weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward
path: this is implausible in biological brains, and known as the weight transport problem [6].
Consequently, alternative training algorithms have been developed. Some of these algorithms are
explicitly biologically inspired [7–13], while others focus on making better use of available compute
resources [3,14–19]. Despite these enticing characteristics, none has been widely adopted, as they
are often demonstrated on a limited set of tasks. Moreover, as assessed in [20], their performance on
challenging datasets under the constraint of synaptic asymmetry is disappointing.
We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment
(DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural
view synthesis and recommender systems, to geometric learning with graph convolutions, and natural
language processing with Transformers. Our results define new standards for learning without weight
transport and show that challenging tasks can indeed be tackled under synaptic asymmetry.
All code needed to reproduce our experiments is available at
https://github.com/lightonai/dfa-scales-to-modern-deep-learning.
Preprint. Under review.
1.1 Related work
Training a neural network is a credit assignment problem: an update is derived for each parameter
from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21].
Biologically motivated methods Finding a training method applicable under the constraints of
biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur
[22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic
asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10–12] train distinct
feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the
transpose of the forward weights used in the backward pass with a random matrix. Throughout training,
the forward weights learn to align with the arbitrary backward weights, eventually approximating BP.
Beyond biological considerations As deep learning models grow bigger, large-scale distributed
training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer
by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass,
updates must only depend on local quantities. Unsupervised learning is naturally suited for this,
as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly,
synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES)
[16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA
and directly projects a global error to each layer. A shared feedback path is still needed, but it only
depends on a simple random projection operation.
Performance of alternative methods Local training methods are successful in unsupervised learn-
ing [18]. Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet
[14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these
tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment
techniques to perform well on challenging datasets, some form of weight transport is necessary:
either by explicitly sharing sign information [24–26], or by introducing dedicated phases of alignment
for the forward and backward weights where some information is shared [27]. To the best of our
knowledge, no method compatible with the weight transport problem has ever been demonstrated on
challenging tasks.
1.2 Motivations and contributions
We focus on DFA, a compromise between biological and computational considerations. Notably,
DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly
preventing learning in demanding settings. Moreover, DFA allows for asynchronous weight updates,
and puts a single operation at the center of the training stage. This enables new classes of training
co-processors [28, 29], leveraging dedicated hardware to perform the random projection.
Extensive survey We apply DFA in a large variety of settings matching current trends in machine
learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but
computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly
different domains, across eight tasks, and with eleven different architectures. This constitutes a survey
of unprecedented scale for an alternative training method, and makes a strong case for the possibility
of learning without weight transport in demanding scenarios.
Challenging settings We demonstrate the ability of DFA to tackle challenging tasks. We success-
fully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale
(section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language
modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, that have
only been recently successfully tackled with deep learning.
Modern architectures We prove that the previously established failure of DFA to train convolutions
does not generalize. By evaluating performance metrics, comparing against a shallow baseline,
measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in
layers involving graph convolutions and attention. This significantly broadens the applicability of
DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.
2 Methods
Forward pass In a fully connected network, at layer $i$ out of $N$, neglecting its biases, with $W_i$ its
weight matrix, $f_i$ its non-linearity, and $h_i$ its activations, the forward pass is:
$$\forall i \in [1, \ldots, N]: \quad a_i = W_i h_{i-1}, \quad h_i = f_i(a_i). \qquad (1)$$
$h_0 = X$ is the input data, and $h_N = f(a_N) = \hat{y}$ are the predictions. A task-specific cost function
$L(\hat{y}, y)$ is computed to quantify the quality of the predictions with respect to the targets $y$.
Backward pass with BP The weight updates are computed by backpropagation of the error vector.
Using the chain-rule of derivatives, each neuron is updated based on its contribution to the cost
function. Leaving aside the specifics of the optimizer used, the equation for the weight updates is:
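$$\delta W_i = -\frac{\partial L}{\partial W_i} = -[(W_{i+1}^T \delta a_{i+1}) \odot f_i'(a_i)] h_{i-1}^T, \quad \delta a_i = \frac{\partial L}{\partial a_i}. \qquad (2)$$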
Backward pass with DFA The gradient signal $W_{i+1}^T \delta a_{i+1}$ of the (i+1)-th layer violates synaptic
asymmetry. DFA replaces it with a random projection of the topmost derivative of the loss, $\delta a_y$.
For common classification and regression losses such as the mean squared error or the negative log
likelihood, this corresponds to a random projection of the global error $e = \hat{y} - y$. With $B_i$, a fixed
random matrix of appropriate shape drawn at initialization for each layer:
$$\delta W_i = -[(B_i \delta a_y) \odot f_i'(a_i)] h_{i-1}^T, \quad \delta a_y = \frac{\partial L}{\partial a_y}. \qquad (3)$$
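The three equations of this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the tanh non-linearity, the linear output layer with a squared-error loss, and the choice to update the output layer with its own local gradient are all assumptions made for illustration. Data is stored column-wise, so a batch has shape (features, batch).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a small fully connected network with N = 3 layers.
sizes = [8, 16, 16, 4]          # input, two hidden layers, output
batch = 5

# Forward weights W_i, and fixed random feedback matrices B_i for hidden layers.
W = [rng.normal(0, 0.1, (sizes[i + 1], sizes[i])) for i in range(3)]
B = [rng.normal(0, 0.1, (sizes[i + 1], sizes[-1])) for i in range(2)]

def f(a):
    return np.tanh(a)

def f_prime(a):
    return 1.0 - np.tanh(a) ** 2

def forward(X):
    """Eq. (1): a_i = W_i h_{i-1}, h_i = f_i(a_i); the last layer is linear here."""
    h, a = [X], []
    for i, Wi in enumerate(W):
        a.append(Wi @ h[-1])
        h.append(a[-1] if i == len(W) - 1 else f(a[-1]))
    return a, h

def bp_updates(a, h, e):
    """Eq. (2): the error travels backwards through W_{i+1}^T."""
    delta = e                                   # delta a_N for a linear output and MSE
    dW = [None] * len(W)
    for i in reversed(range(len(W))):
        dW[i] = -delta @ h[i].T
        if i > 0:
            delta = (W[i].T @ delta) * f_prime(a[i - 1])
    return dW

def dfa_updates(a, h, e):
    """Eq. (3): each hidden layer receives B_i e directly; W_{i+1}^T is never used."""
    dW = [None] * len(W)
    dW[-1] = -e @ h[-2].T                       # output layer: its own local gradient
    for i in range(len(W) - 1):
        delta = (B[i] @ e) * f_prime(a[i])
        dW[i] = -delta @ h[i].T
    return dW

# Random data, only to exercise the functions.
X = rng.normal(size=(sizes[0], batch))
y = rng.normal(size=(sizes[-1], batch))
a, h = forward(X)
e = h[-1] - y                                   # global error e = y_hat - y
dW_bp = bp_updates(a, h, e)
dW_dfa = dfa_updates(a, h, e)
```

In this sketch the hidden-layer DFA updates depend only on the global error and on quantities local to each layer, which is what allows them to be computed in parallel once the forward pass is complete.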
3 Experiments
We study the applicability of DFA to a diverse set of applications requiring state-of-the-art architec-
tures. We start with fully connected networks, where DFA has already been demonstrated, and address
new challenging settings. We then investigate geometric learning: we apply DFA to graph neural net-
works in classification tasks on citation networks, as well as graph autoencoders. These architectures
feature graph convolutions and attention layers. Finally, we use DFA to train a transformer-based
Natural Language Processing (NLP) model on a dataset of more than 100 million tokens.
3.1 Fully connected architectures
DFA has been successful at training fully connected architectures, with performance on-par with
backpropagation [19,20]. However, only computer vision tasks have been considered, where fully
connected networks considerably underperform their convolutional counterpart. Here, we focus on
tasks where fully connected architectures are state-of-the-art. Moreover, the architectures considered
are deeper and more complex than those necessary to solve a simple task like MNIST.
3.1.1 Neural view synthesis with Neural Radiance Fields
The most recent state-of-the-art neural view synthesis methods are based on large fully connected
networks: this is an ideal setting for a first evaluation of DFA on a challenging task.
Background There has been growing interest in methods capable of synthesising novel renders of
a 3D scene using a dataset of past renders. The network is trained to learn an inner representation of
the scene, and a classical rendering system can then query the model to generate novel views. With
robust enough methods, real-world scenes can also be learned from a set of pictures.
Until recently, most successful neural view synthesis methods were based on sampled volumetric
representations [30–32]. In this context, Convolutional Neural Networks (CNNs) can be used to
smooth out the discrete sampling of 3D space [33,34]. However, these methods scale poorly to
higher resolutions, as they still require finer and finer sampling. Conversely, alternative schemes
based on a continuous volume representation have succeeded in generating high-quality renders [35],
even featuring complex phenomena such as view-dependent scattering [36]. These schemes make
point-wise predictions, and use fully connected neural networks to encode the scene.
Figure 1: Comparisons of NeRF-DFA with state-of-the-art methods trained with BP on the most
challenging synthetic and real-world scenes. While NeRF-DFA generates renders of lower quality,
they maintain multi-view consistency and exhibit no geometric artefacts. BP results from [36].
Setting We employ Neural Radiance Fields (NeRF) [36], the state-of-the-art for neural view
synthesis. NeRF represents scenes as a continuous 5D function of space (three spatial coordinates
and two viewing angles) and outputs a point-wise RGB radiance and opacity. A ray-casting renderer can
then query the network to generate arbitrary views of the scene. The network modeling the continuous
function is 10 layers deep. Two identical networks are trained: the coarse network predictions inform
the renderer about the spatial coordinates that the fine network should preferentially evaluate to avoid
empty space and occluded regions.
Results We report quantitative results of training NeRF with DFA in Table 1. Neural view synthesis
methods are often better evaluated qualitatively: we showcase some renders in Figure 1.
On a dataset of renders featuring complex scenes with non-Lambertian materials (NeRF-Synthetic
[36]), NeRF-DFA outperforms two previous fine-tuned state-of-the-art methods, Scene Representation
Networks (SRN) [35] and Local Light Field Fusion (LLFF) [32], and nearly matches the performance
of Neural Volumes (NV) [34]. While DFA underperforms alternative methods trained with BP on
the real world view dataset (LLFF-Real [32]), its performance remains significant: real world view
synthesis is a challenging task, and this level of PSNR indicates that learning is indeed happening.
In particular, we find that NeRF-DFA retains the key characteristics of NeRF-BP: it can render
view-dependent effects, and is multi-view consistent. The last point is an especially important achievement,
and most visible in videos, as it is a challenge for most algorithms [30–32,35]. The main drawback
of NeRF-DFA appears to be a seemingly lower render definition. The NeRF architecture has not
been fine-tuned to achieve these results: DFA works out-of-the-box on this advanced method. Future
research focusing on architectural changes to NeRF could improve performance with DFA; some
preliminary results are included in the supplementary material.
Table 1: Peak Signal to Noise Ratio (PSNR, higher is better) of neural view synthesis methods
trained with backpropagation against NeRF trained with DFA. Even when trained with DFA, NeRF
outperforms two state-of-the-art methods on a synthetic dataset (NeRF-Synthetic), and achieves fair
performance on a challenging real world views dataset (LLFF-Real). BP results from [36].
                  NV      SRN     LLFF    NeRF    NeRF
                  BP      BP      BP      BP      DFA
NeRF-Synthetic    26.05   22.26   24.88   31.01   25.41
LLFF-Real         /       22.84   24.13   26.50   20.77
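For readers less familiar with the metric reported in Table 1, the following sketch shows the usual definition of PSNR from a mean squared error. It assumes pixel values normalized to [0, 1]; the helper names are hypothetical and the evaluation code of the paper may differ in detail.

```python
import math

def mse(rendered, reference):
    """Mean squared error between two equally sized flat sequences of pixel values."""
    return sum((r - t) ** 2 for r, t in zip(rendered, reference)) / len(rendered)

def psnr(mean_squared_error, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    if mean_squared_error <= 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mean_squared_error)

def mse_ratio(psnr_a, psnr_b):
    """How many times larger the MSE is at psnr_b than at psnr_a."""
    return 10.0 ** ((psnr_a - psnr_b) / 10.0)
```

Because the scale is logarithmic, a gap of 10 dB between two methods corresponds to a tenfold difference in mean squared error.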
3.1.2 Click-through rate prediction with recommender systems
We have demonstrated that DFA can train large fully connected networks on the difficult task of neural
view synthesis. We now seek to use DFA in more complex heterogeneous architectures, combining
the use of fully connected networks with other machine learning methods. Recommender systems are
an ideal application for such considerations.
Background Recommender systems are used to model the behavior of users and predict future
interactions. In particular, in the context of click-through rate (CTR) prediction, these systems model
the probability of a user clicking on a given item. Building recommender systems is hard [37]: their
input is high-dimensional and sparse, and the model must learn to extract high-order combinatorial
features from the data. Moreover, they need to do so efficiently, as they are used to make millions of
predictions and the training data may contain billions of examples.
Factorization Machines (FM) [38] use inner-products of latent vectors between features to extract
pairwise feature interactions. They constitute an excellent baseline for shallow recommender systems,
but fail to efficiently transcribe higher-level features. To avoid extensive feature engineering, it has
been suggested that deep learning can be used in conjunction with wide shallow models to extract
these higher-level features [39]. In production, these systems are regularly retrained on massive
datasets: the speedup allowed by backward unlocking in DFA is thus of particular interest.
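The pairwise interaction term that Factorization Machines build from inner products of latent vectors can be sketched as follows. This is an illustrative NumPy version using the standard reformulation of the double sum, not code from the paper; the function names and the dense input vector are assumptions (real CTR inputs are sparse).

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """Sum over i < j of <v_i, v_j> x_i x_j, written as an explicit double loop."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += float(V[i] @ V[j]) * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity in O(n k): half of (square of sum minus sum of squares)."""
    xv = V * x[:, None]                       # row i holds x_i * v_i
    return 0.5 * float(np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0)))

# Hypothetical example: 6 features, latent dimension 3.
rng = np.random.default_rng(0)
x = rng.normal(size=6)
V = rng.normal(size=(6, 3))
naive, fast = fm_pairwise_naive(x, V), fm_pairwise_fast(x, V)
```

The fast form is what makes FM practical on high-dimensional inputs, since its cost grows linearly with the number of non-zero features.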
Setting Deep Factorization Machines (DeepFM) [40] combine FM and a deep fully connected
neural network, which we train with DFA. The input embedding is still trained directly via gradient
descent, as weight transport is not necessary to backpropagate through the FM. Deep & Cross
Networks (DCN) [41] replace the FM with a Cross Network, a deep architecture without non-
linearities capable of extracting high-degree interactions across features. We train the fully connected
network, the deep cross network, and the embeddings with DFA. Finally, Adaptive Factorization
Network (AFN) [42] uses Logarithmic Neural Networks [43] to enhance the representational power
of its deep component. We evaluate these methods on the Criteo dataset [44], which features nearly
46 million samples of one million sparse features. This is a difficult task, where performance
improvements of the AUC on the 0.001 level can enhance CTR significantly [39].
Results Performance metrics are reported in Table 2. To obtain these results, a simple hyperpa-
rameter grid search over optimization and regularization parameters was performed for BP and DFA
independently. DFA successfully trains all methods above the FM baseline, and in fact matches BP
performance in both DeepFM and AFN. Because of their complexity, recommender systems require
intensive tuning and feature engineering to perform at the state-of-the-art level, and reproducing
existing results can be challenging [45]. Hence, it is not surprising that a performance gap exists with
Deep&Cross: further fine-tuning may be necessary for DFA to reach BP performance.
Alignment measurements corroborate that learning is indeed occurring in the special layers of
Deep&Cross and AFN (see supplementary for details). Our results on recommender systems support
that DFA can learn in a large variety of settings, and that weight transport is not necessary to solve a
difficult recommendation task.
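One common way to quantify alignment is the cosine similarity between the error signal delivered by DFA to a layer and the signal BP would have delivered to the same layer. The sketch below illustrates that idea only; the exact measurement protocol is described in the supplementary material and is not reproduced here, and the function name and array layout are assumptions.

```python
import numpy as np

def alignment(dfa_signal, bp_signal, eps=1e-12):
    """Mean per-sample cosine similarity between DFA and BP error signals.

    Both arrays have shape (batch, features). A value near 1 means the DFA
    signal points in nearly the same direction as the BP gradient signal;
    a value near 0 means the two are close to orthogonal.
    """
    dots = np.sum(dfa_signal * bp_signal, axis=1)
    norms = np.linalg.norm(dfa_signal, axis=1) * np.linalg.norm(bp_signal, axis=1)
    return float(np.mean(dots / (norms + eps)))

# Hypothetical signals, only to exercise the function.
same = np.array([[1.0, 2.0], [3.0, -1.0]])
orthogonal_a = np.array([[1.0, 0.0]])
orthogonal_b = np.array([[0.0, 1.0]])
```

A positive alignment indicates that the DFA update has a component along the BP descent direction.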
Table 2: AUC (higher is better) and log loss (lower is better) of recommender systems trained on the
Criteo dataset [44]. Even in complex heterogeneous architectures, DFA performance is in line with
BP. Values in bold indicate DFA AUC within 0.001 from the BP AUC or better.
        FM       DeepFM            Deep&Cross        AFN
                 BP       DFA      BP       DFA      BP       DFA
AUC     0.7915   0.7954   0.7956   0.8104   0.8009   0.7933   0.7924
Loss    0.4687   0.4610   0.4624   0.4414   0.4502   0.4630   0.4621
3.2 Geometric Learning with Graph Convolutional Networks
The use of sophisticated architectures beyond fully connected layers is necessary for certain tasks,
such as geometric learning [46], where information lies in a complex structured domain. To address
geometric learning tasks, methods capable of handling graph-based data are commonly needed.
Graph convolutional neural networks (GCNNs) [47–50] have demonstrated the ability to process
large-scale graph data efficiently. We study the applicability of DFA to these methods, including
recent architectures based on an attention mechanism. Overall, this is an especially interesting setting,
as DFA fails to train more classic 2D image convolutional layers [23].
Background Complex data like social networks or brain connectomes lie on irregular or non-
Euclidean domains. They can be represented as graphs, and efficient processing in the spectral
domain is possible. Non-spectral techniques to apply neural networks to graphs have also been
developed [51–53], but they exhibit unfavorable scaling properties. The success of CNNs in deep
learning can be attributed to their ability to efficiently process structured high-dimensional data
by sharing local filters. Thus, a generalization of the convolution operator to the graph domain is
desirable: [47] first proposed a spectral convolution operation for graphs, and [48] introduced a form
of regularization to enforce spatial locality of the filters. We use DFA to train several such GCNN
implementations. We study both spectral and non-spectral convolutions, as well as methods inspired
by the attention mechanism. We consider the task of semi-supervised node classification: nodes from
a graph are classified using their relationship to other nodes as well as node-wise features.
Setting Fast Localized Convolutions (ChebConv) [49] approximate the graph convolution kernel
with Chebyshev polynomials, and are one of the first scalable convolution methods on graphs. Graph
Convolutions (GraphConv) [50] remove the need for an explicit parametrization of the kernel by
enforcing linearity of the convolution operation on the graph Laplacian spectrum. They are often considered
the canonical graph convolution. More recent methods do not operate in the spectral domain. Spline
Convolutions (SplineConv) [54] use a spline-based kernel, enabling the inclusion of information
about the relative positioning of nodes and enhancing their representational power, for instance in the
context of 3D meshes. Graph Attention Networks (GATConv) [55] use self-attention [56] layers to
enable predictions at a given node to attend more specifically to certain parts of its neighborhood.
Finally, building upon Jumping Knowledge Network [57], Just Jump (DNAConv) [58] uses multi-head
attention [59] to enhance the aggregation process in graph convolutions and enable deeper
architectures. We use PyTorch Geometric [60] for reference implementations of all of these methods.
We evaluate performance on three citation network datasets: Cora, CiteSeer, and PubMed [61].
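To make the GraphConv operation concrete, here is a minimal dense NumPy sketch of the widely used propagation rule with a symmetrically normalized adjacency matrix and self-loops. It is an illustration under stated assumptions (dense matrices, ReLU non-linearity, a toy 3-node path graph with random weights), not the PyTorch Geometric implementation used in the experiments.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees include the self-loop
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(A_hat, H, W):
    """One graph convolution layer: aggregate neighbors, project, apply ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy example: path graph 0 - 1 - 2, 4 input features per node, 2 output features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))       # node features
W0 = rng.normal(size=(4, 2))       # hypothetical layer weights
A_hat = normalized_adjacency(A)
H1 = graph_conv(A_hat, H0, W0)
```

The weight matrix is shared across all nodes, so from the point of view of a training rule such as DFA the layer has a single set of parameters that receives one update aggregated over nodes.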
Results We report classification accuracy in Table 3. BP and DFA regularization and optimiza-
tion hyperparameters are fine-tuned separately on the Cora dataset. In general, we find that less
regularization and lower learning rates are needed with DFA. DFA successfully trains all graph
methods, independent of whether they use the spectral domain or not, and even if they use attention.
Furthermore, for GraphConv, SplineConv, and GATConv, DFA performance nearly matches BP.
As GCNNs struggle with learning meaningful representations when stacking many layers [62], all
architectures but DNAConv are quite shallow (two layers). However, DFA performance is still
significantly higher than that of a shallow training method (see supplementary for details). The lower
performance on DNAConv is not a failure to learn: alignment measurements show that learning is
indeed occurring. It may be explained instead by a need for more in-depth fine-tuning, as this is a
deep architecture with 5 successive attention layers.
Table 3: Classification accuracy (%, higher is better) of graph convolution methods trained with BP
and DFA, on citation networks [61]. Except for ChebConv and DNAConv, DFA performance nearly
matches BP performance. Values are in bold when DFA is within 2.5% of BP.
            ChebConv      GraphConv     SplineConv    GATConv       DNAConv
            BP     DFA    BP     DFA    BP     DFA    BP     DFA    BP     DFA
Cora        79.2   75.4   80.1   79.9   81.0   77.7   82.6   80.6   84.6   82.9
CiteSeer    69.5   67.6   71.6   69.4   70.0   69.8   72.0   71.2   73.4   70.8
PubMed      79.5   75.7   78.8   77.8   77.5   77.2   77.7   77.1   87.2   79.9
Table 4: AUC and Average Precision (AP, higher is better) for a GraphConv GAE trained with BP or DFA
on citation networks. DFA reproduces BP performance.
                  GAE
                  BP      DFA
Cora      AUC     0.918   0.900
          AP      0.918   0.900
CiteSeer  AUC     0.886   0.879
          AP      0.895   0.889
PubMed    AUC     0.967   0.945
          AP      0.966   0.945
Figure 2: t-SNE visualization of the hidden layer activations of a two-layer GraphConv trained on
Cora with DFA. Classes form clear clusters, indicating that a useful intermediary representation
is learned. Colors represent different classes.
We further demonstrate that DFA helps graph convolutions learn meaningful representations by
applying t-SNE [63,64] to the hidden layer activations in GraphConv (Figure 2). Clusters of classes
are well-separated, indicating that a useful intermediary representation is derived by the first layer.
Graph autoencoders We consider one last application of graph convolutions, in the context of
graph autoencoders (GAE). We train a non-probabilistic GAE [65] based on GraphConv with DFA,
and report results in Table 4. DFA performance is always in line with BP.
3.3 Natural Language Processing with Transformers
We complete our study by training a Transformer [59] on a language modelling task. Transformers
have proved successful in text, image, music generation, machine translation, and many supervised
NLP tasks [59,66–69]. Here, we demonstrate that DFA can train them, and we show the influence of
tuning the optimizer hyperparameters in narrowing the gap with BP.
Background NLP has largely benefited from advances in deep learning. Recurrent Neural Net-
works were responsible for early breakthroughs, but their sequential nature prevented efficient
parallelization of data processing. Transformers are attention-based models that do not rely on
recurrence or convolution. Their ability to scale massively has allowed the training of models with
several billion parameters [70,71], obtaining state-of-the-art results on all NLP tasks: Transformers
now top the prominent SQuAD 2.0 [72,73] and SuperGLUE [74] benchmarks. In parallel, transfer
learning in NLP has leaped forward thanks to language modelling, the unsupervised task of predicting
the next word. It can leverage virtually unlimited data from web scraping [75]. This enabled the
training of universal language models [76] on extremely large and diversified text corpora. These
models are useful across a wide range of domains, and can solve most NLP tasks after fine-tuning.
Setting The prominence of both language modelling and Transformers gives us the ideal candidate
for our NLP experiments: we train a Transformer to predict the next word on the WikiText-103
dataset [77], a large collection of good and featured Wikipedia articles. We use byte-pair-encoding
[78] with 32,000 tokens. Our setup is similar to GPT [66]: we adapt the Transformer, originally an
encoder-decoder model designed for machine translation, to language modelling. We keep only the
encoder and mask the tokens to predict. Our architecture consists of 6 layers, 8 attention heads, a
model dimension of 512, and a hidden size of 2048 in the feed-forward blocks. The text is sliced
in chunks of 128 tokens and batches of 64 such chunks, resulting in 8192 tokens per batch. Our
baseline is trained with BP using the optimization setup of [59]. We found perplexity after 20 epochs
to be an excellent indicator of perplexity at convergence; to maximize the number of experiments
we could perform, we report the best validation perplexity after 20 epochs. We study two ways of
implementing DFA: applying the feedback after every encoder block (macro) or after every layer in
those blocks (micro). The input embedding layer receives gradients from the next feedback point
through BP. This leaves some amount of weight transport even in the micro case.
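As a reminder of how the reported metric relates to the training loss, the sketch below computes perplexity as the exponential of the mean per-token negative log-likelihood, and checks the batch size arithmetic stated above. It assumes natural-log likelihoods; the function and constant names are hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Values taken from the setting described in the text.
VOCAB_SIZE = 32_000        # byte-pair-encoding tokens
CHUNK_LEN = 128            # tokens per chunk
CHUNKS_PER_BATCH = 64      # chunks per batch
tokens_per_batch = CHUNK_LEN * CHUNKS_PER_BATCH

# A model predicting uniformly over the vocabulary has a perplexity equal to
# the vocabulary size, which gives a sense of scale for the reported scores.
uniform_ppl = perplexity([math.log(VOCAB_SIZE)] * 10)
```

Lower perplexity means the model assigns higher probability to the held-out text.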
Table 5: Best validation perplexity after 20 epochs of a Transformer trained on WikiText-103 (lower
is better). The BP and DFA baselines share all hyper-parameters. In Macro the feedback is applied
after every transformer layer, while in Micro the feedback is applied after every sub-layer. The
learning rate of Adam without the learning rate scheduler is $5 \cdot 10^{-5}$. With the scheduler, the initial
learning rate is $1 \cdot 10^{-4}$ and it is multiplied by 0.2 when performance plateaus, with a patience of 1.
* score after 22 epochs to let the learning rate scheduler take effect
         DFA                                                          BP
         Baseline   + Adam   + $\beta_2$ = 0.999   + LR schedule      Baseline   + $\beta_2$ = 0.999
Macro    95.0       77.1     55.0                  52.0               34.4       29.8
Micro    182        166      99.9                  93.3*
Results Our results are summarized in Table 5. Hyper-parameters fine-tuned for BP did not fare
well with DFA, but changes in the optimizer narrowed the gap between BP and DFA considerably.
The learning rate schedule used on top of Adam [79] in [59] proved detrimental. Using Adam alone
required a lower learning rate for DFA than for BP. Increasing $\beta_2$ from 0.98 [59] to 0.999
improved performance significantly. Finally, a simple scheduler that reduces the learning rate when
the validation perplexity plateaus helped reduce the perplexity further. Considering that the perplexity of the
shallow baseline is over 400, DFA is clearly able to train Transformers. However, our results are not
on par with BP, especially in the micro setting. A substantial amount of work remains to make DFA
competitive with BP, even more so in a minimal weight transport scenario. The large performance
improvements brought by small changes in the optimizer indicate that intensive fine-tuning, common
in publications introducing state-of-the-art results, could close the gap between BP and DFA.
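The plateau-based schedule mentioned above can be sketched in a few lines of plain Python. The factor of 0.2, the patience of 1, and the initial learning rate of 1e-4 come from the caption of Table 5; the class name, the exact patience semantics (the rate drops once the number of consecutive non-improving epochs exceeds the patience), and the validation values are assumptions made for illustration.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when validation perplexity stalls."""

    def __init__(self, lr=1e-4, factor=0.2, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_perplexity):
        """Record one epoch's validation perplexity and return the new rate."""
        if val_perplexity < self.best:
            self.best = val_perplexity
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Hypothetical validation curve: two stalled epochs trigger one reduction.
scheduler = ReduceOnPlateau()
rates = [scheduler.step(p) for p in [100.0, 90.0, 91.0, 92.0, 80.0]]
```

Deep learning libraries ship equivalent schedulers; this standalone version is only meant to show the mechanism.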
4 Conclusion and outlooks
We conducted an extensive study demonstrating the ability of DFA to train modern architectures. We
considered a broad selection of domains and tasks, with complex models featuring graph convolutions
and attention. Our results on large networks like NeRF and Transformers are encouraging, suggesting
that with further tuning, such leading architectures can be effectively trained with DFA. Future work
on principled training with DFA, in particular regarding the influence of common practices and
whether new procedures are required, will help close the gap with BP.
More broadly, we verified for the first time that learning under synaptic asymmetry is possible beyond
fully-connected layers, and in tasks significantly more difficult than previously considered. This
addresses a notable concern in biologically-plausible architectures. DFA still requires an implausible
global feedback pathway; however, local training has already been demonstrated at scale. The next
step towards biologically-compatible learning is a local method without weight transport.
While the tasks and architectures we have considered are not biologically inspired, they constitute
a good benchmark for behavioural realism [20]. Any learning algorithm claiming to approximate
the brain should reproduce its ability to solve complex and unseen tasks. Furthermore, even though
the current implementation of mechanisms like attention is devoid of biological considerations, they
represent broader concepts applicable to human brains [80]. Understanding how our brain learns is a
gradual process, and future research could incorporate further realistic elements, like spiking neurons.
Finally, unlocking the backward pass in large architectures like Transformers is promising. More optimized
implementations of DFA, built at a lower level of existing ML libraries, could unlock significant
speed-up. Leveraging the use of a single random projection as the cornerstone of training, dedicated
accelerators may employ more exotic hardware architectures. This will open new possibilities in the
asynchronous training of massive models.
Broader Impact
Of our survey This study is the first experimental validation of DFA as an effective training method
in a wide range of challenging tasks and neural network architectures. This significantly broadens the
applications of DFA, and more generally brings new insight on training techniques alternative to back-
propagation. From neural rendering and recommender systems, to natural language processing or
geometric learning, each of these applications has its own potential impact. Our task selection process
was motivated by current trends in deep learning, as well as by technically appealing mechanisms
(graph convolutions, attention). A limit of our survey is that our (arguably biased) selection of tasks
cannot be exhaustive. Our experiments required substantial cloud compute resources, with state-of-
the-art GPU hardware. Nevertheless, as this study provides new perspectives for hardware accelerator
technologies, it may favor the application of neural networks in fields previously inaccessible because
of computational limits. Future research on DFA should continue to demonstrate its use in novel
contexts of interest as they are discovered.
Of the considered applications Each of the applications considered in our study has a wide
potential impact; consider, for example, the impact of textual bias in pretrained word embeddings [81].
We refer to [82] and references therein for a discussion of ethical concerns of AI applications.
Of DFA as a training method DFA enables parallelization of the backward pass and places a
single operation at the center of the training process, opening the prospect of reducing the power
consumption of training chips by an order of magnitude [28]. Not only is more efficient training a
path to more environmentally responsible machine learning [83], but it may lower the barrier of entry,
supporting equality and sustainable development goals. A significant downside of moving from BP to
DFA is a far more limited understanding of how to train models and how the trained models behave.
There is a clear empirical understanding of the impact of techniques such as batch normalization
or skip connections on the performance of BP; new insights need to be obtained for DFA. BP also
enjoys decades of work on topics like adversarial attacks, interpretability, and fairness. Much of
this work has to be cross-checked for alternative training methods, something we encourage further
research to consider as the next step towards safely and responsibly scaling up DFA.
Of biologically motivated methods. Finally, a key motivation for this study was to demonstrate that
learning challenging tasks was possible without weight transport. Biologically motivated methods
are a more foundational research direction, and as such the possible long-term impact of our findings
is harder to estimate under this light. However, fundamental research of this kind is important to open
new pathways for ML and neuroscience.
Acknowledgments and Disclosure of Funding
We thank Igor Carron and Laurent Daudet for the general guidance on the subject of this investigation
and the insightful comments, as well as the larger LightOn team for their support.
References
[1] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. PhD thesis, Harvard University, 1974.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. In Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, 1986.
[3] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves,
David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients.
In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
1627-1635, 2017.
[4] Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129-132, 1989.
[5] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep
learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.
[6] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance.
Cognitive science, 11(1):23-63, 1987.
[7] Javier R Movellan. Contrastive hebbian learning in the continuous hopfield model. In
Connectionist models, pages 10-17. Elsevier, 1991.
[8] Randall C O'Reilly. Biologically plausible error-driven learning using local activation
differences: The generalized recirculation algorithm. Neural computation, 8(5):895-938, 1996.
[9] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In Artificial intelligence
and statistics, pages 448-455, 2009.
[10] Yann Le Cun. Learning process in an asymmetric threshold network. In Disordered systems
and biological organization, pages 233-240. Springer, 1986.
[11] Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target
propagation. arXiv preprint arXiv:1407.7906, 2014.
[12] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target
propagation. In Joint european conference on machine learning and knowledge discovery in databases,
pages 498-515. Springer, 2015.
[13] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random
synaptic feedback weights support error backpropagation for deep learning. Nature communications,
7(1):1-10, 2016.
[14] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can
scale to imagenet. In International Conference on Machine Learning, pages 583-593, 2019.
[15] Wojciech M Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan
Pascanu. Sobolev training for neural networks. In Advances in Neural Information Processing
Systems, pages 4278-4287, 2017.
[16] Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In
International Conference on Machine Learning, pages 4839-4850, 2019.
[17] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information
estimation and maximization. In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=Bklr3j0cKX.
[18] Sindy Löwe, Peter O'Connor, and Bastiaan Veeling. Putting an end to end-to-end:
Gradient-isolated learning of representations. In Advances in Neural Information Processing Systems,
pages 3033-3045, 2019.
[19] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In
Advances in neural information processing systems, pages 1037-1045, 2016.
[20] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy
Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and
architectures. In Advances in Neural Information Processing Systems, pages 9368-9378, 2018.
[21] Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton.
Backpropagation and the brain. Nature Reviews Neuroscience, pages 1-12, 2020.
[22] Natalia Caporale and Yang Dan. Spike timing-dependent plasticity: a hebbian learning rule.
Annu. Rev. Neurosci., 31:25-46, 2008.
[23] Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with
direct feedback alignment. arXiv preprint arXiv:1906.04554, 2019.
[24] Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in
backpropagation? In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[25] Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep
convolutional networks. arXiv preprint arXiv:1812.06488, 2018.
[26] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning
algorithms can scale to large datasets. In International Conference on Learning Representations,
2019. URL https://openreview.net/forum?id=SygvZ209F7.
[27] Mohamed Akrout, Collin Wilson, Peter C Humphreys, Timothy Lillicrap, and Douglas Tweed.
Using weight mirrors to improve feedback alignment. arXiv preprint arXiv:1904.05391, 2019.
[28] Julien Launay, Iacopo Poli, Kilian Müller, Igor Carron, Laurent Daudet, Florent Krzakala, and
Sylvain Gigan. Light-in-the-loop: using a photonics co-processor for scalable training of neural
networks, 2020.
[29] Charlotte Frenkel. Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling
Roads to Embedded Cognition. PhD thesis, UCL-Université Catholique de Louvain, 2020.
[30] Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis. ACM Transactions on
Graphics (TOG), 36(6):1-11, 2017.
[31] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck,
Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
2367-2376, 2019.
[32] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi
Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis
with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1-14,
2019.
[33] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael
Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 2437-2446, 2019.
[34] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and
Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM
Transactions on Graphics (TOG), 38(4):65, 2019.
[35] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks:
Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information
Processing Systems, pages 1119-1130, 2019.
[36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi,
and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv
preprint arXiv:2003.08934, 2020.
[37] H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady,
Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view
from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 1222-1230, 2013.
[38] Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data
Mining, pages 995-1000. IEEE, 2010.
[39] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye,
Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for
recommender systems. In Proceedings of the 1st workshop on deep learning for recommender
systems, pages 7-10, 2016.
[40] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a
factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247,
2017.
[41] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click
predictions. In Proceedings of the ADKDD'17, ADKDD'17, New York, NY, USA, 2017.
Association for Computing Machinery. ISBN 9781450351942. doi: 10.1145/3124749.3124754.
URL https://doi.org/10.1145/3124749.3124754.
[42] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning
adaptive-order feature interactions. In Thirty-Fourth AAAI Conference on Artificial Intelligence,
2020.
[43] J Wesley Hines. A logarithmic neural network architecture for unbounded non-linear function
approximation. In Proceedings of International Conference on Neural Networks (ICNN'96),
volume 2, pages 1245-1250. IEEE, 1996.
[44] Criteo. Kaggle contest dataset is now available for academic use!
http://labs.criteo.com/2014/09/kaggle-contest-dataset-now-available-academic-use/, 2014.
Accessed on 2020-05-20.
[45] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much
progress? a worrying analysis of recent neural recommendation approaches. In Proceedings of
the 13th ACM Conference on Recommender Systems, pages 101-109, 2019.
[46] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine,
34(4):18-42, 2017.
[47] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally
connected networks on graphs. In International Conference on Learning Representations, pages
httpopenreview, 2014.
[48] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured
data. arXiv preprint arXiv:1506.05163, 2015.
[49] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks
on graphs with fast localized spectral filtering. In Advances in neural information processing
systems, pages 3844-3852, 2016.
[50] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In International Conference on Learning Representations (ICLR), 2017.
[51] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph
domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.,
volume 2, pages 729-734. IEEE, 2005.
[52] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2008.
[53] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural
networks. In International Conference on Learning Representations, 2016.
[54] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric
deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 869-877, 2018.
[55] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks. In International Conference on Learning Representations,
2018. URL https://openreview.net/forum?id=rJXMpikCZ.
[56] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. In 3rd International Conference on Learning Representations,
ICLR 2015, 2015.
[57] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In International Conference on Machine Learning, 2018.
[58] Matthias Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. In
ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998-6008, 2017.
[60] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric.
In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[61] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina
Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93-93, 2008.
[62] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=ryGs6iA5Km.
[63] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine
learning research, 9(Nov):2579-2605, 2008.
[64] David M Chan, Roshan Rao, Forrest Huang, and John F Canny. Gpu accelerated t-distributed
stochastic neighbor embedding. Journal of Parallel and Distributed Computing, 131:1-13,
2019.
[65] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian
Deep Learning, 2016.
[66] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding with unsupervised learning. Technical report, OpenAI, 2018.
[67] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku,
and Dustin Tran. Image transformer. ArXiv, abs/1802.05751, 2018.
[68] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya
Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
[69] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis,
Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
URL https://www.aclweb.org/anthology/N19-1423.
[70] Mohammad Shoeybi, Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and
Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model
parallelism. ArXiv, abs/1909.08053, 2019.
[71] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[72] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas,
November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL
https://www.aclweb.org/anthology/D16-1264.
[73] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable
questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pages 784-789, Melbourne, Australia,
July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL
https://www.aclweb.org/anthology/P18-2124.
[74] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose
language understanding systems. In Advances in Neural Information Processing Systems, pages
3261-3275, 2019.
[75] The Common Crawl Team. Common Crawl. https://commoncrawl.org, 2020.
[76] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text
classification. In ACL. Association for Computational Linguistics, 2018.
URL http://arxiv.org/abs/1801.06146.
[77] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
models. ArXiv, abs/1609.07843, 2017.
[78] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
words with subword units. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany,
August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL
https://www.aclweb.org/anthology/P16-1162.
[79] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International
Conference on Learning Representations, 12 2014.
[80] Grace W Lindsay. Attention in psychology, neuroscience, and machine learning. Frontiers in
Computational Neuroscience, 14:29, 2020.
[81] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.
Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In
Advances in neural information processing systems, pages 4349-4357, 2016.
[82] Alexandra Luccioni and Yoshua Bengio. On the morality of artificial intelligence. arXiv preprint
arXiv:1912.11945, 2019.
[83] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for
deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
[84] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer:
Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743, 2020.
[85] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao,
and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint
arXiv:1908.03265, 2019.
[86] Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns
in transformer-based machine translation. arXiv preprint arXiv:2002.10260, 2020.
[87] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas
Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style,
high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32,
pages 8024-8035. Curran Associates, Inc., 2019. URL
http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Appendix
We first provide additional elements to corroborate our findings: alignment measurement (Section
A), and shallow baselines (Section B). We then discuss the process of adapting the considered
architectures for DFA (Section C), and the issue of weight transport in attention layers (Section D).
We provide some supplementary results for NeRF (Section E), including details of performance on
each scene of each dataset, and a discussion on possible mitigation of DFA shortcomings. Finally,
we outline steps necessary for reproduction of this work (Section F).
A Alignment
Alignment measurement. In feedback alignment methods, the forward weights learn to align with
the random backward weights, making the delivered updates useful. This alignment can be quantified
by measuring the cosine similarity between the gradient signal delivered by DFA,
$-\mathbf{B}_i \delta\mathbf{a}_y$, and the gradient signal BP would have delivered,
$-\mathbf{W}_{i+1}^T \delta\mathbf{a}_{i+1}$. For learning to occur and DFA to work as
a training method, there must be alignment. This can be measured numerically [23]. Measuring
alignment allows us to check whether or not the layers are effectively being trained by DFA, regardless
of performance metrics. We note that any alignment value greater than 0 signifies that learning is
occurring. Values closer to 1 indicate a better match with BP, but small alignment values are sufficient
to enable learning. We report values measured at the deepest DFA layer.
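As an illustration of the measurement described above, the following is a minimal sketch in plain Python, not the code used for the experiments. It compares the DFA feedback signal (a random matrix applied to the output error) with the BP signal (the transposed weights of the next layer applied to that layer's error). The function names and the list-of-lists matrix layout are assumptions made for this example, and both signals are assumed to be non-zero.

```python
import math
import statistics

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment(B_i, delta_a_y, W_next, delta_a_next):
    """Alignment of one layer for one sample.

    B_i:          random feedback matrix, shape (n_i, n_y)
    delta_a_y:    error at the output, length n_y
    W_next:       forward weights of layer i+1, shape (n_next, n_i)
    delta_a_next: error at layer i+1, length n_next
    Both signals carry the same leading minus sign, so it is omitted here.
    """
    dfa_signal = matvec(B_i, delta_a_y)
    W_next_T = [list(col) for col in zip(*W_next)]
    bp_signal = matvec(W_next_T, delta_a_next)
    return cosine_similarity(dfa_signal, bp_signal)

def batch_alignment(B_i, W_next, batch):
    """Mean and standard deviation over (delta_a_y, delta_a_next) pairs."""
    values = [alignment(B_i, d_y, W_next, d_next) for d_y, d_next in batch]
    return statistics.mean(values), statistics.pstdev(values)
```

A value of 1 is obtained when the feedback matrix reproduces the BP signal exactly, and a negative value would mean that the delivered update opposes the BP direction.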
Recommender systems. We measure alignment on the Criteo dataset, in the two architectures
featuring non-conventional fully-connected layers: Deep & Cross and AFN. Alignment is measured
after 15 epochs of training, and averaged over a random batch of 512 samples. Results are reported in
Table A.1. These alignment measurements indicate that learning is indeed occurring in the cross and
logarithmic layers. The high variance of alignment in the cross layers is unique: it may be explained by
the absence of non-linearity, and may account for the difference in performance between BP and DFA on
this architecture, which is larger than on the others.
Table A.1: Alignment cosine similarity (higher is better, standard deviation in parentheses) of
recommender systems as measured on the Criteo dataset. Learning occurs in both architectures, and
high variance may explain the larger performance gap on Deep & Cross compared to other methods.

             Deep & Cross    AFN
Alignment    0.40 (0.91)     0.49 (0.08)
Graph convolutions. We measure alignment on the Cora dataset, after 250 epochs of training,
averaging values over every sample available (train, validation, and test splits included). Results are
reported in Table A.2. We observe high alignment values in all architectures, indicating that learning
is indeed occurring. Slightly lower values in SplineConv and GATConv may be explained by the use
of the Exponential Linear Unit (ELU) instead of the Rectified Linear Unit (ReLU) used as activation
in other architectures.
Table A.2: Alignment cosine similarity (standard deviation in parentheses) of various graph
convolution architectures as measured on the Cora dataset. These values corroborate that DFA successfully
trains all architectures considered.

             ChebConv       GraphConv      SplineConv     GATConv        DNAConv
Alignment    0.87 (0.12)    0.77 (0.25)    0.56 (0.22)    0.63 (0.18)    0.92 (0.30)
B Shallow baselines
Shallow learning. We compare DFA to BP, but also to shallow learning, where only the topmost
layer is trained. While DFA may not reach the performance level of BP, it should still vastly
outperform shallow learning: failure to do so would mean that the weight updates delivered by DFA
are useless. On a simple task like MNIST, a shallow baseline may reach an accuracy as high as 90%.
However, given the difficulty of the tasks we consider, the shallow baseline is here usually much lower.

Figure A.1: Comparisons of NeRF-Tiny trained with BP, DFA, and a shallow approach. Shallow
training is insufficient to learn scene geometry. Lego scene from the NeRF synthetic dataset.
NeRF. Because NeRF models are expensive to train (up to 15 hours on a V100), we consider a
simplified setup for the shallow baseline, NeRF-Tiny. This setup operates at half the full resolution
of the training images available, runs for 5000 iterations only, and does away with view-dependent
characteristics. Furthermore, the network is cut down to 3 layers of half the width of NeRF, and
no coarse network is used to inform the sampling. We train this network on the Lego scene of the
NeRF-Synthetic dataset, and compare results.
Figure A.1 presents renders generated by NeRF-Tiny trained with BP, DFA, and a shallow approach.
While BP and DFA deliver similar renders, shallow training fails to reproduce even basic scene
geometry, instead outputting a diffuse cloud of colors. This highlights that while DFA may not reach
a level of performance on par with BP on NeRF, it nonetheless delivers meaningful updates enabling
the learning of complex features.
Recommender systems. Because recommender systems require fine-tuning, we perform the same
hyperparameter search for shallow learning as for DFA and BP. Results are detailed in Table A.3.
Performance of shallow training is always well below that of BP and DFA (remember that 0.001-level
differences matter in recommender systems). In particular, in Deep & Cross, where there was the
biggest gap between BP and DFA, the performance of the shallow method is extremely poor, well
below the FM baseline. Finally, as expected, DeepFM recovers more or less the performance of FM
even with a shallow baseline.
Table A.3: Shallow baseline for recommender system models on the Criteo dataset. Performance is
always well below BP and DFA, as expected.

        DeepFM    Deep & Cross    AFN
AUC     0.7920    0.7324          0.7859
Loss    0.4682    0.5010          0.4685
Graph convolutions. We use the same hyperparameters as for DFA to produce the shallow baseline
on graph datasets. Results are reported in Table A.4. Performance is always much worse than BP
and DFA. GATConv recovers the best performance: random attention layers may still deliver useful
features [84], as do random convolutions.
Table A.4: Shallow baseline for GCNNs on Cora, CiteSeer, and PubMed [61]. Performance is always
well below BP and DFA.

            ChebConv    GraphConv    SplineConv    GATConv    DNAConv
Cora        23.3        37.0         39.6          59.4       30.2
CiteSeer    27.4        33.8         30.1          49.8       24.0
PubMed      37.6        44.8         44.2          67.8       42.2

Transformers. In the baseline setting (optimizer and hyper-parameters of [59]), a Transformer
trained in the shallow regime yields a perplexity of 428 on WikiText-103. We do not consider
other settings, as the cost of training a Transformer is high and we do not expect any meaningful
improvements, as with NeRF above.
C Adapting architectures to DFA
NeRF. We use an architecture identical to the one used in [36], but based on the actual code
implementation rather than the description in the paper (see footnote 1). During our tests, we have found that
lowering the learning rate to $1 \cdot 10^{-4}$ rather than $5 \cdot 10^{-4}$ works best with DFA.
Recommender systems. For all training methods (BP, DFA, and shallow), we have conducted
independent hyperparameter searches. We performed a grid search over the learning rate, from
$1 \cdot 10^{-4}$ to $1 \cdot 10^{-3}$ in $1 \cdot 10^{-4}$ steps, as well as over the dropout
probability, from $0.1$ to $0.5$ in $0.1$ steps (where applicable). On DeepFM, this search leads us
to reduce the learning rate from $3 \cdot 10^{-4}$ with BP to $5 \cdot 10^{-5}$ with DFA, but to
keep the 0.5 dropout rate. On Deep & Cross, we reduce the learning rate from $2 \cdot 10^{-4}$ to
$5 \cdot 10^{-5}$, with no dropout in both cases. In AFN, we reduce the learning rate from
$4 \cdot 10^{-4}$ to $3 \cdot 10^{-4}$ and dropout from 0.3 to 0.
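The search grid described above can be enumerated as in the following sketch. It only builds the list of candidate configurations. The helper name `grid` and the dictionary keys are assumptions made for this example, and the training and evaluation of each configuration are left out.

```python
from itertools import product

def grid(start, stop, step):
    """Evenly spaced values from start to stop (inclusive), rounded to
    10 decimals to avoid floating-point drift in the generated values."""
    n_steps = round((stop - start) / step)
    return [round(start + k * step, 10) for k in range(n_steps + 1)]

# Learning rate: 1e-4 to 1e-3 in 1e-4 steps (10 values).
learning_rates = grid(1e-4, 1e-3, 1e-4)
# Dropout probability: 0.1 to 0.5 in 0.1 steps (5 values), where applicable.
dropouts = grid(0.1, 0.5, 0.1)

# Every (learning rate, dropout) pair is one candidate configuration.
configs = [{"lr": lr, "dropout": p} for lr, p in product(learning_rates, dropouts)]
```

With both hyperparameters searched, this gives 50 configurations per architecture and per training method.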
Graph convolutions. We manually test a few hyperparameter configurations on the Cora dataset,
focusing on learning rate, weight decay, and dropout. We do not consider architectural changes, such
as changing the number of filters or of attention heads. For ChebConv and GraphConv, we reduce
weight decay to $1 \cdot 10^{-4}$ instead of $5 \cdot 10^{-4}$, and set the dropout rate to $0$ and
$0.1$ respectively, instead of $0.5$ with BP. For SplineConv, we find that no change in the
hyperparameters is necessary. For GATConv, we reduce weight decay to $1 \cdot 10^{-4}$ instead of
$5 \cdot 10^{-4}$ and reduce the rate of the dedicated dropout layer to $0.1$ instead of $0.6$, but
keep the $0.6$ dropout rate within the GAT layer. Finally, on DNAConv we
disable weight decay entirely, instead of an original value of $5 \cdot 10^{-4}$, double the learning
rate from $5 \cdot 10^{-3}$ to $1 \cdot 10^{-2}$, and disable dropout entirely. In all cases, we share
the backward random matrix across all nodes in a graph.
Transformers. The model hyper-parameters were fixed across all of our experiments, except for
the number of attention heads in one case, which we specify below, and dropout. We tested several
values of dropout probability between 0 and 0.5, but found the original value of 0.1 to perform
best. We manually tested a number of optimizers, optimizer parameters and attention mechanisms.
We tested four combinations of optimizers and schedulers: Adam with the scheduler used in [59],
Adam alone, RAdam [85] alone, and Adam with a scheduler that reduces the learning rate when
the validation perplexity plateaus. We found it necessary to reduce the initial learning rate of Adam
from $1 \cdot 10^{-4}$ to $5 \cdot 10^{-5}$, although it could be set back to $1 \cdot 10^{-4}$ with
a scheduler. We tried two values of $\beta_2$: 0.98 and 0.999. We also tried changing $\beta_1$ and
observed some small differences that were not significant enough for the main text. Finally, we tried
three attention mechanisms in addition to
the standard multihead scaled dot-product attention: the dense and random (learnable) Synthesizers
of [84], as well as the fixed attention patterns of [86]. The latter needed to be adapted to language
modelling to prevent attending to future tokens, which led us to reduce the number of attention
heads to 4. The backward random matrix is always shared across all tokens and batches.
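To make the sharing of the backward matrix concrete, here is a minimal sketch in plain Python. A single random matrix is drawn once and applied to the output error of every token (or, for graphs, every node). The function names, the Gaussian initialization, and the seed are assumptions made for this example and are not taken from the experiments.

```python
import random

def make_feedback_matrix(n_hidden, n_out, seed=0):
    """Draw one fixed random feedback matrix of shape (n_hidden, n_out).
    The Gaussian distribution is an assumption made for this sketch."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_hidden)]

def dfa_feedback(B, errors):
    """Project each token's output error through the same matrix B.

    errors: one error vector of length n_out per token (or node)
    returns: one feedback vector of length n_hidden per token
    The leading minus sign of the update is left to the caller.
    """
    return [[sum(b * e for b, e in zip(row, err)) for row in B] for err in errors]
```

Because the matrix does not depend on the position in the sequence, two tokens with identical output errors receive identical feedback.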
Footnote 1: https://github.com/bmild/nerf/issues/11
D Weight transport and attention
We consider an attention layer operating on input $x$. The queries, keys, and values are respectively
$q = xW_Q$, $k = xW_K$, $v = xW_V$, and $d_k$ is the dimension of the queries and keys. The layer
performs:
$$\text{Attention}(q, k, v) = \text{softmax}\left(\frac{qk^T}{\sqrt{d_k}}\right) v \qquad (4)$$
When using DFA on attention, we deliver the random feedback to the top of the layer. Accordingly,
to obtain updates to $W_Q$, $W_K$, and $W_V$ we still have to backpropagate through the attention
mechanism itself. This involves weight transport on $W_V$, sacrificing some biological realism for
simplicity. Overall weight transport between layers still does not occur, and updating the layers in
parallel remains possible.
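For reference, the forward computation of Equation (4) can be sketched in plain Python as follows. This is a single-head version without masking, written with list-of-lists matrices. The helper names are assumptions made for this example, and it does not cover the backward pass discussed above.

```python
import math

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, W_Q, W_K, W_V):
    """Scaled dot-product attention on input x of shape (tokens, d_model)."""
    q, k, v = matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)
    d_k = len(q[0])                        # dimension of queries and keys
    k_T = [list(col) for col in zip(*k)]   # transpose of the keys
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)
```

The attention weights multiply the values, which is why an update computed at the top of the layer has to pass back through this product before reaching the three projection matrices.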
Besides using FA or DFA within the attention layer, alternative mechanisms like the synthesizer
[84] (which uses random attention in place of the query and key system) or fixed attention [86] can
remove the need for weight transport. Implementing these mechanisms in DFA-trained Transformers,
or other attention-powered architectures, will require further research.
E Supplementary NeRF results
Quantitative results. We report per-scene scores for each dataset in Table A.5. BP values are taken
from [36]. On three scenes of the synthetic dataset, NeRF-DFA even outperforms past state-of-the-art
methods trained with BP. Note that Neural Volumes (NV) is not applicable to forward-facing view
synthesis, as is required in LLFF-Real, and thus no results are reported.
Qualitative results. We report sample renders from the NeRF-Synthetic dataset (Figure A.2) and
the LLFF-Real dataset (Figure A.3), for every scene available. However, we recommend that readers
consult the supplementary video to make better sense of characteristics like multi-view consistency
and view-dependent effects (most visible on the LLFF-Real Room scene).
Table A.5: Per-scene PSNR for NeRF DFA and BP against other state-of-the-art methods on the
Nerf-Synthetic and LLFF-Real. DFA performance is fairly homogeneous across each dataset and in
line with the differences in other methods.
                  NV      SRN     LLFF    NeRF    NeRF
                  BP      BP      BP      BP      DFA
NeRF-Synthetic    26.05   22.26   24.88   31.01   25.41
  Chair           28.33   26.96   28.72   33.00   28.74
  Drums           22.58   17.18   21.13   25.01   22.15
  Ficus           24.79   20.73   21.79   30.13   25.61
  Hotdog          30.71   26.81   31.41   36.18   28.03
  Lego            26.08   20.85   24.54   32.54   24.93
  Materials       24.22   18.09   20.72   29.62   25.15
  Mic             27.78   26.85   27.48   32.91   25.43
  Ship            23.93   20.60   23.22   28.65   23.25
LLFF-Real         -       22.84   24.13   26.50   20.77
  Room            -       27.29   28.42   32.70   24.20
  Fern            -       21.37   22.95   25.17   21.82
  Leaves          -       18.24   19.52   20.92   16.50
  Fortress        -       26.63   29.40   31.16   25.16
  Orchids         -       17.37   18.52   20.36   16.73
  Flower          -       26.63   25.46   27.40   21.55
  T-Rex           -       22.87   24.15   26.80   19.43
  Horns           -       24.33   24.70   27.45   20.75
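For readers unfamiliar with the metric, the sketch below shows the standard PSNR definition and checks how the dataset-level rows of Table A.5 relate to the per-scene rows. It assumes pixel values in [0, 1] and that the paper uses the usual definition, PSNR = 10 log10(MAX^2 / MSE); the function name `psnr` is ours and is not taken from the authors' code.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard peak signal-to-noise ratio in dB (assumed definition)."""
    return 10.0 * math.log10(max_value ** 2 / mse)


# Per-scene NeRF-DFA scores on NeRF-Synthetic, copied from Table A.5.
nerf_dfa_synthetic = [28.74, 22.15, 25.61, 28.03, 24.93, 25.15, 25.43, 23.25]

# The dataset-level row appears to be the plain mean of the per-scene scores.
mean_dfa = sum(nerf_dfa_synthetic) / len(nerf_dfa_synthetic)

print(f"PSNR at MSE=0.01: {psnr(0.01):.2f} dB")
print(f"Mean NeRF-DFA PSNR on NeRF-Synthetic: {mean_dfa:.2f}")
```

The computed mean rounds to the 25.41 reported in the NeRF-Synthetic row, which suggests the dataset-level scores are averages of per-scene PSNR rather than a PSNR computed from a pooled error.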
Possible future directions Despite reproducing scene geometry in a multi-view consistent way,
NeRF produces renders of a lower quality when trained with DFA instead of BP. In particular, it
struggles to reproduce small-scale details, resulting in "blurry" renders. Moreover, it displays
high-frequency artefacts: not in the scene geometry, but in individual pixels taking values very
distant from their neighborhood. Interestingly, this noise phenomenon is unique to NeRF-DFA: it is
not observed on NeRF-BP with similar PSNR values (achieved during training) or on other methods with
similar or lower PSNR. This leads us to hypothesize that it is specific to DFA, possibly due to the
alignment process. Indeed, DFA creates a bias on the weights, by encouraging them to be "aligned"
with arbitrary values dependent on the random matrix used. It is possible this could introduce
random noise in the final renders, though we leave a more principled experiment to future research.
To attempt to alleviate this issue, we first consider NeRF-Dual. In NeRF-Dual, we average the
pixel-wise predictions of the fine and coarse networks, to attempt to remove some of the noise.
To do so, we still first use the coarse network to create a probability distribution for the
hierarchical sampling. Then, we evaluate both the coarse and fine networks again at the locations
informed by this probability distribution. Compared to vanilla NeRF, this requires an extra batch of
evaluations of the coarse network for all rays. Roughly speaking, this increases inference time by
30-50% depending on the coarse network architecture considered. We note that this is not applied
during training, so that training times remain identical.
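The NeRF-Dual inference procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `resample`, `render_coarse`, and `render_fine` are hypothetical callables standing in for hierarchical sampling and for volume rendering with each network, and the toy versions at the bottom exist only to make the example runnable.

```python
from typing import Callable, List, Sequence

Pixel = List[float]  # RGB triple
Render = Callable[[Sequence[float]], Pixel]           # hypothetical: depths -> pixel color
Resample = Callable[[Sequence[float]], List[float]]   # hypothetical: coarse depths -> refined depths


def nerf_dual_pixel(coarse_depths: Sequence[float], resample: Resample,
                    render_coarse: Render, render_fine: Render) -> Pixel:
    # 1) The coarse network still drives hierarchical sampling.
    fine_depths = resample(coarse_depths)
    # 2) Both networks are evaluated at the refined locations. The coarse
    #    evaluation here is the extra inference-only pass.
    coarse_rgb = render_coarse(fine_depths)
    fine_rgb = render_fine(fine_depths)
    # 3) Pixel-wise average of the two predictions.
    return [0.5 * (c + f) for c, f in zip(coarse_rgb, fine_rgb)]


# Toy stand-ins (assumptions for illustration only).
def toy_resample(depths: Sequence[float]) -> List[float]:
    midpoints = [0.5 * (a + b) for a, b in zip(depths, depths[1:])]
    return sorted(list(depths) + midpoints)


def toy_coarse(depths: Sequence[float]) -> Pixel:
    return [0.25, 0.5, 0.75]


def toy_fine(depths: Sequence[float]) -> Pixel:
    return [0.75, 0.5, 0.25]


pixel = nerf_dual_pixel([0.0, 1.0], toy_resample, toy_coarse, toy_fine)
print(pixel)
```

Because the averaging happens only at render time, a sketch like this leaves the training loop untouched, which is consistent with training times remaining identical.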
Figure A.2 and Figure A.3 showcase comparisons between NeRF and NeRF-Dual trained with DFA
on all scenes. When viewed at high resolution, such as in our supplementary video, the NeRF-Dual
renders are more pleasing, especially for the full scenes. They remove most of the high-frequency
noise, leading to smoother renders. However, this averaging process further blurs small-scale details in
the render. This is especially visible in the NeRF-Synthetic dataset, on scenes like Ficus. Furthermore,
NeRF-Dual introduces novel artefacts in the Mic and Ship scenes, with areas improperly colored
with a violet tint. The cause for these artefacts is unknown, but they show that NeRF-Dual is far from
a silver bullet. The PSNR also increases only minimally, by less than 0.5 per scene. Nevertheless,
this shows some promise for alleviating the shortcomings of NeRF-DFA. It is possible that
changes to the overall rendering process, or the use of classic image processing techniques, may help
enhance the NeRF-DFA images.
Finally, we also experimented with increasing the capacity of the fine network, by widening its layers
to 512 neurons. We call this architecture NeRF-XL. However, we have not succeeded in getting
PSNR values higher than with vanilla NeRF on DFA. In particular, the training process becomes
much more cumbersome, as multi-GPU parallelism is needed to fit the model. It is possible that
higher network capacity may help the network simultaneously learn the task at hand and align, but
further work is required.
F Reproducibility
Hardware used All main experiments require at most a single NVIDIA V100 GPU with 16GB
of memory to reproduce. Alignment measurements on large architectures (NeRF and Transformers)
require a second identical GPU to keep a copy of the network to evaluate BP gradients.
We estimate that a total of around 10,000 GPU-hours on V100s were necessary for this paper.
Accordingly, we estimate the cloud-computing carbon impact of this paper to be 1700 kgCO2eq 2 .
However, without hyperparameter searches, our results can be reproduced with less than 500 GPU-
hours on V100s, with most of that budget going to NeRF and Transformers.
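As a rough worked example, the figures above imply an average emission factor per GPU-hour, which gives an upper bound on the footprint of a reproduction run. This sketch assumes emissions scale linearly with GPU-hours on the same hardware and in the same cloud region, which the paper does not state; the variable names are ours.

```python
# Figures reported in the paper.
total_gpu_hours = 10_000   # estimated total V100 GPU-hours
total_kg_co2eq = 1_700     # estimated total carbon impact, in kgCO2eq
repro_gpu_hours = 500      # upper bound for reproduction without hyperparameter search

# Implied average emission factor (assumption: linear scaling).
kg_per_gpu_hour = total_kg_co2eq / total_gpu_hours

# Upper bound on the reproduction footprint under the same assumption.
repro_kg = total_kg_co2eq * repro_gpu_hours / total_gpu_hours

print(f"Implied factor: {kg_per_gpu_hour:.2f} kgCO2eq per GPU-hour")
print(f"Reproduction upper bound: {repro_kg:.0f} kgCO2eq")
```

Under these assumptions, reproducing the results without hyperparameter searches would emit at most about 5% of the paper's total estimate.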
Implementation We use the shared random matrix trick from [23] to reduce memory use in DFA
and enable its scaling to large networks. We use PyTorch [87] for all experiments. For reference
implementations of the methods considered, we relied on various sources. Our NeRF implementation
is based on the PyTorch implementation by Krishna Murthy 3 , with modifications to allow for proper
test and validation, as well as DFA and multi-GPU support. For recommender systems, we use
the torchfm package 4 . Finally, we use PyTorch Geometric [60] for all graph operations. Our
Transformer implementation is our own. Our code is available as supplementary material.
2 https://mlco2.github.io/impact#compute
3 https://github.com/krrish94/nerf-pytorch
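To illustrate the shared random matrix trick mentioned above, the sketch below shows one plausible reading of it: a single fixed random matrix is drawn once, sized for the widest layer, and each layer uses a slice of it as its feedback matrix. This is an assumption about the general idea in [23], not the authors' PyTorch code; `make_shared_feedback` and `dfa_feedback` are hypothetical helpers written with the standard library only.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_shared_feedback(max_width: int, error_dim: int, seed: int = 0) -> Matrix:
    """Draw one fixed random matrix, sized for the widest layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(error_dim)]
            for _ in range(max_width)]


def dfa_feedback(shared: Matrix, error: List[float], layer_width: int) -> List[float]:
    """Project the output error with the first `layer_width` rows of the shared matrix.

    In a full DFA update, this signal would then be multiplied element-wise
    by the derivative of the layer's activation function.
    """
    return [sum(b * e for b, e in zip(row, error)) for row in shared[:layer_width]]


layer_widths = [8, 4, 6]      # hypothetical hidden layer sizes
error = [0.5, -0.25, 0.0]     # hypothetical output error vector

shared = make_shared_feedback(max(layer_widths), len(error))
signals = [dfa_feedback(shared, error, w) for w in layer_widths]

# Storage is max(widths) x error_dim instead of sum(widths) x error_dim.
print(len(shared), [len(s) for s in signals])
```

Since every layer reads from the same matrix, the feedback signals of narrower layers are prefixes of those of wider ones in this sketch; the memory saving comes from storing a single matrix rather than one per layer.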
NeRF We provide training, testing, and rendering code along with the configurations used to obtain
our results. An example to reproduce our results is given in the supplementary code repository. Given
the computing cost associated with training a NeRF, we also provide our trained models.
Recommender systems We provide bash scripts to reproduce the results in Table 2 and A.3, with
the results of our hyperparameter search. We provide code to reproduce the results in Table A.1.
Graph convolutions We provide the code to reproduce all of our results. Note that the t-SNE
results are not exactly reproducible, as the CUDA implementation used is non-deterministic.
Transformers We provide bash scripts to reproduce Table 5 and the shallow results.
4 https://github.com/rixwew/pytorch-fm
Figure A.2: Sample renders for every scene of the NeRF-Synthetic dataset, for NeRF and NeRF-Dual
trained with DFA.
Figure A.3: Sample renders for every scene of the LLFF-Real dataset, for NeRF and NeRF-Dual
trained with DFA.