diff --git a/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt b/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt new file mode 100644 index 0000000..0c2f968 --- /dev/null +++ b/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt @@ -0,0 +1,555 @@ + IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 1 + + + + A Survey of Model Compression and Acceleration + + for Deep Neural Networks + + Yu Cheng, Duo Wang, Pan Zhou,Member, IEEE,and Tao Zhang,Senior Member, IEEE + + + + + Abstract—Deep convolutional neural networks (CNNs) have [2], [3]. It is also very time-consuming to train such a model + recently achieved great success in many visual recognition tasks. to get reasonable performance. In architectures that rely only However, existing deep neural network models are computation- on fully-connected layers, the number of parameters can grow ally expensive and memory intensive, hindering their deployment + in devices with low memory resources or in applications with to billions [4]. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + arXiv:1710.09282v7 [cs.LG] 7 Feb 2019 strict latency requirements. Therefore, a natural thought is to As larger neural networks with more layers and nodes + perform model compression and acceleration in deep networks are considered, reducing their storage and computational cost + without significantly decreasing the model performance. During becomes critical, especially for some real-time applications the past few years, tremendous progress has been made in such as online learning and incremental learning. In addi- this area. In this paper, we survey the recent advanced tech- + niques for compacting and accelerating CNNs model developed. tion, recent years witnessed significant progress in virtual + These techniques are roughly categorized into four schemes: reality, augmented reality, and smart wearable devices, cre- + parameter pruning and sharing, low-rank factorization, trans- ating unprecedented opportunities for researchers to tackle + ferred/compact convolutional filters, and knowledge distillation. fundamental challenges in deploying deep learning systems to Methods of parameter pruning and sharing will be described at portable devices with limited resources (e.g. memory, CPU, the beginning, after that the other techniques will be introduced. + For each scheme, we provide insightful analysis regarding the energy, bandwidth). Efficient deep learning methods can have + performance, related applications, advantages, and drawbacks significant impacts on distributed systems, embedded devices, + etc. Then we will go through a few very recent additional and FPGA for Artificial Intelligence. For example, the ResNet- + successful methods, for example, dynamic capacity networks and 50 [5] with 50 convolutional layers needs over 95MB memory stochastic depths networks. After that, we survey the evaluation for storage and over 3.8 billion floating number multiplications matrix, the main datasets used for evaluating the model per- + formance and recent benchmarking efforts. Finally, we conclude when processing an image. After discarding some redundant + this paper, discuss remaining challenges and possible directions weights, the network still works as usual but saves more than + on this topic. 75% of parameters and 50% computational time. 
For devices + Index Terms—Deep Learning, Convolutional Neural Networks, like cell phones and FPGAs with only several megabyte + Model Compression and Acceleration, resources, how to compact the models used on them is also + important. + Achieving these goal calls for joint solutions from manyI. I NTRODUCTION disciplines, including but not limited to machine learning, op- + In recent years, deep neural networks have recently received timization, computer architecture, data compression, indexing, + lots of attention, been applied to different applications and and hardware design. In this paper, we review recent works + achieved dramatic accuracy improvements in many tasks. on compressing and accelerating deep neural networks, which + These works rely on deep networks with millions or even attracted a lot of attention from the deep learning community + billions of parameters, and the availability of GPUs with and already achieved lots of progress in the past years. + very high computation capability plays a key role in their We classify these approaches into four categories: pa- + success. For example, the work by Krizhevskyet al.[1] rameter pruning and sharing, low-rank factorization, trans- + achieved breakthrough results in the 2012 ImageNet Challenge ferred/compact convolutional filters, and knowledge distil- + using a network containing 60 million parameters with five lation. The parameter pruning and sharing based methods + convolutional layers and three fully-connected layers. Usually, explore the redundancy in the model parameters and try to + it takes two to three days to train the whole model on remove the redundant and uncritical ones. Low-rank factor- + ImagetNet dataset with a NVIDIA K40 machine. Another ization based techniques use matrix/tensor decomposition to + example is the top face verification results on the Labeled estimate the informative parameters of the deep CNNs. The + Faces in the Wild (LFW) dataset were obtained with networks approaches based on transferred/compact convolutional filters + containing hundreds of millions of parameters, using a mix design special structural convolutional filters to reduce the + of convolutional, locally-connected, and fully-connected layers parameter space and save storage/computation. The knowledge + distillation methods learn a distilled model and train a more Yu Cheng is a Researcher from Microsoft AI & Research, One Microsoft + Way, Redmond, WA 98052, USA. compact neural network to reproduce the output of a larger + Duo Wang and Tao Zhang are with the Department of Automation, network. + Tsinghua University, Beijing 100084, China. In Table I, we briefly summarize these four types of Pan Zhou is with the School of Electronic Information and Communi- methods. Generally, the parameter pruning & sharing, low- cations, Huazhong University of Science and Technology, Wuhan 430074, + China. rank factorization and knowledge distillation approaches can IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 2 + + + TABLE I + SUMMARIZATION OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION . 
+ Theme Name Description Applications More details + Parameter pruning and sharing Reducing redundant parameters which Convolutional layer and Robust to various settings, can achieve + are not sensitive to the performance fully connected layer good performance, can support both train + from scratch and pre-trained model + Low-rank factorization Using matrix/tensor decomposition to Convolutional layer and Standardized pipeline, easily to be + estimate the informative parameters fully connected layer implemented, can support both train + from scratch and pre-trained model + Transferred/compact convolutional Designing special structural convolutional Convolutional layer Algorithms are dependent on applications, + filters filters to save parameters only usually achieve good performance, + only support train from scratch + Knowledge distillation Training a compact neural network with Convolutional layer and Model performances are sensitive + distilled knowledge of a large model fully connected layer to applications and network structure + only support train from scratch + + + be used in DNN models with fully connected layers and + convolutional layers, achieving comparable performances. On + the other hand, methods using transferred/compact filters are + designed for models with convolutional layers only. Low-rank + factorization and transfered/compact filters based approaches + provide an end-to-end pipeline and can be easily implemented + in CPU/GPU environment, which is straightforward. while + parameter pruning & sharing use different methods such as + vector quantization, binary coding and sparse constraints to + perform the task. Generally it will take several steps to achieve + the goal. Fig. 1. The three-stage compression method proposed in [10]: pruning, Regarding the training protocols, models based on param- quantization and encoding. The input is the original model and the output + eter pruning/sharing low-rank factorization can be extracted is the compression model. + from pre-trained ones or trained from scratch. While the + transferred/compact filter and knowledge distillation models + can only support train from scratch. These methods are inde- memory usage and float point operations with little loss in + pendently designed and complement each other. For example, classification accuracy. + transferred layers and parameter pruning & sharing can be The method proposed in [10] quantized the link weights + used together, and model quantization & binarization can be using weight sharing and then applied Huffman coding to the + used together with low-rank approximations to achieve further quantized weights as well as the codebook to further reduce + speedup. We will describe the details of each theme, their the rate. As shown in Figure 1, it started by learning the con- + properties, strengths and drawbacks in the following sections. nectivity via normal network training, followed by pruning the + small-weight connections. Finally, the network was retrained + II. P to learn the final weights for the remaining sparse connections. ARAMETER PRUNING AND SHARING This work achieved the state-of-art performance among allEarly works showed that network pruning is effective in parameter quantization based methods. It was shown in [11] reducing the network complexity and addressing the over- that Hessian weight could be used to measure the importancefitting problem [6]. 
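To make the scalar quantization idea above concrete, the following is a minimal NumPy sketch (not the implementation of [6] or [10]) of k-means weight sharing for a single layer: the weights are clustered into 2^b shared values, and only the cluster indices plus a small codebook need to be stored. The layer size, bit-width, and initialization are arbitrary choices made for illustration.

import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    """Cluster weights into 2**bits shared values with a simple Lloyd iteration."""
    flat = weights.ravel()
    k = 2 ** bits
    # Initialize centroids linearly over the weight range (a common heuristic).
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return idx.reshape(weights.shape), centroids

W = np.random.randn(256, 256).astype(np.float32)        # a toy layer
idx, codebook = kmeans_quantize(W, bits=4)
W_shared = codebook[idx]                                 # shared-weight reconstruction

original_bits = W.size * 32
compressed_bits = W.size * 4 + codebook.size * 32        # 4-bit indices + codebook
print("mean quantization error:", float(np.abs(W - W_shared).mean()))
print("layer compression rate: %.1fx" % (original_bits / compressed_bits))

In the full pipeline of [10] the shared weights are fine-tuned and the index stream is further compressed with Huffman coding, which this sketch omits.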
After that researcher found pruning orig- of network parameters, and proposed to minimize Hessian-inally introduced to reduce the structure in neural networks weighted quantization errors in average for clustering networkand hence improve generalization, it has been widely studied parameters.to compress DNN models, trying to remove parameters which In the extreme case of the 1-bit representation of eachare not crucial to the model performance. These techniques can weight, that is binary weight neural networks. There arebe further classified into three sub-categories: quantization and many works that directly train CNNs with binary weights, forbinarization, parameter sharing, and structural matrix. instance, BinaryConnect [12], BinaryNet [13] and XNORNet- + works [14]. The main idea is to directly learn binary weights orA. Quantization and Binarization activation during the model training. The systematic study in + Network quantization compresses the original network by [15] showed that networks trained with back propagation could + reducing the number of bits required to represent each weight. be resilient to specific weight distortions, including binary + Gonget al.[6] and Wu et al. [7] appliedk-means scalar weights. + quantization to the parameter values. Vanhouckeet al.[8] Drawbacks: the accuracy of the binary nets is significantly + showed that 8-bit quantization of the parameters can result lowered when dealing with large CNNs such as GoogleNet. + in significant speed-up with minimal loss of accuracy. The Another drawback of such binary nets is that existing bina- + work in [9] used 16-bit fixed-point representation in stochastic rization schemes are based on simple matrix approximations + rounding based CNN training, which significantly reduced and ignore the effect of binarization on the accuracy loss. IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 3 + + + To address this issue, the work in [16] proposed a proximal connected layers, which is often the bottleneck in terms of + Newton algorithm with diagonal Hessian approximation that memory consumption. These network layers use the nonlinear + directly minimizes the loss with respect to the binary weights. transformsf(x;M) =(Mx), where()is an element-wise + The work in [17] reduced the time on float point multiplication nonlinear operator,xis the input vector, andMis themn + in the training stage by stochastically binarizing weights and matrix of parameters [29]. WhenMis a large general dense + converting multiplications in the hidden state computation to matrix, the cost of storingmnparameters and computing + significant changes. matrix-vector products inO(mn)time. Thus, an intuitive + way to prune parameters is to imposexas a parameterizedB. Pruning and Sharing structural matrix. Anmnmatrix that can be described + Network pruning and sharing has been used both to reduce using much fewer parameters thanmnis called a structured + network complexity and to address the over-fitting issue. An matrix. Typically, the structure should not only reduce the + early approach to pruning was the Biased Weight Decay memory cost, but also dramatically accelerate the inference + [18]. The Optimal Brain Damage [19] and the Optimal Brain and training stage via fast matrix-vector multiplication and + Surgeon [20] methods reduced the number of connections gradient computations. 
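Before turning to structured matrices in detail, the connection-pruning step described above (pruning the small-weight connections, as in [10], [22]) can be illustrated with a short sketch. This is a hedged illustration rather than the authors' procedure: it thresholds weights by magnitude for an assumed target sparsity and keeps a binary mask; in the full pipeline the surviving connections are retrained with the mask held fixed and stored in a sparse format.

import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]      # k-th smallest magnitude
    mask = np.abs(weights) >= threshold       # True = keep, False = pruned
    return weights * mask, mask

W = np.random.randn(512, 512).astype(np.float32)
W_pruned, mask = magnitude_prune(W, sparsity=0.9)

kept = int(mask.sum())
print("kept %d of %d weights (%.1f%% pruned)"
      % (kept, W.size, 100.0 * (1 - kept / W.size)))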
+ based on the Hessian of the loss function, and their work sug- Following this direction, the work in [30], [31] proposed a + gested that such pruning gave higher accuracy than magnitude- simple and efficient approach based on circulant projections, + while maintaining competitive error rates. Given a vectorr=based pruning such as the weight decay method. The training (rprocedure of those methods followed the way training from 0 ;r 1 ;;r d1 ), a circulant matrixR2Rdd is defined + as: + scratch manner. 2 3 r A recent trend in this direction is to prune redundant, 0 rd1 ::: r 2 r1 6r6 1 r0 rd1 r2 77 non-informative weights in a pre-trained CNN model. For 6 .. . 7 + example, Srinivas and Babu [21] explored the redundancy R= circ(r) :=66 . r . .. . 71 r0 . 7: (1)6 . 7 among neurons, and proposed a data-free pruning method to 4r . .. .. 5d2 rd1 + remove redundant neurons. Hanet al.[22] proposed to reduce rd1 rd2 ::: r 1 r0 + the total number of parameters and operations in the entire thus the memory cost becomesO(d)instead ofO(d2 ).network. Chenet al.[23] proposed a HashedNets model that This circulant structure also enables the use of Fast Fourierused a low-cost hash function to group weights into hash Transform (FFT) to speed up the computation. Given ad-buckets for parameter sharing. The deep compression method dimensional vectorr, the above 1-layer circulant neural net-in [10] removed the redundant connections and quantized the work in Eq. 1 has time complexity ofO(dlogd).weights, and then used Huffman coding to encode the quan- In [32], a novel Adaptive Fastfood transform was introducedtized weights. In [24], a simple regularization method based to reparameterize the matrix-vector multiplication of fullyon soft weight-sharing was proposed, which included both connected layers. The Adaptive Fastfood transform matrixquantization and pruning in one simple (re-)training procedure. R2Rnd was defined as:The above pruning schemes typically produce connections + pruning in CNNs. R=SHGHB (2) + There is also growing interest in training compact CNNs whereS,GandBare random diagonal matrices. 2 + with sparsity constraints. Those sparsity constraints are typ- f0;1gdd is a random permutation matrix, andHdenotes + ically introduced in the optimization problem asl0 orl1 - the Walsh-Hadamard matrix. Reparameterizing a fully con- + norm regularizers. The work in [25] imposed group sparsity nected layer withdinputs andnoutputs using the Adaptive + constraint on the convolutional filters to achieve structured Fastfood transform reduces the storage and the computational + brain Damage, i.e., pruning entries of the convolution kernels costs fromO(nd)toO(n)and fromO(nd)toO(nlogd), + in a group-wise fashion. In [26], a group-sparse regularizer respectively. + on neurons was introduced during the training stage to learn The work in [29] showed the effectiveness of the new + compact CNNs with reduced filters. Wenet al.[27] added a notion of parsimony in the theory of structured matrices. Their + structured sparsity regularizer on each layer to reduce trivial proposed method can be extended to various other structured + filters, channels or even layers. In the filter-level pruning, all matrix classes, including block and multi-level Toeplitz-like + the above works usedl2;1 -norm regularizers. The work in [28] [33] matrices related to multi-dimensional convolution [34]. + usedl1 -norm to select and prune unimportant filters. 
Following this idea, [35] proposed a general structured effi- + Drawbacks: there are some potential issues of the pruning cient linear layer for CNNs. + and sharing. First, pruning withl1 orl2 regularization requires Drawbacks: one problem of this kind of approaches is that + more iterations to converge than general. In addition, all the structural constraint will hurt the performance since the + pruning criteria require manual setup of sensitivity for layers, constraint might bring bias to the model. On the other hand, + which demands fine-tuning of the parameters and could be how to find a proper structural matrix is difficult. There is no + cumbersome for some applications. theoretical way to derive it out. + + C. Designing Structural Matrix III. L OW -RANK FACTORIZATION AND SPARSITY + In architectures that contain fully-connected layers, it is Convolution operations contribute the bulk of most com- + critical to explore this redundancy of parameters in fully- putations in deep CNNs, thus reducing the convolution layer IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 4 + + + TABLE II + COMPARISONS BETWEEN THE LOW -RANK MODELS AND THEIR BASELINES + ON ILSVRC-2012. + Model TOP-5 Accuracy Speed-up Compression Rate + AlexNet 80.03% 1. 1. + BN Low-rank 80.56% 1.09 4.94 + CP Low-rank 79.66% 1.82 5. + VGG-16 90.60% 1. 1. + Fig. 2. A typical framework of the low-rank regularization method. The left BN Low-rank 90.47% 1.53 2.72 + is the original convolutional layer and the right is the low-rank constraint CP Low-rank 90.31% 2.05 2.75 + convolutional layer with rank-K. GoogleNet 92.21% 1. 1. + BN Low-rank 91.88% 1.08 2.79 + CP Low-rank 91.79% 1.20 2.84 + would improve the compression rate as well as the overall + speedup. For the convolution kernels, it can be viewed as a + 4D tensor. Ideas based on tensor decomposition is derived by For instance, Mishaet al.[41] reduced the number of dynamic + the intuition that there is a significant amount of redundancy parameters in deep models using the low-rank method. [42] + in the 4D tensor, which is a particularly promising way to explored a low-rank matrix factorization of the final weight + remove the redundancy. Regarding the fully-connected layer, layer in a DNN for acoustic modeling. In [3], Luet al.adopted + it can be view as a 2D matrix and the low-rankness can also truncated SVD (singular value decomposition) to decompsite + help. the fully connected layer for designing compact multi-task + It has been a long time for using low-rank filters to acceler- deep learning architectures. + ate convolution, for example, high dimensional DCT (discrete Drawbacks: low-rank approaches are straightforward for + cosine transform) and wavelet systems using tensor products model compression and acceleration. The idea complements + to be constructed from 1D DCT transform and 1D wavelets recent advances in deep learning, such as dropout, recti- + respectively. Learning separable 1D filters was introduced fied units and maxout. However, the implementation is not + by Rigamontiet al.[36], following the dictionary learning that easy since it involves decomposition operation, which + idea. Regarding some simple DNN models, a few low-rank is computationally expensive. Another issue is that current + approximation and clustering schemes for the convolutional methods perform low-rank approximation layer by layer, and + kernels were proposed in [37]. 
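A common way to exploit this low-rankness in a fully connected layer is truncated SVD, as used for example in [3], [42]: an m-by-n weight matrix W is replaced by two rank-k factors, so a single layer becomes two thinner layers with k(m+n) parameters instead of mn. The sketch below uses arbitrary sizes and a random matrix purely to show the bookkeeping; trained weight matrices are far more redundant than a random one, so the approximation error reported here is pessimistic.

import numpy as np

m, n, k = 1024, 1024, 64                 # layer size and target rank (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n)).astype(np.float32)
x = rng.standard_normal(n).astype(np.float32)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                     # m x k factor
B = Vt[:k, :]                            # k x n factor

y_full = W @ x                           # m*n multiply-adds
y_lowrank = A @ (B @ x)                  # k*(m+n) multiply-adds

print("relative output error: %.3f"
      % (np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full)))
print("parameter reduction: %.1fx" % ((m * n) / (k * (m + n))))
# In practice W comes from a trained network and the two factors are fine-tuned
# afterwards to recover accuracy, as described above.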
They achieved 2speedup thus cannot perform global parameter compression, which + for a single convolutional layer with 1% drop in classification is important as different layers hold different information. + accuracy. The work in [38] proposed using different tensor Finally, factorization requires extensive model retraining to + decomposition schemes, reporting a 4.5speedup with 1% achieve convergence when compared to the original model. + drop in accuracy in text recognition. + The low-rank approximation was done layer by layer. The IV. T RANSFERRED /COMPACT CONVOLUTIONAL FILTERS + parameters of one layer were fixed after it was done, and the CNNs are parameter efficient due to exploring the trans-layers above were fine-tuned based on a reconstruction error lation invariant property of the representations to the inputcriterion. These are typical low-rank methods for compressing image, which is the key to the success of training very deep2D convolutional layers, which is described in Figure 2. Fol- models without severe over-fitting. Although a strong theorylowing this direction, Canonical Polyadic (CP) decomposition is currently missing, a large number of empirical evidenceof was proposed for the kernel tensors in [39]. Their work support the notion that both the translation invariant propertyused nonlinear least squares to compute the CP decomposition. and the convolutional weight sharing are important for good In [40], a new algorithm for computing the low-rank tensor predictive performance. The idea of using transferred convolu-decomposition for training low-rank constrained CNNs from tional filters to compress CNN models is motivated by recentscratch were proposed. It used Batch Normalization (BN) to works in [43], which introduced the equivariant group theory.transform the activation of the internal hidden units. In general, Letxbe an input,()be a network or layer andT()be theboth the CP and the BN decomposition schemes in [40] (BN transform matrix. The concept of equivalence is defined as:Low-rank) can be used to train CNNs from scratch. However, + there are few differences between them. For example, finding T‘ (x) = (Tx) (3)the best low-rank approximation in CP decomposition is an ill- + posed problem, and the best rank-K(Kis the rank number) indicating that transforming the inputxby the transformT() + approximation may not exist sometimes. While for the BN and then passing it through the network or layer()should + scheme, the decomposition always exists. We perform a simple give the same result as first mappingxthrough the network + comparison of both methods shown in Table II. The actual and then transforming the representation. Note that in Eq. + speedup and the compression rates are used to measure their (10), the transformsT()andT0 ()are not necessarily the + performances. same as they operate on different objects. According to this + As we mentioned before, the fully connected layers can theory, it is reasonable applying transform to layers or filters + be viewed as a 2D matrix and thus the above mentioned ()to compress the whole network models. From empirical + methods can also be applied there. There are several classical observation, deep CNNs also benefit from using a large set of + works on exploiting low-rankness in fully connected layers. 
convolutional filters by applying certain transformT()to a IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 5 + + + small set of base filters since it acts as a regularizer for the TABLE III + model. ASIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND + Following this direction, there are many recent reworks CIFAR-100. + proposed to build a convolutional layer from a set of base Model CIFAR-100 CIFAR-10 Compression Rate + filters [43]–[46]. What they have in common is that the VGG-16 34.26% 9.85% 1. + transformT()lies in the family of functions that only operate MBA [46] 33.66% 9.76% 2. + CRELU [45] 34.57% 9.92% 2. in the spatial domain of the convolutional filters. For example, CIRC [43] 35.15% 10.23% 4. + the work in [45] found that the lower convolution layers of DCNN [44] 33.57% 9.65% 1.62 + CNNs learned redundant filters to extract both positive and + negative phase information of an input signal, and definedT() Drawbacks: there are few issues to be addressed for ap-to be the simple negation function: proaches that apply transform constraints to convolutional fil- + T(Wx ) =W (4) ters. First, these methods can achieve competitive performance x for wide/flat architectures (like VGGNet) but not thin/deepwhereWx is the basis convolutional filter andW is the filter x ones (like GoogleNet, Residual Net). Secondly, the transferconsisting of the shifts whose activation is opposite to that assumptions sometimes are too strong to guide the learning,ofWx and selected after max-pooling operation. By doing making the results unstable in some cases.this, the work in [45] can easily achieve 2compression Using a compact filter for convolution can directly reducerate on all the convolutional layers. It is also shown that the the computation cost. The key idea is to replace the loosenegation transform acts as a strong regularizer to improve and over-parametric filters with compact blocks to improve the classification accuracy. The intuition is that the learning the speed, which significantly accelerate CNNs on severalalgorithm with pair-wise positive-negative constraint can lead benchmarks. Decomposing33convolution into two11to useful convolutional filters instead of redundant ones. convolutions was used in [48], which achieved significantIn [46], it was observed that magnitudes of the responses acceleration on object recognition. SqueezeNet [49] was pro-from convolutional kernels had a wide diversity of pattern posed to replace33convolution with11convolu-representations in the network, and it was not proper to discard tion, which created a compact neural network with about 50weaker signals with a single threshold. Thus a multi-bias non- fewer parameters and comparable accuracy when compared tolinearity activation function was proposed to generates more AlexNet.patterns in the feature space at low computational cost. The + transformT()was define as: V. K NOWLEDGE DISTILLATION T‘ (x) =Wx + (5) To the best of our knowledge, exploiting knowledge transfer + wherewere the multi-bias factors. The work in [47] con- (KT) to compress model was first proposed by Caruanaet + sidered a combination of rotation by a multiple of90 and al.[50]. They trained a compressed/ensemble model of strong + horizontal/vertical flipping with: classifiers with pseudo-data labeled, and reproduced the output + of the original larger network. But the work is limited toT‘ (x) =WT (6) shallow models. 
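Before continuing with knowledge distillation, the negation transform of Eq. (4) is worth a concrete sketch. In the construction of [45], each stored base filter also contributes its negation, so the layer produces twice as many response maps as it has learned filters, giving the 2x compression of convolutional parameters mentioned above. The PyTorch snippet below illustrates only this parameter sharing (filter and input sizes are arbitrary), not the full CReLU architecture.

import torch
import torch.nn.functional as F

# A small base filter bank: 16 filters, 3 input channels, 3x3 kernels.
base_filters = torch.randn(16, 3, 3, 3)
x = torch.randn(1, 3, 32, 32)

# Eq. (4): the transferred set contains each base filter and its negation,
# so 32 output channels are produced while only 16 filters are stored.
transferred = torch.cat([base_filters, -base_filters], dim=0)

y = F.conv2d(x, transferred, padding=1)
print(y.shape)            # torch.Size([1, 32, 32, 32])
# The negated responses become informative after a one-sided nonlinearity such
# as ReLU, which is why [45] pairs this sharing with CReLU-style activations.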
The idea has been recently adopted in [51] + whereWT was the transformation matrix which rotated the as knowledge distillation (KD) to compress deep and wide + original filters with angle2 f90;180;270g. In [43], the networks into shallower ones, where the compressed model + transform was generalized to any angle learned from data, and mimicked the function learned by the complex model. The + was directly obtained from data. Both works [47] and [43] main idea of KD based approaches is to shift knowledge from + can achieve good classification performance. a large teacher model into a small one by learning the class + The work in [44] definedT()as the set of translation distributions output via softmax. + functions applied to 2D filters: The work in [52] introduced a KD compression framework, + which eased the training of deep networks by following aT‘ (x) =T(;x;y)x;y2fk;:::;kg;(x;y)6=(0;0) (7) student-teacher paradigm, in which the student was penalized + whereT(;x;y)denoted the translation of the first operand by according to a softened version of the teacher’s output. The + (x;y)along its spatial dimensions, with proper zero padding framework compressed an ensemble of teacher networks into + at borders to maintain the shape. The proposed framework a student network of similar depth. The student was trained + can be used to 1) improve the classification accuracy as a to predict the output and the classification labels. Despite + regularized version of maxout networks, and 2) to achieve its simplicity, KD demonstrates promising results in various + parameter efficiency by flexibly varying their architectures to image classification tasks. The work in [53] aimed to address + compress networks. the network compression problem by taking advantage of + Table III briefly compares the performance of different depth neural networks. It proposed an approach to train thin + methods with transferred convolutional filters, using VGGNet but deep networks, called FitNets, to compress wide and + (16 layers) as the baseline model. The results are reported shallower (but still deep) networks. The method was extended + on CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is the idea to allow for thinner and deeper student models. In + observed that they can achieve reduction in parameters with order to learn from the intermediate representations of teacher + little or no drop in classification accuracy. network, FitNet made the student mimic the full feature maps IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 6 + + + of the teacher. However, such assumptions are too strict since layer with global average pooling [44], [62]. Network architec- + the capacities of teacher and student may differ greatly. ture such as GoogleNet or Network in Network, can achieve + All the above approaches are validated on MNIST, CIFAR- state-of-the-art results on several benchmarks by adopting + 10, CIFAR-100, SVHN and AFLW benchmark datasets, and this idea. However, these architectures have not been fully + experimental results show that these methods match or outper- optimized the utilization of the computing resources inside + form the teacher’s performance, while requiring notably fewer the network. This problem was noted by Szegedyet al.[62] + parameters and multiplications. and motivated them to increase the depth and width of the + There are several extension along this direction of dis- network while keeping the computational budget constant. 
+ tillation knowledge. The work in [54] trained a parametric The work in [63] targeted the Residual Network based + student model to approximate a Monte Carlo teacher. The model with a spatially varying computation time, called + proposed framework used online training, and used deep stochastic depth, which enabled the seemingly contradictory + neural networks for the student model. Different from previous setup to train short networks and used deep networks at test + works which represented the knowledge using the soften label time. It started with very deep networks, while during training, + probabilities, [55] represented the knowledge by using the for each mini-batch, randomly dropped a subset of layers + neurons in the higher hidden layer, which preserved as much and bypassed them with the identity function. Following this + information as the label probabilities, but are more compact. direction, thew work in [64] proposed a pyramidal residual + The work in [56] accelerated the experimentation process by networks with stochastic depth. In [65], Wuet al.proposed + instantaneously transferring the knowledge from a previous an approach that learns to dynamically choose which layers + network to each new deeper or wider network. The techniques of a deep network to execute during inference so as to best + are based on the concept of function-preserving transfor- reduce total computation. Veitet al.exploited convolutional + mations between neural network specifications. Zagoruyko networks with adaptive inference graphs to adaptively define + et al.[57] proposed Attention Transfer (AT) to relax the their network topology conditioned on the input image [66]. + assumption of FitNet. They transferred the attention maps that Other approaches to reduce the convolutional overheads in-are summaries of the full activations. clude using FFT based convolutions [67] and fast convolutionDrawbacks: KD-based approaches can make deeper models using the Winograd algorithm [68]. Zhaiet al.[69] proposed athinner and help significantly reduce the computational cost. strategy call stochastic spatial sampling pooling, which speed-However, there are a few disadvantages. One of those is that up the pooling operations by a more general stochastic version.KD can only be applied to classification tasks with softmax Saeedanet al.presented a novel pooling layer for convolu-loss function, which hinders its usage. Another drawback is tional neural networks termed detail-preserving pooling (DPP),the model assumptions sometimes are too strict to make the based on the idea of inverse bilateral filters [70]. Those worksperformance competitive with other type of approaches. only aim to speed up the computation but not reduce the + memory storage.VI. O THER TYPES OF APPROACHES + We first summarize the works utilizing attention-based + methods. Note that attention-based mechanism [58] can reduce VII. B ENCHMARKS , E VALUATION AND DATABASES + computations significantly by learning to selectively focus or In the past five years the deep learning community had“attend” to a few, task-relevant input regions. The work in made great efforts in benchmark models. One of the most[59] introduced the dynamic capacity network (DCN) that well-known model used in compression and acceleration forcombined two types of modules: the small sub-networks with CNNs is Alexnet [1], which has been occasionally usedlow capacity, and the large ones with high capacity. The low- for assessing the performance of compression. 
Other popularcapacity sub-networks were active on the whole input to first standard models include LeNets [71], All-CNN-nets [72] andfind the task-relevant areas, and then the attention mechanism many others. LeNet-300-100 is a fully connected networkwas used to direct the high-capacity sub-networks to focus on with two hidden layers, with 300 and 100 neurons each.the task-relevant regions. By dong this, the size of the CNNs LeNet-5 is a convolutional network that has two convolutionalmodel has been significantly reduced. layers and two fully connected layers. Recently, more andFollowing this direction, the work in [60] introduced the more state-of-the-art architectures are used as baseline modelsconditional computation idea, which only computes the gra- in many works, including network in networks (NIN) [73],dient for some important neurons. It proposed a sparsely- VGG nets [74] and residual networks (ResNet) [75]. Table IVgated mixture-of-experts Layer (MoE). The MoE module summarizes the baseline models commonly used in severalconsisted of a number of experts, each a simple feed-forward typical compression methods.neural network, and a trainable gating network that selected + a sparse combination of the experts to process each input. In The standard criteria to measure the quality of model + [61], dynamic deep neural networks (D2NN) were introduced, compression and acceleration are the compression and the + which were a type of feed-forward deep neural network that speedup rates. Assume thatais the number of the parameters + selected and executed a subset of D2NN neurons based on the in the original modelManda is that of the compressed + input. modelM , then the compression rate(M;M )ofM over + There have been other attempts to reduce the number of Mis aparameters of neural networks by replacing the fully connected (M;M ) = : (8)a IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 7 + + + TABLE IV or low rank factorization based methods. If you need + SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT end-to-end solutions for your problem, the low rank REPRESENTATIVE WORKS OF NETWORK COMPRESSION . and transferred convolutional filters approaches could be + Baseline Models Representative Works considered. + Alexnet [1] structural matrix [29], [30], [32] For applications in some specific domains, methods with low-rank factorization [40] human prior (like the transferred convolutional filters, Network in network [73] low-rank factorization [40] + VGG nets [74] transferred filters [44] structural matrix) sometimes have benefits. For example, + low-rank factorization [40] when doing medical images classification, transferred Residual networks [75] compact filters [49], stochastic depth [63] convolutional filters could work well as medical images parameter sharing [24] + All-CNN-nets [72] transferred filters [45] (like organ) do have the rotation transformation property. + LeNets [71] parameter sharing [24] Usually the approaches of pruning & sharing could give parameter pruning [20], [22] reasonable compression rate while not hurt the accuracy. + Thus for applications which requires stable model accu- + Another widely used measurement is the index space saving racy, it is better to utilize pruning & sharing. + defined in several papers [30], [35] as If your problem involves small/medium size datasets, you + can try the knowledge distillation approaches. 
The com-aa + (M;M ) = ; (9) pressed student model can take the benefit of transferringa knowledge from teacher model, making it robust datasets + whereaandaare the number of the dimension of the index which are not large. + space in the original model and that of the compressed model, As we mentioned before, techniques of the four groups + respectively. are orthogonal. It is reasonable to combine two or three + Similarly, given the running timesofMands ofM , of them to maximize the performance. For some spe- + the speedup rate(M;M )is defined as: cific applications, like object detection, which requires + s both convolutional and fully connected layers, you can(M;M ) = : (10)s compress the convolutional layers with low rank based + Most work used the average training time per epoch to measure method and the fully connected layers with a pruning + the running time, while in [30], [35], the average testing time technique. + was used. Generally, the compression rate and speedup rate B. Technique Challengesare highly correlated, as smaller models often results in faster + computation for both the training and the testing stages. Techniques for deep model compression and acceleration + Good compression methods are expected to achieve almost are still in the early stage and the following challenges still + the same performance as the original model with much smaller need to be addressed. + parameters and less computational time. However, for different Most of the current state-of-the-art approaches are built + applications with different CNN designs, the relation between on well-designed CNN models, which have limited free- + parameter size and computational time may be different. dom to change the configuration (e.g., network structural, + For example, it is observed that for deep CNNs with fully hyper-parameters). To handle more complicated tasks, + connected layers, most of the parameters are in the fully it should provide more plausible ways to configure the + connected layers; while for image classification tasks, float compressed models. + point operations are mainly in the first few convolutional layers Pruning is an effective way to compress and acceler- + since each filter is convolved with the whole image, which is ate CNNs. The current pruning techniques are mostly + usually very large at the beginning. Thus compression and designed to eliminate connections between neurons. On + acceleration of the network should focus on different type of the other hand, pruning channel can directly reduce the + layers for different applications. feature map width and shrink the model into a thinner + one. It is efficient but also challenging because removing + VIII. D ISCUSSION AND CHALLENGES channels might dramatically change the input of the + following layer.In this paper, we summarized recent efforts on compressing + and accelerating deep neural networks (DNNs). Here we dis- As we mentioned before, methods of structural matrix + and transferred convolutional filters impose prior humancuss more details about how to choose different compression knowledge to the model, which could significantly affectapproaches, and possible challenges/solutions on this area. the performance and stability. It is critical to investigate + how to control the impact of those prior knowledge.A. General Suggestions The methods of knowledge distillation provide many ben- + There is no golden rule to measure which approach is the efits such as directly accelerating model without special + best. 
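As a side note on the evaluation criteria of Section VII: the three measurements in Eqs. (8)-(10) are simple ratios and can be computed directly from parameter counts and running times, as in the small sketch below. The ResNet-50-style numbers are only illustrative, assuming roughly 25.6M original parameters reduced by 75% and a halved running time, in the spirit of the example quoted in the introduction.

def compression_rate(a, a_star):
    """Eq. (8): parameters of the original model over those of the compressed model."""
    return a / a_star

def index_space_saving(a, a_star):
    """Eq. (9): relative reduction of the index space, as used in [30], [35]."""
    return (a - a_star) / a_star

def speedup_rate(s, s_star):
    """Eq. (10): running time of the original model over that of the compressed model."""
    return s / s_star

print(compression_rate(25.6e6, 6.4e6))    # 4.0
print(index_space_saving(25.6e6, 6.4e6))  # 3.0
print(speedup_rate(1.0, 0.5))             # 2.0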
How to choose the proper method is really depending hardware or implementations. It is still worthy developing + on the applications and requirements. Here are some general KD-based approaches and exploring how to improve their + guidance we can provide: performances. + If the applications need compacted models from pre- Hardware constraints in various of small platforms (e.g., + trained models, you can choose either pruning & sharing mobile, robotic, self-driving car) are still a major problem IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 8 + + + to hinder the extension of deep CNNs. How to make full see more work for applications with larger deep nets (e.g., + use of the limited computational source and how to design video and image frames [88], [89]). + special compression methods for such platforms are still + challenges that need to be addressed. IX. ACKNOWLEDGMENTS + Despite the great achievements of these compression ap- + proaches, the black box mechanism is still the key barrier The authors would like to thank the reviewers and broader + to the adoption. Exploring the knowledge interpret-ability community for their feedback on this survey. In particular, + is still an important problem. we would like to thank Hong Zhao from the Department of + Automation of Tsinghua University for her help on modifying + C. Possible Solutions the paper. This research is supported by National Science + Foundation of China with Grant number 61401169.To solve the hyper-parameters configuration problem, we + can rely on the recent learning-to-learn strategies [76], [77]. + This framework provides a mechanism allowing the algorithm REFERENCES + to automatically learn how to exploit structure in the problem [1]A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with of interest. Very recently, leveraging reinforcement learning deep convolutional neural networks,” inNIPS, 2012. + to efficiently sample the design space and improve the model [2]Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the + compression has also been tried [78]. gap to human-level performance in face verification,” inCVPR, 2014. + [3]Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, “Fully- Channel pruning provides the efficiency benefit on both adaptive feature sharing in multi-task networks with applications in + CPU and GPU because no special implementation is required. person attribute classification,”CoRR, vol. abs/1611.05377, 2016. + But it is also challenging to handle the input configuration. [4]J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, + M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale One possible solution is to use the training-based channel distributed deep networks,” inNIPS, 2012. + pruning methods [79], which focus on imposing sparse con- [5]K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image + straints on weights during training. However, training from recognition,”CoRR, vol. abs/1512.03385, 2015. + [6]Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing scratch for such method is costly for very deep CNNs. In deep convolutional networks using vector quantization,”CoRR, vol. + [80], the authors provided an iterative two-step algorithm to abs/1412.6115, 2014. + effectively prune channels in each layer. [7]Y. W. Q. H. Jiaxiang Wu, Cong Leng and J. 
Cheng, “Quantized + convolutional neural networks for mobile devices,” inIEEE Conference Exploring new types of knowledge in the teacher models on Computer Vision and Pattern Recognition (CVPR), 2016. + and transferring it to the student models is useful for the [8]V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of + knowledge distillation (KD) approaches. Instead of directly re- neural networks on cpus,” inDeep Learning and Unsupervised Feature + Learning Workshop, NIPS 2011, 2011. ducing and transferring parameters, passing selectivity knowl- [9]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep + edge of neurons could be helpful. One can derive a way to learning with limited numerical precision,” inProceedings of the + select essential neurons related to the task [81], [82]. The 32Nd International Conference on International Conference on Machine + Learning - Volume 37, ser. ICML’15, 2015, pp. 1737–1746. intuition is that if a neuron is activated in certain regions [10]S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing + or samples, that implies these regions or samples share some deep neural networks with pruning, trained quantization and huffman + common properties that may relate to the task. coding,”International Conference on Learning Representations (ICLR), + 2016. For methods with the convolutional filters and the structural [11]Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network + matrix, we can conclude that the transformation lies in the quantization,”CoRR, vol. abs/1612.01543, 2016. + family of functions that only operations on the spatial dimen- [12]M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training deep + neural networks with binary weights during propagations,” inAdvances sions. Hence to address the imposed prior issue, one solution is in Neural Information Processing Systems 28: Annual Conference on + to provide a generalization of the aforementioned approaches Neural Information Processing Systems 2015, December 7-12, 2015, + in two aspects: 1) instead of limiting the transformation to Montreal, Quebec, Canada, 2015, pp. 3123–3131. + [13]M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural net- belong to a set of predefined transformations, let it be the works with weights and activations constrained to +1 or -1,”CoRR, vol. + whole family of spatial transformations applied on 2D filters abs/1602.02830, 2016. + or matrix, and 2) learn the transformation jointly with all the [14]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: + Imagenet classification using binary convolutional neural networks,” in model parameters. ECCV, 2016. + Regarding the use of CNNs in small platforms, proposing [15]P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, + some general/unified approaches is one direction. Wanget al. “Deep neural networks are robust to weight binarization and other non- + [83] presented a feature map dimensionality reduction method linear distortions,”CoRR, vol. abs/1606.01981, 2016. + [16]L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep by excavating and removing redundancy in feature maps gen- networks,”CoRR, vol. abs/1611.01600, 2016. + erated from different filters, which could also preserve intrinsic [17]Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks + information of the original network. The idea can be applied with few multiplications,”CoRR, vol. abs/1510.03009, 2015. + [18]S. J. Hanson and L. Y. 
Pratt, “Comparing biases for minimal network to make CNNs more applicable for different platforms. The construction with back-propagation,” inAdvances in Neural Information + work in [84] proposed a one-shot whole network compression Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177–185. + scheme consisting of three components: rank selection, low- [19]Y. L. Cun, J. S. Denker, and S. A. Solla, “Advances in neural information + processing systems 2,” D. S. Touretzky, Ed., 1990, ch. Optimal Brain rank tensor decomposition, and fine-tuning to make deep Damage, pp. 598–605. + CNNs work in mobile devices. [20]B. Hassibi, D. G. Stork, and S. C. R. Com, “Second order derivatives + Despite the classification task, people are also adapting the for network pruning: Optimal brain surgeon,” inAdvances in Neural + Information Processing Systems 5. Morgan Kaufmann, 1993, pp. 164– compacted models in other tasks [85]–[87]. We would like to 171. IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 9 + + + + [21]S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural [43]T. S. Cohen and M. Welling, “Group equivariant convolutional net- + networks,” inProceedings of the British Machine Vision Conference works,”arXiv preprint arXiv:1602.07576, 2016. + 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. [44]S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural + 31.1–31.12. networks,” inAdvances In Neural Information Processing Systems, 2016, + [22]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and pp. 1082–1090. + connections for efficient neural networks,” inProceedings of the 28th [45]W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and + International Conference on Neural Information Processing Systems, ser. improving convolutional neural networks via concatenated rectified + NIPS’15, 2015. linear units,”arXiv preprint arXiv:1603.05201, 2016. + [23]W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Com- [46]H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in + pressing neural networks with the hashing trick.” JMLR Workshop and deep neural networks,”arXiv preprint arXiv:1604.00676, 2016. + Conference Proceedings, 2015. [47]S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic + [24]K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural symmetry in convolutional neural networks,” inProceedings of the + network compression,”CoRR, vol. abs/1702.04008, 2017. 33rd International Conference on International Conference on Machine + [25]V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain Learning - Volume 48, ser. ICML’16, 2016. + damage,” in2016 IEEE Conference on Computer Vision and Pattern [48]C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception- + Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, resnet and the impact of residual connections on learning.”CoRR, vol. + pp. 2554–2564. abs/1602.07261, 2016. + [26]H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact [49]B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, + cnns,” inEuropean Conference on Computer Vision, Amsterdam, the small, low power fully convolutional neural networks for real-time object + Netherlands, October 2016, pp. 662–677. detection for autonomous driving,”CoRR, vol. abs/1612.01051, 2016. + [27]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured [50]C. 
Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,”ˇ + sparsity in deep neural networks,” inAdvances in Neural Information inProceedings of the 12th ACM SIGKDD International Conference on + Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, Knowledge Discovery and Data Mining, ser. KDD ’06, 2006, pp. 535– + I. Guyon, and R. Garnett, Eds., 2016, pp. 2074–2082. 541. + [28]H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning [51]J. Ba and R. Caruana, “Do deep nets really need to be deep?” in + filters for efficient convnets,”CoRR, vol. abs/1608.08710, 2016. Advances in Neural Information Processing Systems 27: Annual Confer- + [29]V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for ence on Neural Information Processing Systems 2014, December 8-13 + small-footprint deep learning,” inAdvances in Neural Information Pro- 2014, Montreal, Quebec, Canada, 2014, pp. 2654–2662. + cessing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, [52]G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a + and R. Garnett, Eds., 2015, pp. 3088–3096. neural network,”CoRR, vol. abs/1503.02531, 2015. + [30]Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. [53]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and + Chang, “An exploration of parameter redundancy in deep networks with Y. Bengio, “Fitnets: Hints for thin deep nets,”CoRR, vol. abs/1412.6550, + circulant projections,” inInternational Conference on Computer Vision 2014. + (ICCV), 2015. [54]A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, + [31]Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and “Bayesian dark knowledge,” inAdvances in Neural Information Process- + S. Chang, “Fast neural networks with circulant projections,”CoRR, vol. ing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, + abs/1502.03436, 2015. and R. Garnett, Eds., 2015, pp. 3420–3428. + [32]Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, [55]P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression + and Z. Wang, “Deep fried convnets,” inInternational Conference on by distilling knowledge from neurons,” inProceedings of the Thirtieth + Computer Vision (ICCV), 2015. AAAI Conference on Artificial Intelligence, February 12-17, 2016, + [33]J. Chun and T. Kailath,Generalized Displacement Structure for Block- Phoenix, Arizona, USA., 2016, pp. 3560–3566. + Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidel- [56]T. Chen, I. J. Goodfellow, and J. Shlens, “Net2net: Accelerating learning + berg: Springer Berlin Heidelberg, 1991, pp. 215–236. via knowledge transfer,”CoRR, vol. abs/1511.05641, 2015. + [34]M. V. Rakhuba and I. V. Oseledets, “Fast multidimensional convolution [57]S. Zagoruyko and N. Komodakis, “Paying more attention to attention: + in low-rank tensor formats via cross approximation,”SIAM J. Scientific Improving the performance of convolutional neural networks via atten- + Computing, vol. 37, no. 2, 2015. tion transfer,”CoRR, vol. abs/1612.03928, 2016. + [35]M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, “Acdc: [58]D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by + A structured efficient linear layer,” inInternational Conference on jointly learning to align and translate,”CoRR, vol. abs/1409.0473, 2014. + Learning Representations (ICLR), 2016. [59]A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and + [36]R. Rigamonti, A. Sironi, V. Lepetit, and P. 
Fua, “Learning separable A. C. Courville, “Dynamic capacity networks,” inProceedings of the + filters,” in2013 IEEE Conference on Computer Vision and Pattern 33nd International Conference on Machine Learning, ICML 2016, New + Recognition, Portland, OR, USA, June 23-28, 2013, 2013, pp. 2754– York City, NY, USA, June 19-24, 2016, 2016, pp. 2549–2558. + 2761. [60]N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, + [37]E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, and J. Dean, “Outrageously large neural networks: The sparsely-gated + “Exploiting linear structure within convolutional networks for efficient mixture-of-experts layer,” 2017. + evaluation,” inAdvances in Neural Information Processing Systems 27, [61]D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and + Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. J. Odobez, “Deep dynamic neural networks for multimodal gesture + Weinberger, Eds., 2014, pp. 1269–1277. segmentation and recognition,”IEEE Trans. Pattern Anal. Mach. Intell., + [38]M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional vol. 38, no. 8, pp. 1583–1597, 2016. + neural networks with low rank expansions,” inProceedings of the British [62]C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, + Machine Vision Conference. BMVA Press, 2014. V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” + [39]V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempit- inComputer Vision and Pattern Recognition (CVPR), 2015. + sky, “Speeding-up convolutional neural networks using fine-tuned cp- [63]G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger,Deep + decomposition,”CoRR, vol. abs/1412.6553, 2014. Networks with Stochastic Depth, 2016. + [40]C. Tai, T. Xiao, X. Wang, and W. E, “Convolutional neural networks [64]Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal residual + with low-rank regularization,” vol. abs/1511.06067, 2015. networks with separated stochastic depth,”CoRR, vol. abs/1612.01230, + [41]M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, 2016. + “Predicting parameters in deep learning,” in Advances in Neural [65]Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and + Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, R. Feris, “Blockdrop: Dynamic inference paths in residual networks,” + Z. Ghahramani, and K. Weinberger, Eds., 2013, pp. 2148–2156. inCVPR, 2018. + [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper [66]A. Veit and S. Belongie, “Convolutional networks with adaptive infer- + files/nips26/1053.pdf ence graphs,” 2018. + [42]T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramab- [67]M. Mathieu, M. Henaff, and Y. Lecun,Fast training of convolutional + hadran, “Low-rank matrix factorization for deep neural network training networks through FFTs, 2014. + with high-dimensional output targets,” inin Proc. IEEE Int. Conf. on [68]A. Lavin and S. Gray, “Fast algorithms for convolutional neural net- + Acoustics, Speech and Signal Processing, 2013. works,” in2016 IEEE Conference on Computer Vision and Pattern IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 10 + + + + Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, [89]L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong, + pp. 4013–4021. M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X. + [69]S. Zhai, H. Wu, A. 
Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he was a Research Staff Member at IBM T.J. Watson Research Center. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in deep generative models, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR, and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. His research interests are in deep learning, particularly few-shot learning and deep generative models. He also works on applications in computer vision and robotic vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of Electronic Information and Communications, HUST, Wuhan, China. He received his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interests include big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University, where he serves as Associate Dean of the School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.
\ No newline at end of file
diff --git a/Corpus/A guide to convolution arithmetic for deep learning.txt b/Corpus/A guide to convolution arithmetic for deep learning.txt new file mode 100644 index 0000000..a47ff7f Binary files /dev/null and b/Corpus/A guide to convolution arithmetic for deep learning.txt differ
diff --git a/Corpus/Analysis and Design of Echo State Networks.txt b/Corpus/Analysis and Design of Echo State Networks.txt new file mode 100644 index 0000000..ec72712 --- /dev/null +++ b/Corpus/Analysis and Design of Echo State Networks.txt @@ -0,0 +1,1298 @@
 LETTER Communicated by Herbert Jaeger

 Analysis and Design of Echo State Networks

 Mustafa C.
Ozturk + can@cnel.ufl.edu + Dongming Xu + dmxu@cnel.ufl.edu + JoseC.Pr´ ´ıncipe + principe@cnel.ufl.edu + Computational NeuroEngineering Laboratory, Department of Electrical and + Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A. + + + The design of echo state network (ESN) parameters relies on the selec- + tion of the maximum eigenvalue of the linearized system around zero + (spectral radius). However, this procedure does not quantify in a sys- + tematic manner the performance of the ESN in terms of approximation + error. This article presents a functional space approximation framework + to better understand the operation of ESNs and proposes an information- + theoretic metric, the average entropy of echo states, to assess the richness + of the ESN dynamics. Furthermore, it provides an interpretation of the + ESN dynamics rooted in system theory as families of coupled linearized + systems whose poles move according to the input signal dynamics. With + this interpretation, a design methodology for functional approximation + is put forward where ESNs are designed with uniform pole distributions + covering the frequency spectrum to abide by the richness metric, irre- + spective of the spectral radius. A single bias parameter at the ESN input, + adapted with the modeling error, configures the ESN spectral radius to + the input-output joint space. Function approximation examples compare + the proposed design methodology versus the conventional design. + + + 1 Introduction + + Dynamic computational models require the ability to store and access the + time history of their inputs and outputs. The most common dynamic neural + architecture is the time-delay neural network (TDNN) that couples delay + lines with a nonlinear static architecture where all the parameters (weights) + are adapted with the backpropagation algorithm. The conventional delay + line utilizes ideal delay operators, but delay lines with local first-order re- + cursive filters have been proposed by Werbos (1992) and extensively stud- + ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, + 1993). Chains of first-order integrators are interesting because they effec- + tively decrease the number of delays necessary to create time embeddings + + + Neural Computation19, 111–138(2007) C 2006 Massachusetts Institute of Technology 112 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + (Principe, 2001). Recurrent neural networks (RNNs) implement a differ- + ent type of embedding that is largely unexplored. RNNs are perhaps the + most biologically plausible of the artificial neural network (ANN) models + (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), + but are not well understood theoretically (Siegelmann & Sontag, 1991; + Siegelmann, 1993; Kremer, 1995). One of the main practical problems with + RNNs is the difficulty to adapt the system weights. Various algorithms, + such as backpropagation through time (Werbos, 1990) and real-time recur- + rent learning (Williams & Zipser, 1989), have been proposed to train RNNs; + however, these algorithms suffer from computational complexity, resulting + in slow training, complex performance surfaces, the possibility of instabil- + ity, and the decay of gradients through the topology and time (Haykin, + 1998). The problem of decaying gradients has been addressed with spe- + cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). 
Alter- + native second-order training methods based on extended Kalman filtering + (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, + Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp + et al., 1998) provide more reliable performance and have enabled practical + applications in identification and control of dynamical systems (Kechri- + otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, + Kambhampati, & Warwick, 1995). + Recently,twonewrecurrentnetworktopologieshavebeenproposed:the + echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and + the liquid state machine (LSM) by Maass (Maass, Natschlager, & Markram,¨ + 2002). ESNs possess a highly interconnected and recurrent topology of + nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) + and contain information about the history of input and output patterns. + The outputs of these internal PEs (echo states) are fed to a memoryless but + adaptive readout network (generally linear) that produces the network out- + put. The interesting property of ESN is that only the memoryless readout is + trained, whereas the recurrent topology has fixed connection weights. This + reduces the complexity of RNN training to simple linear regression while + preserving a recurrent topology, but obviously places important constraints + in the overall architecture that have not yet been fully studied. Similar ideas + have been explored independently by Maass and formalized in the LSM + architecture. LSMs, although formulated quite generally, are mostly im- + plemented as neural microcircuits of spiking neurons (Maass et al., 2002), + whereas ESNs are dynamical ANN models. Both attempt to model biolog- + ical information processing using similar principles. We focus on the ESN + formulation in this letter. + The echo state condition is defined in terms of the spectral radius (the + largest among the absolute values of the eigenvalues of a matrix, denoted + by·) of the reservoir’s weight matrix (W<1). This condition states + that the dynamics of the ESN is uniquely controlled by the input, and the + effect of the initial states vanishes. The current design of ESN parameters Analysis and Design of Echo State Networks 113 + + + relies on the selection of spectral radius. However, there are many possible + weight matrices with the same spectral radius, and unfortunately they do + not all perform at the same level of mean square error (MSE) for functional + approximation. A similar problem exists in the design of the LSM. LSMs + have been shown to possess universal approximation given the separation + property (SP) for the liquid (reservoir in ESNs) and the approximation + property (AP) for the readout (Maass et al., 2002). SP is quantified by a + kernel-quality measure proposed in Maass, Legenstein, and Bertschinger + (2005) that is based on the rank of a matrix formed by the system states + corresponding to different input signals. The kernel quality is a measure + for the complexity and diversity of nonlinear operations carried out by the + liquid on its input stream in order to boost the classification power of a + subsequent linear decision hyperplane (Maass et al., 2005). A variation of + SP has been proposed in Bertschinger and Natschlager (2004), and it has¨ + been argued that complex calculations can be best carried out by networks + on the boundary between ordered and chaotic dynamics. 
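As a concrete illustration of the echo state condition discussed above, the following minimal NumPy sketch (ours, not the authors'; the sparsity and target radius values are arbitrary assumptions) builds a random reservoir matrix and rescales it so that its spectral radius stays below one:

```python
import numpy as np

def make_reservoir(n_units, sparsity=0.1, spectral_radius=0.9, seed=0):
    """Build a sparse random reservoir matrix and rescale it so that its
    largest absolute eigenvalue (spectral radius) equals `spectral_radius`."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    W *= rng.random((n_units, n_units)) < sparsity    # sparsify the connections
    radius = np.max(np.abs(np.linalg.eigvals(W)))     # current spectral radius
    return W * (spectral_radius / radius)             # enforce the echo state condition

W = make_reservoir(100)
print(np.max(np.abs(np.linalg.eigvals(W))))           # ~0.9
```

Scaling by the ratio of the desired to the measured spectral radius is the usual way to enforce the condition for a randomly generated reservoir; as the letter argues below, however, the spectral radius alone does not determine approximation performance.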
+ Inthisletter,weareinterestedinstudyingtheESNforfunctionalapprox- + imation (filters that map input functionsu(·) of time on output functionsy(·) + of time). We see two major shortcomings with the current ESN approach + that uses echo state condition as a design principle. First, the impact of fixed + reservoir parameters for function approximation means that the informa- + tion about the desired response is conveyed only to the output projection. + This is not optimal, and strategies to select different reservoirs for different + applications have not been devised. Second, imposing a constraint only on + the spectral radius is a weak condition to properly set the parameters of + the reservoir, as experiments show (different randomizations with the same + spectral radius perform differently for the same problem; see Figure 2). + This letter aims to address these two problems by proposing a frame- + work, a metric, and a design principle for ESNs. The framework is a signal + processing interpretation of basis and projections in functional spaces to + describe and understand the ESN architecture. According to this interpre- + tation, the ESN states implement a set of basis functionals (representation + space) constructed dynamically by the input, while the readout simply + projects the desired response onto this representation space. The metric + to describe the richness of the ESN dynamics is an information-theoretic + quantity, the average state entropy (ASE). Entropy measures the amount of + information contained in a given random variable (Shannon, 1948). Here, + the random variable is the instantaneous echo state from which the en- + tropy for the overall state (vector) is estimated. The probability density + function (pdf) in a differential geometric framework should be thought of + as a volume form; that is, in our case, the pdf of the state vector describes + the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) + established information as a coordinate free metric in the state manifold. + Therefore, entropy becomes a global descriptor of information that quanti- + fies the volume of the manifold defined by the random variable. Due to the 114 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + time dependency of the states, the state entropy averaged over time (ASE) + is an appropriate estimate of the volume of the state manifold. + The design principle specifies that one should consider independently + thecorrelationamongthebasisandthespectralradius.Intheabsenceofany + information about the desired response, the ESN states should be designed + with the highest ASE, independent of the spectral radius. We interpret the + ESN dynamics as a combination of time-varying linear systems obtained + from the linearization of the ESN nonlinear PE in a small, local neighbor- + hood of the current state. The design principle means that the poles of the + linearized ESN reservoir should have uniform pole distributions to gener- + ate echo states with the most diverse pole locations (which correspond to + the uniformity of time constants). Effectively, this will create the least cor- + related bases for a given spectral radius, which corresponds to the largest + volume spanned by the basis set. When the designer has no other informa- + tion about the desired response to set the basis, this principle distributes + the system’s degrees of freedom uniformly in space. It approximates for + ESNs the well-known property of orthogonal basis. 
The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space in the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.

2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u_1(n), u_2(n), ..., u_M(n)]^T, of the internal units x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T, and of the output units y(n) = [y_1(n), y_2(n), ..., y_L(n)]^T. The connection weights are given in an N x M weight matrix W^in = (w^in_ij) for connections between the input and the internal PEs, in an N x N matrix W = (w_ij) for connections between the internal PEs, in an L x N matrix W^out = (w^out_ij) for connections from the PEs to the output units, and in an N x L matrix W^back = (w^back_ij) for the connections that project back from the output to the internal PEs (Jaeger, 2001). The activation of the internal PEs (echo state) is updated according to

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),     (2.1)

where f = (f_1, f_2, ..., f_N) are the internal PEs' activation functions. Here, all f_i's are hyperbolic tangent functions, f_i(x) = (e^x - e^-x) / (e^x + e^-x). The output from the readout network is computed according to

y(n+1) = f^out(W^out x(n+1)),     (2.2)

where f^out = (f^out_1, f^out_2, ..., f^out_L) are the output units' nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so f^out is the identity.

[Figure 1: An echo state network (ESN), drawn as an input layer (W^in), a dynamical reservoir (W, with feedback W^back), and a read-out (W^out). The ESN is composed of two parts: a fixed-weight (||W|| < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to produce the output.]

ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights.
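For concreteness, equations 2.1 and 2.2 can be written as a short NumPy routine. This is a minimal sketch under the simplifications used in most of the experiments below (tanh PEs, identity readout nonlinearity, and no output feedback, i.e., W^back = 0); the function names are ours, not the authors'.

```python
import numpy as np

def run_esn(W_in, W, inputs, x0=None):
    """Iterate the echo states x(n+1) = tanh(W_in u(n+1) + W x(n))  (eq. 2.1
    with W_back = 0) and return the state trajectory, one row per time step."""
    N = W.shape[0]
    x = np.zeros(N) if x0 is None else x0
    states = []
    for u in inputs:                      # inputs: array of shape (T, M)
        x = np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.asarray(states)             # shape (T, N)

def readout(W_out, states):
    """Linear readout y(n+1) = W_out x(n+1)  (eq. 2.2 with identity f_out)."""
    return states @ W_out.T
```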
We submit + that the ideas of approximation theory in functional spaces (bases and pro- + jections), so useful in adaptive signal processing (Principe, 2001), should + be utilized to understand the ESN architecture. Leth(u(t)) be a real-valued + function of a real-valued vector + + u(t)=[u1 (t),u2 (t),...,uM (t)] T . + + In functional approximation, the goal is to estimate the behavior ofh(u(t)) + as a combination of simpler functionsϕi (t), called the basis functionals, + such that its approximant,hˆ(u(t)), is given by + + N + hˆ(u(t))= ai ϕi (t). + i=1 + + Here,ai ’s are the projections ofh(u(t)) onto each basis function. One of + the central questions in practical functional approximation is how to choose + the set of bases to approximate a given desired signal. In signal processing, + thechoicenormallygoesforacompletesetoforthogonalbasis,independent + of the input. When the basis set is complete and can be made as large + as required, fixed bases work wonders (e.g., Fourier decompositions). In + neural computing, the basic idea is to derive the set of bases from the + input signal through a multilayered architecture. For instance, consider a + single hidden layer TDNN withNPEs and a linear output. The hidden- + layer PE outputs can be considered a set of nonorthogonal basis functionals + dependent on the input, +   + + ϕi (u(t))=g bij uj (t). + j + + bij ’s are the input layer weights, andgis the PE nonlinearity. The approxi- + mation produced by the TDNN is then + + N + h ˆ(u(t))= ai ϕi (u(t)), (2.3) + i=1 + + whereai ’s are the weights of the output layer. Notice that thebij ’s adapt + the bases and theai ’s adapt the projection in the projection space. Here the + goal is to restrict the number of bases (number of hidden layer PEs) because + their number is coupled with the number of parameters to adapt, which + has an impact on generalization and training set size, for example. Usually, Analysis and Design of Echo State Networks 117 + + + since all of the parameters of the network are adapted, the best basis in the + joint (input and desired signals) space as well as the best projection can be + achieved and represents the optimal solution. The output of the TDNN is + a linear combination of its internal representations, but to achieve a basis + set (even if nonorthogonal), linear independence among theϕi (u(t))’s must + be enforced. Ito, Shah and Pon, and others have shown that this is indeed + the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside + the scope of this article. + The ESN (and the RNN) architecture can also be studied in this frame- + work. The states of equation 2.1 correspond to the basis set, which are + recursively computed from the input, output, and previous states through + Win ,W,andWback . Notice, however, that none of these weight matrices is + adapted, that is, the functional bases in the ESN are uniquely defined by the + input and the initial selection of weights. In a sense, ESNs are trading the + adaptive connections in the RNN hidden layer by a brute force approach + of creating fixed diversified dynamics in the hidden layer. + For an ESN with a linear readout network, the output equation (y(n+ + 1)=Wout x(n+1)) has the same form of equation 2.3, where theϕi ’s and + ai ’s are replaced by the echo states and the readout weights, respectively. + The readout weights are adapted in the training data, which means that the + ESN is able to find the optimal projection in the projection space, just like + the RNN or the TDNN. 
A similar perspective of basis and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands".

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or -1 with equal probabilities, and W^back is set to zero. Input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout. One method to determine the optimal output weight matrix, W^out, in the mean square error (MSE) sense (where MSE is defined by O = (1/2)(d - y)^T (d - y)) is to use the Wiener solution given by Haykin (2001):

W^out = E[x x^T]^{-1} E[x d] ≈ [ (1/N) Σ_n x(n) x(n)^T ]^{-1} [ (1/N) Σ_n x(n) d(n) ].     (2.4)

Here, E[.] denotes the expected value operator, and d denotes the desired signal.

[Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9 x 10^-9 to 8.9 x 10^-5. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.]

Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is 5.9 x 10^-9, whereas the maximum MSE is 8.9 x 10^-5. This experiment demonstrates that a design strategy that is based solely on the spectral radius is not sufficient to specify the system architecture for function approximation.
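The readout training of equation 2.4 and one realization of the Figure 2 experiment can be sketched as follows (a rough reconstruction, not the authors' code; the random seed, the reading of the input as sin(2πn/(10π)), and the small ridge term added for numerical stability are our own choices):

```python
import numpy as np

def train_readout(states, desired, ridge=1e-8):
    """Wiener solution W_out = E[x x^T]^{-1} E[x d]  (eq. 2.4), estimated
    from time averages; the ridge term regularizes the inversion (our addition)."""
    R = states.T @ states / len(states)      # ~ E[x x^T]
    p = states.T @ desired / len(states)     # ~ E[x d]
    return np.linalg.solve(R + ridge * np.eye(R.shape[0]), p)

# One realization of the Figure 2 experiment (values taken from the text).
rng = np.random.default_rng(0)
N, T = 100, 300
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])  # spectral radius ~ 0.88
W_in = rng.choice([1.0, -1.0], size=(N, 1))
n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))     # input signal (as we read it from the text)
d = u ** 7                                   # desired: seventh power of the input
states = run_esn(W_in, W, u[:, None])        # run_esn from the sketch above
w_out = train_readout(states, d)
print(np.mean((states @ w_out - d) ** 2))    # training MSE for this realization
```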
This shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by

x(n+1) = f(W^in u(n+1) + W x(n)).

Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n+1), defined by

J(n+1) = [ f'(net_1(n)) w_11   f'(net_1(n)) w_12   ...   f'(net_1(n)) w_1N
           f'(net_2(n)) w_21   f'(net_2(n)) w_22   ...   f'(net_2(n)) w_2N
           ...                 ...                 ...   ...
           f'(net_N(n)) w_N1   f'(net_N(n)) w_N2   ...   f'(net_N(n)) w_NN ]

       = diag( f'(net_1(n)), f'(net_2(n)), ..., f'(net_N(n)) ) · W = F(n) · W.     (2.5)

Here, net_i(n) is the ith entry of the vector (W^in u(n+1) + W x(n)), and w_ij denotes the (i,j)th entry of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the Jacobian matrix J(n+1).^1 As the amplitude of each PE changes, the local slope changes, and so the poles of the linearized system are time varying, although the parameters of the ESN are fixed.

^1 The transfer function of a linear system x(n+1) = A x(n) + B u(n) is X(z)/U(z) = (zI - A)^{-1} B = Adjoint(zI - A) B / det(zI - A). The poles of the transfer function can be obtained by solving det(zI - A) = 0. The solution corresponds to the eigenvalues of A.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and -0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or -1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to their linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems.
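The linearization of equation 2.5 is straightforward to compute numerically: at every time step the local slopes f'(net_i(n)) form the diagonal matrix F(n), and the eigenvalues of J(n+1) = F(n)W are the instantaneous poles. A minimal sketch, assuming tanh PEs and no output feedback:

```python
import numpy as np

def pole_track(W_in, W, inputs, x0=None):
    """Return the eigenvalues (poles) of the linearized ESN, J(n+1) = F(n) W,
    at every time step.  For tanh PEs, f'(net) = 1 - tanh(net)^2."""
    N = W.shape[0]
    x = np.zeros(N) if x0 is None else x0
    poles = []
    for u in inputs:
        net = W_in @ u + W @ x                    # net_i(n)
        F = np.diag(1.0 - np.tanh(net) ** 2)      # local slopes f'(net_i(n))
        poles.append(np.linalg.eigvals(F @ W))    # poles of the linearized system
        x = np.tanh(net)                          # advance the state (eq. 2.1)
    return np.asarray(poles)                      # shape (T, N), complex-valued
```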
+ Similar results can be obtained using signals of different shapes at the ESN + input. + A key corollary of the above analysis is that the spectral radius of an + ESN can be adjusted using a constant bias signal at the ESN input without + changing the recurrent connection matrix,W. The application of a nonzero + constant bias will move the operating point to regions of the sigmoid func- + tion closer to saturation and always decrease the spectral radius due to the + shape of the nonlinearity. 2 The relevance of bias in terms of overall system + performance has also been discussed in Jaeger (2002b) and Bertschinger + and Natschlager (2004), but here we approach it from a system theory per-¨ + spective and explain its effect on reservoir dynamics. + + 3 Average State Entropy as a Measure of the Richness of ESN Reservoir + + Previous research was aware of the influence of diversity of the recurrent + layer outputs on the overall performance of ESNs and LSMs. Several met- + rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al., + + + 2 AssumeWhas nondegenerate eigenvalues and corresponding linearly independent + eigenvectors. Then consider the eigendecomposition ofW,whereW=PDP −1 ,Pis the + eigenvectormatrixandDisthediagonalmatrixofeigenvalues(Dii )ofW.SinceF(n)andD + are diagonal,J(n+1)=F(n)W=F(n)(PDP −1 )=P(F(n)D)P−1 is the eigendecomposition + ofJ(n+1). Here, each entry ofF(n)D,f (net(n))Dii , is an eigenvalue ofJ. Therefore, + |f (net(n))Dii |≤|Dii |sincef (net i )≤f (0). Analysis and Design of Echo State Networks 121 + + + (A) 1 (B) 1 + D0.8 0.8 + 0.6 C 0.6 + 0.4 0.4 + + + + Imaginary + Amplitude 0.2 0.2 + 0 B E 0 + -0.2 -0.2 + -0.4 -0.4 + -0.6 -0.6 + -0.8 -0.8 + -1 -1 0 20 40 60 80 100 -1 -0.5 Real 0 0.5 1 Time + (C) 1 (D) 1 + 0.8 0.8 + 0.6 0.6 + 0.4 0.4 + + + + Imaginary 0.2 + + + Imaginary 0.2 + 0 0 + -0.2 -0.2 + -0.4 -0.4 + -0.6 -0.6 + -0.8 -0.8 + -1 -1-1 -0.5 Real 0 0.5 1 -1 -0.5 Real 0 0.5 1 + + (E) 1 (F) 1 + 0.8 0.8 + 0.6 0.6 + 0.4 0.4 + + + + Imaginary 0.2 + + + Imaginary 0.2 + 0 0 + -0.2 -0.2 + -0.4 -0.4 + -0.6 -0.6 + -0.8 -0.8 + -1 -1-1 -0.5 Real 0 0.5 1 -1 -0.5 Real 0 0.5 1 + + Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input + goes through a cycle. An ESN with fixed parameters implements a combination + of linear systems with varying pole locations. (A) One cycle of sinusoidal signal + with a period of 100. (B–E) The positions of poles of the linearized systems + when the input values are at B, C, D, and E in Figure 5A. (F) The cumulative + pole locations show the movement of the poles as the input changes. Due to + the varying pole locations, different time constants modulate the richness of + the reservoir of dynamics as a function of input amplitude. Higher-amplitude + signals tend to saturate the nonlinear function and cause the poles to shrink + toward the origin of thez-plane (decreases the spectral radius), which results in + a system with a large stability margin. When the input is close to zero, the poles + ofthelinearizedESNareclosetothemaximalspectralradiuschosen,decreasing + the stability margin. An ESN with more states results in a detailed coverage of + thez-plane dynamics, which illustrates the power of nonlinear systems, when + compared to their linear counterpart. 122 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + 2005). Here, our approach of bases and projections leads to a new metric. + We propose the instantaneous state entropy to quantify the distribution of + instantaneous amplitudes across the ESN states. 
Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous value of the ESN states. If the echo states' instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi's entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable X with a pdf f_X(x) is given by Renyi (1970):

H_γ(X) = (1 / (1 - γ)) log E[f_X^{γ-1}(X)].

Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

f_X(x) = (1/N) Σ_{i=1}^{N} K_σ(x - x_i),

where K_σ is the kernel function with the kernel size σ. Then Renyi's quadratic entropy can be estimated by (Principe et al., 2000)

H_2(X) = -log( (1/N^2) Σ_j Σ_i K_σ(x_j - x_i) ).     (3.1)

The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T of an ESN with N internal PEs. Results will be shown with a gaussian kernel with kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter to quantify the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii and even with the same spectral radius display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states as we would expect, since state entropy is dependent on the input signal that also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states.
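Equation 3.1 with a gaussian kernel translates directly into the following estimator sketch; the kernel size is 0.3 of the standard deviation of the state entries, as stated above, while the gaussian normalization constant is our own choice (it only shifts the entropy estimate by a constant):

```python
import numpy as np

def renyi_quadratic_entropy(samples, kernel_frac=0.3):
    """Estimate H2(X) = -log( (1/N^2) sum_ij K_sigma(x_j - x_i) )  (eq. 3.1)
    with a gaussian kernel of size `kernel_frac` * std of the samples."""
    x = np.asarray(samples, dtype=float)
    sigma = max(kernel_frac * np.std(x), 1e-12)     # guard against a degenerate state vector
    diffs = x[:, None] - x[None, :]
    K = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(np.mean(K))                      # np.mean(K) = (1/N^2) * double sum

def average_state_entropy(states):
    """ASE: instantaneous state entropy (across the N PEs) averaged over time."""
    return np.mean([renyi_quadratic_entropy(x_n) for x_n in states])
```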
+ In practice, to quantify the overall representation ability over time, we will + use ASE, which takes values−0.735,−0.007, and 0.335 for the spectral + radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral + radius, several ASEs are possible. Figure 4C shows ASEs from 50 different + realizations of ESNs with the same spectral radius of 0.5, which means that + ASE is a finer descriptor of the dynamics of the reservoir. Although we + have presented an experiment with sinusoidal signal, similar results are + obtained for other inputs as long as the input dynamic range is properly + selected. + Maximizing ASE means that the diversity of the states over time is the + largest and should provide a basis set that is as uncorrelated as possible. + This condition is unfortunately not a guarantee that the ESN so designed + will perform the best, because the basis set in ESNs is created independent + of the desired response and the application may require a small spectral + radius. However, we maintain that when the desired response is not ac- + cessible for the design of the ESN bases or when the same reservoir is + to be used for a number of problems, the default strategy should be to + maximize the ASE of the state vector. The following section addresses + the design of ESNs with high ASE values and a simple mechanism to + adjust the reservoir dynamics without changing the recurrent connection + weights. + + 4 Designing Echo State Networks + + 4.1 Design of the Echo State Recurrent Connections.According to the + interpretation of ESNs as coupled linear systems, the design of the internal 124 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + connection matrix,W, will be based on the distribution of the poles of the + linearized system around zero state. Our proposal is to design the ESN + such that the linearized system has uniform pole distribution inside the + unit circle of thez-plane. With this design scenario, the system dynamics + will include uniform coverage of time constants arising from the uniform + distribution of the poles, which also decorrelates as much as possible the + basis functionals. This principle was chosen by analogy to the identification + oflinearsystemsusingKautzfilters(Kautz,1954),whichshowsthatthebest + approximation of a given transfer function by a linear system with finite + order is achieved when poles are placed in the neighborhood of the spectral + resonances. When no information is available about the desired response, + we should uniformly spread the poles to anticipate good approximation to + arbitrary mappings. + We again use a maximum entropy principle to distribute the poles inside + the unit circle uniformly. The constraints of a circle as boundary conditions + for discrete linear systems and complex conjugate locations are easy to + include for the pole distribution (Thogula, 2003). The poles are first initial- + ized at random locations; the quadratic Renyi’s entropy is calculated by + equation 3.1, and poles are moved such that the entropy of the new dis- + tribution is increased over iterations (Erdogmus & Principe, 2002). This + method is efficient to find uniform coverage of the unit circle with an arbi- + trary number of poles. The system with the uniform pole locations can be + interpreted using linear system theory. The poles that are close to the unit + circle correspond to many sharp bandpass filters specializing in different + frequency regions, whereas the inner poles realize filters of larger frequency + support. 
Moreover, different orientations (angles) of the poles create filters + of different center frequencies. + Now the problem is to construct an internal weight matrix from the pole + locations (eigenvalues ofW). In principle, we would like to create a sparse + + + + + Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs + ofechostates(100PEs)producedbyESNswithspectralradiusof0.2,0.5,and0.8, + from top to bottom, respectively. The diversity of echo states increases when the + spectral radius increases. Within the dynamic range of the echo states, systems + with smaller spectral radius can generate only uneven representations, while + forW=0.8, outputs of echo states almost uniformly distribute within their + dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1. + Information contained in the echo states is changing over time according to the + input amplitude. Therefore, the richness of representation is controlled by the + input amplitude. Moreover, the value of ASE increases with spectral radius. + (C) ASEs from 50 different realizations of ESNs with the same spectral radius + of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the + reservoir than the spectral radius. Analysis and Design of Echo State Networks 125 + + + (A) Echo States1 + 0 + - 10 20 40 60 801001201401601802001 + 0 + - 10 20 40 60 801001201401601802001 + 0 + - 10 20 40 60 80100120140160180200Time + (B) State Entropy1.5 Spectral Radius = 0.2 + 1 Spectral Radius = 0.5 Spectral Radius = 0.8 + 0.5 + 0 + - 0.5 + - 1 + - 1.5 + - 2 + - 2.50 50 100 150 200Time + (C) Different ASEs for the same spectral radius0.3 + + 0.25 + + 0.2 + + ASE0.15 + + 0.1 + + 0.050 10 20 30 40 50 + Trials 126 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + matrix, so we started with the sparsest matrix (with an inverse), which is + the direct canonical structure given by (Kailath, 1980) + +  −a1 −a2 ···−aN−1 −aN +  10··· 00  W= 01··· 00   . (4.1) + ··· ··· ··· ··· ··· + 00··· 10 + + The characteristic polynomial ofWis + + l(s)=det(sI−W)=sN +a N−11 s +a2 sN−2 +aN + =(s−p1 )(s−p2 )···(s−pN ), (4.2) + + wherepi ’s are the eigenvalues andai ’s are the coefficients of the character- + istic polynomial ofW. Here, we know the pole locations of the linear system + obtained from the linearization of the ESN, so using equation 4.2, we can + obtain the characteristic polynomial and constructWmatrix in the canon- + ical form using equation 4.1. We will call the ESN constructed based on + the uniform pole principle ASE-ESN. All other possible solutions with the + same eigenvalues can be obtained byQ−1 WQ,whereQis any nonsingular + matrix. + To corroborate our hypothesis, we would like to show that the linearized + ESN designed with the recurrent weight matrix having the eigenvalues + uniformly distributed inside the unit circle creates higher ASE values for a + given spectral radius compared to other ESNs with random internal con- + nection weight matrices. We will consider an ESN with 30 states and use our + procedure to create theWmatrix for ASE-ESN for different spectral radii + between [0.1, 0.95]. Similarly, we constructed ESNs with sparse randomW + matrices with different sparseness constraints. This corresponds to a weight + distribution having the values 0,cand−cwith probabilitiesp1 ,(1−p1 )/2, + and (1−p1 )/2, wherep1 defines the sparseness ofWandcis a constant + that takes a specific value depending on the spectral radius. 
We also created + Wmatrices with values uniformly distributed between−1 and 1 (U-ESN) + and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then, + for differentWin matrices, we run the ASE-ESNs with the sinusoidal input + given in section 3 and calculate ASE. Figure 5 compares the ASE values + averaged over 1000 realizations. As observed from the figure, the ASE-ESN + with uniform pole distribution generates higher ASE on average for all + spectral radii compared to ESNs with sparse and uniform random connec- + tions. This approach is indeed conceptually similar to Jeffreys’ maximum + entropy prior (Jeffreys, 1946): it will provide a consistently good response + for the largest class of problems. Concentrating the poles of the linearized Analysis and Design of Echo State Networks 127 + + + 1 + ASEESN + 0.8 UESN + sparseness=0.2 + 0.6 sparseness=0.1 + sparseness=0.07 + 0.4 + + ASE 0.2 + + 0 + + - 0.2 + + - 0.40 0.2 0.4 0.6 0.8 1 + Spectral radius + + Figure 5: Comparison of ASE values obtained for ASE-ESN havingWwith + uniform eigenvalue distribution, ESNs with randomWmatrix, and U-ESN + with uniformly distributed weights between−1 and 1. Randomly generated + weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the + networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole + distribution generates a higher ASE on average for all spectral radii compared + to ESNs with random connections. + + + system in certain regions of the space provides good performance only if + the desired response has energy in this part of the space, as is well known + from the theory of Kautz filters (Kautz, 1954). + + 4.2 Design of the Adaptive Bias.In conventional ESNs, only the out- + put weights are trained, optimizing the projections of the desired response + onto the basis functions (echo states). Since the dynamical reservoir is fixed, + the basis functions are only input dependent. However, since function ap- + proximation is a problem in the joint space of the input and desired signals, + a penalty in performance will be incurred. From the linearization analysis + that shows the crucial importance of the operating point of the PE non- + linearity in defining the echo state dynamics, we propose to use a single + external adaptive bias to adjust the effective spectral radius of an ESN. No- + tice that according to linearization analysis, bias can reduce only spectral + radius. The information for adaptation of bias is the MSE in training, which + modulates the spectral radius of the system with the information derived + from the approximation error. With this simple mechanism, some informa- + tionfromtheinput-outputjointspaceisincorporatedinthedefinitionofthe + projection space of the ESN. The beauty of this method is that the spectral 128 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + radius can be adjusted by a single parameter that is external to the system + without changing reservoir weights. + The training of bias can be easily accomplished. Indeed, since the pa- + rameter space is only one-dimensional, a simple line search method can be + efficiently employed to optimize the bias. Among different line search al- + gorithms, we will use a search that uses Fibonacci numbers in the selection + of points to be evaluated (Wilde, 1964). The Fibonacci search method min- + imizes the maximum number of evaluations needed to reduce the interval + of uncertainty to within the prescribed length. 
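To make the two design steps concrete, the sketch below (our reconstruction, not the authors' code) first builds W in the direct canonical form of equations 4.1 and 4.2 from a chosen set of conjugate-symmetric poles, with uniform sampling over the disk standing in for the entropy-maximizing placement of section 4.1, and then shows how a single candidate bias value could be scored inside such a line search by retraining only the readout:

```python
import numpy as np

def reservoir_from_poles(poles):
    """Direct canonical (companion) form W (eq. 4.1) whose eigenvalues are the
    given poles; the poles must come in complex-conjugate pairs (plus real poles)
    so that the characteristic polynomial (eq. 4.2) has real coefficients."""
    a = np.real(np.poly(poles))       # coefficients [1, a1, ..., aN] of eq. 4.2
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -a[1:]                  # first row: -a1, -a2, ..., -aN
    W[1:, :-1] = np.eye(N - 1)        # shifted identity below the first row
    return W

def sample_uniform_disk_poles(N, radius=0.95, seed=0):
    """Stand-in for the maximum-entropy placement: conjugate pairs drawn
    uniformly (by area) from the disk of the given radius; N assumed even."""
    rng = np.random.default_rng(seed)
    r = radius * np.sqrt(rng.random(N // 2))
    theta = np.pi * rng.random(N // 2)
    p = r * np.exp(1j * theta)
    return np.concatenate([p, np.conj(p)])

def bias_mse(b, W_in, W, inputs, desired):
    """Score one candidate bias b: run x(n+1) = f(W_in u(n+1) + W_in b + W x(n)),
    train only the readout (eq. 2.4), and return the training MSE for the search."""
    states = run_esn(W_in, W, inputs + b)       # run_esn / train_readout are the
    w_out = train_readout(states, desired)      # helpers from the earlier sketches
    return np.mean((states @ w_out - desired) ** 2)
```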
In our problem, a bias value + is picked according to Fibonacci search. For each value of bias, training + data are applied to the ESN, and the echo states are calculated. Then the + corresponding optimal output weights and the objective function (MSE) + are evaluated to pick the next bias value. + Alternatively, gradient-based methods can be utilized to optimize the + bias, due to simplicity and low computational cost. System update equation + with an external bias signal,b,isgivenby + + x(n+1)=f(Win u(n+1)+Win b+Wx(n)). + + The update equation forbis given by + + ∂O(n+1) ∂x(n+1)=−e·Wout × (4.3)∂b ∂b ∂x(n)=−e·Wout × f˙(net n+1 )· W× +Win . (4.4)∂b + + Here,Ois the MSE defined previously. This algorithm may suffer from + similar problems observed in gradient-based methods in recurrent net- + works training. However, we observed that the performance surface is + rather simple. Moreover, since the search parameter is one-dimensional, + the gradient vector can assume only one of the two directions. Hence, im- + precision in the gradient estimation should affect the speed of convergence + but normally not change the correct gradient direction. + + 5 Experiments + + This section presents a variety of experiments in order to test the validity + of the ESN design scheme proposed in the previous section. + + 5.1 Short-TermMemoryCapacity.Thisexperimentcomparestheshort- + term memory (STM) capacity of ESNs with the same spectral radius using + the framework presented in Jaeger (2002a). Consider an ESN with a sin- + gle input signal,u(n), optimally trained with the desired signalu(n−k), + for a given delayk. Denoting the optimal output signalyk (n), thek-delay Analysis and Design of Echo State Networks 129 + + + STM capacity of a network,MC k , is defined as a squared correlation coef- + ficient betweenu(n−k)andyk (n) (Jaeger, 2002a). The STM capacity,MC, + of the network is defined as ∞ MC k=1 k . STM capacity measures how accu- + rately the delayed versions of the input signal are recovered with optimally + trained output units. Jaeger (2002a) has shown that the memory capacity + for recalling an independent and identically distributed (i.i.d.) input by an + Nunit RNN with linear output units is bounded byN. + We use ESNs with 20 PEs and a single input unit. ESNs are driven + by an i.i.d. random input signal,u(n), that is uniformly distributed over + [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions + of the input,u(n−1),...,u(n−40). We used four different ESNs: R-ESN, + U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN + used in Jaeger (2002a) where the entries ofWmatrix are set to 0, 0.47, + −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a + sparse connectivity of 20% and a spectral radius of 0.9. The entries ofWof + U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spec- + tral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed + with uniform poles. BASE-ESN has the same recurrent weight matrix as + ASE-ESN and an adaptive bias at its input. In each ESN, the input weights + are set to 0.1 or−0.1 with equal probability, and direct connections from the + input to the output are allowed, whereasWback is set to0(Jaeger, 2002a). + The echo states are calculated using equation 2.1 for 200 samples of the + input signal, and the first 100 samples corresponding to initial transient + are eliminated. Then the output weight matrix is calculated using equation + 2.4. 
For the BASE-ESN, the bias is trained for each task. All networks are + run with a test input signal, and the corresponding output andMC k are + calculated. Figure 6 shows thek-delay STM capacity (averaged over 100 + trials) of each ESN for delays 1,...,40 for the test signal. The STM capac- + ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, + and 16.90, respectively. First, ESNs with uniform pole distribution (ASE- + ESN and BASE-ESN) haveMCs that are much longer than the randomly + generated ESN given in Jaeger (2002a) in spite of all having the same spec- + tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical + maximumvalueofN=20.AcloserlookatthefigureshowsthatR-ESNper- + forms slightly better than ASE-ESN for delays less than 9. In fact, for small + k, large ASE degrades the performance because the tasks do not need long + memory depth. However, the drawback of high ASE for smallkis recov- + ered in BASE-ESN, which reduces the ASE to the appropriate level required + for the task. Overall, the addition of the bias to the ASE-ESN increases the + STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly + better STM compared to R-ESN with only three different weight values, + although it has more distinct weight values compared to R-ESN. It is also + significant to note that theMCwill be very poor for an ESN with smaller + spectral radius even with an adaptive bias, since the problem requires large + ASE and bias can only reduce ASE. This experiment demonstrates the 130 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + 1 RESN + UESN + ASEESN0.8 BASEESN + + + + + + + Memory Capacity 0.6 + + + 0.4 + + + 0.2 + + + 0 + 0 10 20 30 40 + Delay + + Figure 6: Thek-delay STM capacity of each ESN for delays 1,...,40 computed + using the test signal. The results are averaged over 100 different realizations of + each ESN type with the specifications given in the text for differentWandWin + matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are + 13.09, 13.55, 16.70, and 16.90, respectively. + + + suitability of maximizing ASE in tasks that require a substantial memory + length. + + 5.2 Binary Parity Check.The effect of the adaptive bias was marginal + in the previous experiment since the nature of the problem required large + ASE values. However, there are tasks in which the optimal solutions re- + quire smaller ASE values and smaller spectral radius. Those are the tasks + where the adaptive bias becomes a crucial design parameter in our design + methodology. + Consider an ESN with 100 internal units and a single input unit. ESN is + drivenbyabinaryinputsignal,u(n),thatassumesthevalues0or1.Thegoal + is to train an ESN to generate them-bit parity corresponding to lastmbits + received, wheremis 3,...,8. Similar to the previous experiments, we used + the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly + connected ESN where the entries ofWmatrix are set to 0, 0.06,−0.06 + with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse + connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN + are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 + with equal probability, and direct connections from the input to the output + are allowed whereasWback is set to 0. 
The echo states are calculated using + equation 2.1 for 1000 samples of the input signal, and the first 100 samples + correspondingtotheinitialtransientareeliminated.Thentheoutputweight Analysis and Design of Echo State Networks 131 + + + 350 + + 300 + + 250 + + + + + + + Wrong Decisions 200 + + 150 + + 100 + ASEESN50 RESN + BASEESN0 + 3 4 5 6 7 8 + m + + Figure 7: The number of wrong decisions made by each ESN form=3,...,8 + in the binary parity check problem. The results are averaged over 100 differ- + ent realizations of R-ESN, ASE-ESN, and BASE-ESN for differentWandWin + matrices with the specifications given in the text. The total numbers of wrong + decisions form=3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and + 699. + + + + matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias + is trained for each task. The binary decision is made by a threshold detector + that compares the output of the ESN to 0.5. Figure 7 shows the number of + wrong decisions (averaged over 100 different realizations) made by each + ESN form=3,...,8. + The total numbers of wrong decisions form=3,...,8 of R-ESN, ASE- + ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs + poorly since the nature of the problem requires a short time constant for + fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the + R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. + BASE-ESN performs a lot better than ASE-ESN and slightly better than + the R-ESN since the adaptive bias reduces the spectral radius effectively. + Note that form=7 and 8, the ASE-ESN performs similar to the R-ESN, + since the task requires access to longer input history, which compromises + the need for fast response. Indeed, the bias in the BASE-ESN takes effect + when there are errors (m>4) and when the task benefits from smaller + spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and + 2.7 form=3, 4, 5, and 6, respectively. Form=7 or 8, there is a wide + range of bias values that result in similar MSE values (between 0 and 3). In 132 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + summary, this experiment clearly demonstrates the power of the bias signal + to configure the ESN reservoir according to the mapping task. + + 5.3 System Identification.This section presents a function approxima- + tion task where the aim is to identify a nonlinear dynamical system. The + unknown system is defined by the difference equation + + y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n)), + + where + + f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu). + + The input to the system is chosen to be sin(2πn/25). + We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with + 30 internal units and a single input unit. TheWmatrix of each ESN is scaled + suchthatithasaspectralradiusof0.95.R-ESNisarandomlyconnectedESN + where the entries ofWmatrix are set to 0, 0.35,−0.35 with probabilities 0.8, + 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or−1 with + equal probability, and direct connections from the input to the output are + allowed,whereasWback issetto0.Theoptimaloutputweightsarecalculated + using equation 2.4. The MSE values (averaged over 100 realizations) for R- + ESN and ASE-ESN are 1.23x10 −5 and 1.83x10 −6 , respectively. The addition + of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10 −6 + to 3.27x10 −9 . 
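The plant of this system identification task is easy to reproduce; the following sketch generates the input-output data that would be used to train the readout (the sequence length and the zero initial conditions are our assumptions):

```python
import numpy as np

def nonlinear_plant(u_seq):
    """Unknown system of section 5.3:
    y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n)), with
    f(u) = 0.6 sin(pi u) + 0.3 sin(3 pi u) + 0.1 sin(5 pi u)."""
    f = lambda u: 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) + 0.1 * np.sin(5 * np.pi * u)
    y = np.zeros(len(u_seq) + 1)                 # zero initial conditions (assumption)
    for n in range(1, len(u_seq)):
        y[n + 1] = 0.3 * y[n] + 0.6 * y[n - 1] + f(u_seq[n])
    return y[1:]

n = np.arange(1000)                              # sequence length is our assumption
u = np.sin(2 * np.pi * n / 25)                   # input used in the letter
d = nonlinear_plant(u)                           # desired signal for readout training
```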
6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structures without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, with training limited to the linear output layer. However, the literature did not elucidate how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be the largest to achieve the smallest correlation among the bases and be able to cope with arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method when applied to ESNs built from sigmoidal nonlinearities is that it allows the fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that the design of the ESN using the maximization of ASE with the adaptation of the spectral radius by the bias has provided consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferred to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.
Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems "at the edge of chaos" (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with "critical" parameter values, which correlate with a phase transition between ordered and chaotic regimes.
Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either the output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.
Our emphasis here is mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specify the bases to create the projection space. At the same time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of output feedback in the network is to bias the system to different regions of dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs but also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem.
There are many interesting issues to be researched in this exciting new area. Besides an evaluation tool, ASE may also be utilized to train the ESN's representation layer in an unsupervised fashion. In fact, we can easily adapt, with the SIG (stochastic information gradient) described in Erdogmus, Hild, and Principe (2003), extra weights linking the outputs of recurrent states to maximize output entropy. Output entropy maximization is a well-known metric to create independent components (Bell & Sejnowski, 1995), and here it means that the echo states will become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would fine-tune continuously, in an unsupervised manner, the parameters of the ESN among different inputs. However, it goes against the idea of a fixed ESN reservoir.
The reservoir of recurrent PEs can be thought of as a new form of time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs to the peaks of the input, which is very appealing for signal processing and seems to be used in biology. However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the ESN correlated basis is in the design of the readout.
Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of solving this problem.
Finally, we would like to briefly comment on the implications of these models to neurobiology and computational neuroscience. The work by Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (output of the biological system) needs to be generated, this simple computation to read out the neuronal activity is done. There is an intriguing similarity between the interpretation of the neuronal activity by Pouget and Sejnowski and our interpretation of echo states in ESNs. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally low-pass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESNs with sigmoid PEs.

Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.

References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in the time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997).
Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89).
New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received December 28, 2004; accepted June 1, 2006.
\ No newline at end of file
diff --git a/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt b/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt
new file mode 100644
index 0000000..430d70b
Binary files /dev/null and b/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt differ
diff --git a/Corpus/Neural_Ordinary_Differential_Equations.txt b/Corpus/CORPUS.txt
similarity index 100%
rename from Corpus/Neural_Ordinary_Differential_Equations.txt
rename to Corpus/CORPUS.txt
diff --git a/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt b/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt
new file mode 100644
index 0000000..9906917
--- /dev/null
+++ b/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt
@@ -0,0 +1,391 @@
Channel Pruning for Accelerating Very Deep Neural Networks

Yihui He* (Xi'an Jiaotong University, Xi'an, 710049, China), heyihui@stu.xjtu.edu.cn
Xiangyu Zhang (Megvii Inc., Beijing, 100190, China), zhangxiangyu@megvii.com
Jian Sun (Megvii Inc., Beijing, 100190, China), sunjian@megvii.com
* This work was done when Yihui He was an intern at Megvii Inc.

Abstract

In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves the state-of-the-art results by 5× speed-up along with only 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet, Xception and suffers only 1.4%, 1.0% accuracy loss under 2× speed-up respectively, which is significant.

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).

1. Introduction

Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one.
Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48].
Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, the feature map width (number of channels) could not be reduced, which makes it difficult to decompose the 1×1 convolutional layers favored by modern networks (e.g., GoogleNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.
Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1,48] have focused on imposing a sparse constraint on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have been rarely reported. Inference-time attempts [31,3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited.

Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation that two channels are pruned for feature map B. Thus the corresponding channels of the filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C; kh × kw: kernel size.

In this paper, we propose a new inference-time approach for channel pruning, utilizing redundancy between channels. Inspired by tensor factorization improvement by feature map reconstruction [52], instead of analyzing filter weights [22,31], we fully exploit the redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing the reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternative steps: channel selection and feature map reconstruction.
In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps. Further, we approximate the network layer-by-layer, with the accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]).
For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms previous state-of-the-arts. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4%, 1.0% accuracy loss respectively.

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22].
Optimized implementation based methods [35,47,27,4] accelerate convolution with special convolution algorithms like FFT [47]. Quantization [8,40] reduces floating point computational complexity.
Sparse connection eliminates connections between neurons [17,32,29,15,14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers up to 50×. However, in practice, the actual speed-up may be very related to the implementation.
Tensor factorization [22,28,13,24] decomposes weights into several pieces. [50,10,12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into a 3×3 and 1×1 combination, driven by feature map redundancy.
Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1,48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely exploited.
Inference-time channel pruning is challenging, as reported by previous works [2,39]. Some works [44,34,19] focus on model size compression, which mainly operates on the fully connected layers. For data-free approaches [31,3], results for speed-up ratios (e.g., 5×) have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible for very deep models and large datasets. [31] is sometimes even worse than the naive solution from our observation (Sec. 4.1.1).

3. Approach

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.

3.1. Formulation

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining the outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, the filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction: we need to reconstruct the following feature maps using the selected channels.
Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps.
Formally, to prune a feature map with c channels, we consider applying n × c × kh × kw convolutional filters W on N × c × kh × kw input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and kh, kw are the kernel size. For simple representation, the bias term is not included in our formulation.
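The channel-wise decomposition that this formulation relies on can be checked numerically as below. The shapes and arrays are illustrative, not taken from any particular network; the point is only that the response Y splits into one term per input channel, so zeroing a channel's coefficient removes its slice from both X and W.

```python
import numpy as np

# Illustrative shapes: N sampled positions, c input channels, n output
# channels, and kh x kw kernels, following the notation above.
N, c, n_out, kh, kw = 128, 16, 32, 3, 3
X = np.random.randn(N, c, kh * kw)       # X_i is the N x (kh*kw) slice X[:, i, :]
W = np.random.randn(n_out, c, kh * kw)   # W_i is the n x (kh*kw) slice W[:, i, :]

# Response of the layer at the N sampled positions (an N x n matrix Y).
Y = np.einsum('acq,bcq->ab', X, W)

# The same response decomposes channel-wise as Y = sum_i X_i W_i^T (Eqn. 1
# with all beta_i = 1).  Setting beta_i = 0 simply deletes the i-th term,
# which is why a pruned channel can be cut from both X and W.
Y_channelwise = sum(X[:, i, :] @ W[:, i, :].T for i in range(c))
assert np.allclose(Y, Y_channelwise)
```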
To prune the input channels from c to a desired c' (0 ≤ c' ≤ c), while minimizing the reconstruction error, we formulate our problem as follows:

    arg min_{β,W}  (1/(2N)) || Y − Σ_{i=1}^{c} β_i X_i W_i^T ||_F^2
    subject to  ||β||_0 ≤ c'                                              (1)

||·||_F is the Frobenius norm. X_i is the N × k_h k_w matrix sliced from the i-th channel of the input volumes X, i = 1, ..., c. W_i is the n × k_h k_w filter weights sliced from the i-th channel of W. β is a coefficient vector of length c for channel selection, and β_i is its i-th entry. Notice that, if β_i = 0, X_i is no longer useful and can be safely pruned from the feature map; W_i can also be removed.
Optimization. Solving this ℓ0 minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the ℓ0 to ℓ1 regularization:

    arg min_{β,W}  (1/(2N)) || Y − Σ_{i=1}^{c} β_i X_i W_i^T ||_F^2 + λ ||β||_1
    subject to  ||β||_0 ≤ c',   ∀i ||W_i||_F = 1                          (2)

λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add the constraint ∀i ||W_i||_F = 1 to this formulation, which avoids the trivial solution.
Now we solve this problem in two folds. First, we fix W and solve β for channel selection. Second, we fix β and solve W to reconstruct the error.
(i) The subproblem of β. In this case, W is fixed and we solve β for channel selection. This problem can be solved by LASSO regression [46,5], which is widely used for model selection:

    β̂^LASSO(λ) = arg min_{β}  (1/(2N)) || Y − Σ_{i=1}^{c} β_i Z_i ||_F^2 + λ ||β||_1
    subject to  ||β||_0 ≤ c'                                              (3)

Here Z_i = X_i W_i^T (of size N × n). We ignore the i-th channel if β_i = 0.
(ii) The subproblem of W. In this case, β is fixed. We utilize the selected channels to minimize the reconstruction error. The optimized solution can be found by least squares:

    arg min_{W'}  || Y − X' (W')^T ||_F^2                                  (4)

Here X' = [β_1 X_1  β_2 X_2  ...  β_i X_i  ...  β_c X_c] (of size N × c k_h k_w). W' is the n × c k_h k_w reshaped W, W' = [W_1 W_2 ... W_i ... W_c]. After the result W' is obtained, it is reshaped back to W. Then we assign β_i ← β_i ||W_i||_F and W_i ← W_i / ||W_i||_F, so the constraint ∀i ||W_i||_F = 1 is satisfied.
We alternatively optimize (i) and (ii). In the beginning, W is initialized from the trained model, λ = 0, namely no penalty, and ||β||_0 = c. We gradually increase λ. For each change of λ, we iterate these two steps until ||β||_0 is stable. After ||β||_0 ≤ c' is satisfied, we obtain the final solution W from {β_i W_i}. In practice, we found that the two-step iteration is time consuming, so we apply (i) multiple times until ||β||_0 ≤ c' is satisfied, then apply (ii) just once to obtain the final result. From our observation, this result is comparable with the two-step iteration's. Therefore, in the following experiments, we adopt this approach for efficiency.
Discussion: Some recent works [48,1,17] (though training based) also introduce the ℓ1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduced sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.
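A minimal sketch of one pass of this two-step procedure is given below, using scikit-learn's Lasso (the experiments in Section 4 report using scikit-learn for the solvers). The gradual increase of λ, the renormalization of W, and the stopping rule ||β||_0 ≤ c' are omitted; sklearn's objective matches Eqn. 3 only up to a rescaling of λ, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_and_reconstruct(X, W, Y, lam):
    """One pass of the two-step procedure of Sec. 3.1 (sketch only):
    (i) LASSO on beta with Z_i = X_i W_i^T held fixed, then
    (ii) a least-squares refit (Eqn. 4) on the surviving channels.
    X: (N, c, kh*kw), W: (n, c, kh*kw), Y: (N, n).  Choose lam small enough
    that at least one channel survives."""
    N, c, q = X.shape
    n_out = W.shape[0]
    # (i) channel selection: each column of the design matrix is vec(Z_i).
    Z = np.stack([(X[:, i, :] @ W[:, i, :].T).ravel() for i in range(c)], axis=1)
    beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z, Y.ravel()).coef_
    keep = np.flatnonzero(beta)                    # channels with beta_i != 0
    # (ii) reconstruction: least squares on the kept channels only.
    X_kept = np.concatenate([beta[i] * X[:, i, :] for i in keep], axis=1)
    W_new, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)
    return keep, W_new.T.reshape(n_out, len(keep), q)
```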
3.2. Whole Model Pruning

Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This can be formalized as:

    arg min_{β,W}  (1/(2N)) || Y' − Σ_{i=1}^{c} β_i X_i W_i^T ||_F^2
    subject to  ||β||_0 ≤ c'                                              (5)

Different from Eqn. 1, Y is replaced by Y', which is from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.

3.3. Pruning Multi-Branch Networks

The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1×1, 3×3, 1×1, Fig. 3, left). Layers other than the first and last layer can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows.

Figure 3. Illustration of multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement; c_x denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width could be reduced. As for the last layer, rather than approximating Y2, we try to approximate Y1 + Y2 directly (Sec. 3.3, Last layer of residual branch).

... which need special library implementation support. We do not adopt it in the following experiments.

4. Experiment

We evaluate our approach for the popular VGG Nets [43], ResNet [18], Xception [7] on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].
For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solvers implementation. For channel pruning, we found that it is enough to extract 5000 images, and 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224×224 pixels. We could gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e−5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224×224 and mirror.

4.1. Experiments with VGG-16

VGG-16 [43] is a 16-layer single-path convolutional neural network, with 13 convolutional layers. It is widely used in recognition, detection and segmentation, etc. Single view top-5 accuracy for VGG-16 is 89.9%.

Last layer of residual branch: Shown in Fig. 3, the output layer of a residual block consists of two inputs: feature maps Y1 and Y2 from the shortcut and the residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1, Y2 are the original feature maps before pruning. Y2 could be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 could not be recovered directly. To compensate this error, the optimization goal of the last layer is changed from Y2 to Y1 − Y1' + Y2, which does not change our optimization.
Here,Y′ is the current feature map after 1 previous layers pruned. When pruning, volumes should be 4.1.1 Single Layer Pruning + sampled correspondingly from these two branches. In this subsection, we evaluate single layer acceleration per-First layer of residual branch: Illustrated in formance using our algorithm in Sec.3.1. For better under-Fig.3(left), the input feature map of the residual block standing, we compare our algorithm with two naive chan-could not be pruned, since it is also shared with the short- nel selection strategies.first kselects the firstkchannels.cut branch. In this condition, we could performfeature max responseselects channels based on corresponding fil-map samplingbefore the first convolution to save compu- ters that have high absolute weights sum [31]. For fair com-tation. We still apply our algorithm as Eqn.1. Differently, parison, we obtain the feature map indexes selected by eachwe sample the selected channels on the shared feature maps of them, then perform reconstruction (Sec. 3.1(ii)). We to construct a new input for the later convolution, shown hope that this could demonstrate the importance of channelin Fig.3(right). Computational cost for this operation could selection. Performance is measured by increase of error af-be ignored. More importantly, after introducingfeature map ter a certain layer is pruned without fine-tuning, shown insampling, the convolution is still ”regular”. Fig.4.Filter-wise pruningis another option for the first con- As expected, error increases as speed-up ratio increases.volution on the residual branch. Since the input channels Our approach is consistently better than other approaches inof parameter-free shortcut branch could not be pruned, we different convolutional layers under different speed-up ra-apply our Eqn.1to each filter independently (each fil- tio. Unexpectedly, sometimesmax responseis even worseter chooses its own representative input channels). Under thanfirst k. We argue thatmax responseignores correla-single layer acceleration,filter-wise pruningis more accu- tions between different filters. Filters with large absoluterate than our original one. From our experiments, it im- weight may have strong correlation. Thus selection based proves 0.5% top-5 accuracy for2×ResNet-50 (applied on on filter weights is less meaningful. Correlation on featurethe first layer of each residual branch) without fine-tuning. maps is worth exploiting. We can find that channel selectionHowever, after fine-tuning, there’s no noticeable improve- + ment. In addition, it outputs ”irregular” convolutional lay- 1 http://www.vlfeat.org/matconvnet/pretrained/ + + + + 1392 conv1_1 conv2_1 conv3_1 5 + first k first k first k + max response max response max response 4 ours ours ours + + + + + + increase of error (%) 3 + + 2 + + 1 + + 0 + + conv3_2 conv4_1 conv4_2 5 + first k first k first k + max response max response max response 4 ours ours ours + + + + + + increase of error (%) 3 + + 2 + + 1 + + 01.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 + speed-up ratio speed-up ratio speed-up ratio + Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify + the importance of channel selection refered in Sec.3.1, we considered two naive baselines.first kselects the firstkfeature maps.max + responseselects channels based on absolute sum of corresponding weights filter [31]. Our approach is consistently better (smaller is + better). 
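As a point of comparison, the max response baseline can be implemented roughly as below. This is one plausible reading of the description above (scoring each input channel by the absolute sum of the filter weights that read from it); the exact scoring used in [31] may differ, and the reconstruction step of Sec. 3.1 (ii) is not repeated here.

```python
import numpy as np

def max_response_channels(W, c_keep):
    """One plausible 'max response' channel selection: score each input
    channel of W (shape n x c x kh x kw) by the absolute sum of the filter
    weights that read from it, and keep the c_keep highest-scoring ones."""
    scores = np.abs(W).sum(axis=(0, 2, 3))        # one score per input channel
    return np.sort(np.argsort(scores)[::-1][:c_keep])

W = np.random.randn(64, 32, 3, 3)                 # illustrative filter bank
print(max_response_channels(W, c_keep=16))
```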
+ + + Increase of top-5 error (1-view, baseline 89.9%) periments above, we pruning more aggressive for shal- + Solution 2× 4× 5× lower layers. Remaining channels ratios for shallow lay- + Jaderberget al. [22] ([52]’s impl.) - 9.7 29.7 ers (conv1_xtoconv3_x) and deep layers (conv4_x) + Asym. [52] 0.28 3.84 - is1 : 1.5.conv5_xare not pruned, since they only con- + Filter pruning [31] tribute 9% computation in total and are not redundant.0.8 8.6 14.6(fine-tuned, our impl.) After fine-tuning, we could reach2×speed-up without + Ours (without fine-tune) 2.7 7.9 22.0 losing accuracy. Under4×, we only suffers 1.0% drops. + Ours (fine-tuned) 0 1.0 1.7 Consistent with single layer analysis, our approach outper- + Table 1. Accelerating the VGG-16 model [43] using a speedup forms previous channel pruning approach (Liet al. [31]) by + ratio of2×,4×, or5×(smaller is better). large margin. This is because we fully exploits channel re- + dundancy within feature maps. Compared with tensor fac- + affects reconstruction error a lot. Therefore, it is important torization algorithms, our approach is better than Jaderberg + for channel pruning. et al. [22], without fine-tuning. Though worse than Asym. + Also notice that channel pruning gradually becomes [52], our combined model outperforms its combined Asym. + hard, from shallower to deeper layers. It indicates that shal- 3D (Table2). This may indicate that channel pruning is + lower layers have much more redundancy, which is consis- more challenging than tensor factorization, since removing + tent with [52]. We could prune more aggressively on shal- channels in one layer might dramatically change the input + lower layers in whole model acceleration. of the following layer. However, channel pruning keeps the + original model architecture, do not introduce additional lay- + ers, and the absolute speed-up ratio on GPU is much higher4.1.2 Whole Model Pruning (Table 3). + Shown in Table1, whole model acceleration results under Since our approach exploits a new cardinality, we further + 2×,4×,5×are demonstrated. We adopt whole model combine our channel pruning with spatial factorization [22] + pruning proposed in Sec.3.2. Guided by single layer ex- and channel factorization [52]. Demonstrated in Table2, + + + + 1393 Increase of top-5 error (1-view, 89.9%) scratch. This coincides with architecture design researches + Solution 4× 5× [20,1] that the model could be easier to train if there are + Asym. 3D [52] 0.9 2.0 more channels in shallower layers. However, channel prun- + Asym. 3D (fine-tuned) [52] 0.3 1.0 ing favors shallower layers. + Our 3C 0.7 1.3 For from scratch (uniformed), the filters in each layers + Our 3C (fine-tuned) 0.0 0.3 is reduced by half (eg. reduceconv1_1from 64 to 32). + Table 2. Performance of combined methods on the VGG-16 model We can observe that normal setting networks of the same + [43] using a speed-up ratio of4×or5×. Our 3C solution outper- complexity couldn’t reach same accuracy either. This con- + forms previous approaches (smaller is better). solidates our idea that there’s much redundancy in networks + while training. However, redundancy can be opt out at + inference-time. This maybe an advantage of inference-timeour 3 cardinalities acceleration (spatial, channel factoriza- acceleration approaches over training-based approaches.tion, and channel pruning, denoted by 3C) outperforms pre- Notice that there’s a 0.6% gap between the from scratchvious state-of-the-arts. Asym. 
3D [52] (spatial and chan- model and uniformed one, which indicates that there’s roomnel factorization), factorizes a convolutional layer to three for model exploration. Adopting our approach is muchparts:1×3,3×1,1×1. faster than training a model from scratch, even for a thin-We apply spatial factorization, channel factorization, and ner one. Further researches could alleviate our approach to our channel pruning together sequentially layer-by-layer. do thin model exploring.We fine-tune the accelerated models for 20 epoches, since + they are 3 times deeper than the original ones. After fine- + tuning, our4×model suffers no degradation. Clearly, a 4.1.5 Acceleration for Detection + combination of different acceleration techniques is better VGG-16 is popular among object detection tasks [42,41,than any single one. This indicates that a model is redun- 33]. We evaluate transfer learning ability of our2×/4×dant in each cardinality. pruned VGG-16, for Faster R-CNN [42] object detections. + PASCAL VOC 2007 object detection benchmark [11] con- + 4.1.3 Comparisons of Absolute Performance tains 5k trainval images and 5k test images. The per- + formance is evaluated by mean Average Precision (mAP).We further evaluate absolute performance of acceleration In our experiments, we first perform channel pruning foron GPU. Results in Table3are obtained under Caffe [23], VGG-16 on the ImageNet. Then we use the pruned modelCUDA8 [37] and cuDNN5 [6], with a mini-batch of 32 as the pre-trained model for Faster R-CNN.on a GPU (GeForce GTX TITAN X). Results are averaged The actual running time of Faster R-CNN is 220ms / im-from 50 runs. Tensor factorization approaches decompose age. The convolutional layers contributes about 64%. Weweights into too many pieces, which heavily increase over- got actual time of 94ms for4×acceleration. From Table5,head. They could not gain much absolute speed-up. Though we observe 0.4% mAP drops of our2×model, which is notour approach also encountered performance decadence, it harmful for practice consideration.generalizes better on GPU than other approaches. Our re- + sults for tensor factorization differ from previous research 4.2. Experiments with Residual Architecture Nets + [52,22], maybe because current library and hardware pre- For Multi-path networks [45,18,7], we further explorefer single large convolution instead of several small ones. the popular ResNet [18] and latest Xception [7], on Ima- + geNet and CIFAR-10. Pruning residual architecture nets is + 4.1.4 Comparisons with Training from Scratch more challenging. These networks are designed for both ef- + ficiency and high accuracy. Tensor factorization algorithmsThough training a compact model from scratch is time- [52,22] have difficult to accelerate these model. Spatially,consuming (usually 120 epoches), it worths comparing our 1×1convolution is favored, which could hardly be factor-approach and from scratch counterparts. To be fair, we eval- ized.uated both from scratch counterpart, and normal setting net- + work that has the same computational complexity and same 4.2.1 ResNet Pruningarchitecture. + Shown in Table4, we observed that it’s difficult for ResNet complexity uniformly drops on each residual block. + from scratch counterparts to reach competitive accuracy. Guided by single layer experiments (Sec. 4.1.1), we still + our model outperforms from scratch one. Our approach prefer reducing shallower layers heavier than deeper ones. 
+ successfully picks out informative channels and constructs Following similar setting as Filter pruning [31], we + highly compact models. We can safely draw the conclu- keep 70% channels for sensitive residual blocks (res5 + sion that the same model is difficult to be obtained from and blocks close to the position where spatial size + + + + 1394 Model Solution Increased err. GPU time/ms + VGG-16 - 0 8.144 + Jaderberget al. [22] ([52]’s impl.) 9.7 8.051(1.01×) + Asym. [52] 3.8 5.244(1.55×) + VGG-16 (4×) Asym. 3D [52] 0.9 8.503(0.96×) + Asym. 3D (fine-tuned) [52] 0.3 8.503(0.96×) + Ours (fine-tuned) 1.0 3.264 (2.50×) + Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is + better). + + + Original (acc. 89.9%) Top-5 err. Increased err. Solution Increased err. + From scratch 11.9 1.8 Filter pruning [31] (our impl.) 92.8 + From scratch (uniformed) 12.5 2.4 Filter pruning [31] 4.3Ours 18.0 7.9 (fine-tuned, our impl.) + Ours (fine-tuned) 11.1 1.0 Ours 2.9 + Table 4. Comparisons with training from scratch, under4×accel- Ours (fine-tuned) 1.0 + eration. Our fine-tuned model outperforms scratch trained coun- Table 7. Comparisons for Xception-50, under2×acceleration ra- + terparts (smaller is better). tio. The baseline network’s top-5 accuracy is 92.8%. Our ap- + proach outperforms previous approaches. Most structured sim- + plification methods are not effective on Xception architecture + Speedup mAP ∆mAP (smaller is better). + Baseline 68.7 - + 2× 68.3 0.4 + 4× 66.9 1.8 4.2.2 Xception Pruning + Table 5.2×,4×acceleration for Faster R-CNN detection. + Since computational complexity becomes important in + model design, separable convolution has been payed muchSolution Increased err. attention [49,7]. Xception [7] is already spatially optimizedOurs 8.0 and tensor factorization on1×1convolutional layer is de-Ours 4.0 structive. Thanks to our approach, it could still be acceler-(enhanced) ated with graceful degradation. For the ease of comparison,Ours 1.4 we adopt Xception convolution on ResNet-50, denoted by(enhanced, fine-tuned) Xception-50. Based on ResNet-50, we swap all convolu- Table 6.2×acceleration for ResNet-50 on ImageNet, the base- tional layers with spatial conv blocks. To keep the same line network’s top-5 accuracy is 92.2% (one view). We improve computational complexity, we increase the input channels performance with multi-branch enhancement (Sec.3.3,smaller is of allbranch2blayers by2×. The baseline Xception- better). 50 has a top-5 accuracy of 92.8% and complexity of 4450 + MFLOPs. + We apply multi-branch variants of our approach as de-change, e.g. res3a,res3d). As for other blocks, scribed in Sec.3.3, and adopt the same pruning ratio settingwe keep 30% channels. With multi-branch enhance- as ResNet in previous section. Maybe because of Xcep-ment, we prunebranch2amore aggressively within tion block is unstable, Batch Normalization layers must beeach residual block. The remaining channels ratios for maintained during pruning. Otherwise it becomes nontrivialbranch2a,branch2b,branch2cis2 : 4 : 3(e.g., to fine-tune the pruned model.Given 30%, we keep 40%, 80%, 60% respectively). Shown in Table7, after fine-tuning, we only suffer1.0% + We evaluate performance of multi-branch variants of our increase of error under2×. Filter pruning [31] could also + approach (Sec. 3.3). From Table6, we improve 4.0% apply on Xception, though it is designed for small speed- + with our multi-branch enhancement. This is because we up ratio. 
Without fine-tuning, top-5 error is 100%. After + accounted the accumulated error from shortcut connection training 20 epochs which is like training from scratch, in- + which could broadcast to every layer after it. And the large creased error reach 4.3%. Our results for Xception-50 are + input feature map width at the entry of each residual block not as graceful as results for VGG-16, since modern net- + is well reduced by ourfeature map sampling. works tend to have less redundancy by design. + + + + 1395 Solution Increased err. [4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: + Filter pruning [31] Lookup-based convolutional neural network.arXiv preprint 1.3(fine-tuned, our impl.) arXiv:1611.06473, 2016.2 + From scratch 1.9 [5] L. Breiman. Better subset regression using the nonnegative + Ours 2.0 garrote.Technometrics, 37(4):373–384, 1995.3 + Ours (fine-tuned) 1.0 [6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, + Table 8.2×speed-up comparisons for ResNet-56 on CIFAR-10, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives + the baseline accuracy is 92.8% (one view). We outperforms previ- for deep learning.CoRR, abs/1410.0759, 2014.6 + ous approaches and scratch trained counterpart (smaller is better). [7] F. Chollet. Xception: Deep learning with depthwise separa- + ble convolutions.arXiv preprint arXiv:1610.02357, 2016. 1, + 2,3,4,6,7 + 4.2.3 Experiments on CIFAR-10 [8] M. Courbariaux and Y. Bengio. Binarynet: Training deep + neural networks with weights and activations constrained to+ + Even though our approach is designed for large datasets, it 1 or-1.arXiv preprint arXiv:1602.02830, 2016.1,2 + could generalize well on small datasets. We perform ex- [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- + periments on CIFAR-10 dataset [25], which is favored by Fei. Imagenet: A large-scale hierarchical image database. + many acceleration researches. It consists of 50k images for InComputer Vision and Pattern Recognition, 2009. CVPR + training and 10k for testing in 10 classes. 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4 + We reproduce ResNet-56, which has accuracy of 92.8% [10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer- + (Serve as a reference, the official ResNet-56 [18] has ac- gus. Exploiting linear structure within convolutional net- + curacy of 93.0%). For2×acceleration, we follow similar works for efficient evaluation. InAdvances in Neural In- + formation Processing Systems, pages 1269–1277, 2014.2 setting as Sec.4.2.1(keep the final stage unchanged, where + the spatial size is8×8). Shown in Table8, our approach [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, + and A. Zisserman. The PASCAL Visual Object Classes is competitive with scratch trained one, without fine-tuning, Challenge 2007 (VOC2007) Results. http://www.pascal- under2×speed-up. After fine-tuning, our result is signif- network.org/challenges/VOC/voc2007/workshop/index.html. icantly better than Filter pruning [31] and scratch trained 4,6 + one. [12] R. Girshick. Fast r-cnn. InProceedings of the IEEE Inter- + national Conference on Computer Vision, pages 1440–1448, + 5. Conclusion 2015.2 + [13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compress- + To conclude, current deep CNNs are accurate with high ing deep convolutional networks using vector quantization. + inference costs. In this paper, we have presented an arXiv preprint arXiv:1412.6115, 2014.2 + inference-time channel pruning method for very deep net- [14] Y. Guo, A. Yao, and Y. Chen. 
Dynamic network surgery for + works. The reduced CNNs are inference efficient networks efficient dnns. InAdvances In Neural Information Process- + while maintaining accuracy, and only require off-the-shelf ing Systems, pages 1379–1387, 2016.2 + libraries. Compelling speed-ups and accuracy are demon- [15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, + strated for both VGG Net and ResNet-like networks on Im- and W. J. Dally. Eie: efficient inference engine on com- + ageNet, CIFAR-10 and PASCAL VOC. pressed deep neural network. InProceedings of the 43rd + International Symposium on Computer Architecture, pages In the future, we plan to involve our approaches into 243–254. IEEE Press, 2016. 2 training time, instead of inference time only, which may [16] S. Han, H. Mao, and W. J. Dally. Deep compression: Com- also accelerate training procedure. pressing deep neural network with pruning, trained quantiza- + tion and huffman coding.CoRR, abs/1510.00149, 2, 2015. + References 2 + [17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights + [1] J. M. Alvarez and M. Salzmann. Learning the number of and connections for efficient neural network. InAdvances in + neurons in deep networks. InAdvances in Neural Informa- Neural Information Processing Systems, pages 1135–1143, + tion Processing Systems, pages 2262–2270, 2016. 1,2,3, 2015.1,2,3 + 6 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- + [2] S. Anwar, K. Hwang, and W. Sung. Structured prun- ing for image recognition.arXiv preprint arXiv:1512.03385, + ing of deep convolutional neural networks. arXiv preprint 2015. 1,2,3,4,6,8 + arXiv:1512.08571, 2015.2 [19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trim- + [3] S. Anwar and W. Sung. Compact deep convolutional ming: A data-driven neuron pruning approach towards effi- + neural networks with coarse pruning. arXiv preprint cient deep architectures. arXiv preprint arXiv:1607.03250, + arXiv:1610.09639, 2016.1,2 2016.2 + + + + + 1396 [20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, [38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, + A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, + Speed/accuracy trade-offs for modern convolutional object V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, + detectors.arXiv preprint arXiv:1611.10012, 2016. 6 M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma- + [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating chine learning in Python.Journal of Machine Learning Re- + deep network training by reducing internal covariate shift. search, 12:2825–2830, 2011.4 + arXiv preprint arXiv:1502.03167, 2015.4 [39] A. Polyak and L. Wolf. Channel-level acceleration of deep + [22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up face representations.IEEE Access, 3:2163–2175, 2015.2 + convolutional neural networks with low rank expansions. [40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor- + arXiv preprint arXiv:1405.3866, 2014.1,2,5,6,7 net: Imagenet classification using binary convolutional neu- + [23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- ral networks. InEuropean Conference on Computer Vision, + shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- pages 525–542. Springer, 2016. 2 + tional architecture for fast feature embedding.arXiv preprint [41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. arXiv:1408.5093, 2014. 
4,6 You only look once: Unified, real-time object detection. + [24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. CoRR, abs/1506.02640, 2015. 6 + Compression of deep convolutional neural networks for [42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN:fast and low power mobile applications. arXiv preprint towards real-time object detection with region proposal net-arXiv:1511.06530, 2015.2 works.CoRR, abs/1506.01497, 2015.6 [25] A. Krizhevsky and G. Hinton. Learning multiple layers of [43] K. Simonyan and A. Zisserman. Very deep convolutionalfeatures from tiny images. 2009.4,8 networks for large-scale image recognition. arXiv preprint[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet arXiv:1409.1556, 2014.3,4,5,6classification with deep convolutional neural networks. In [44] S. Srinivas and R. V. Babu. Data-free parameter pruningAdvances in neural information processing systems, pages for deep neural networks.arXiv preprint arXiv:1507.06149,1097–1105, 2012.2,3 2015.2[27] A. Lavin. Fast algorithms for convolutional neural networks. [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,arXiv preprint arXiv:1509.09308, 2015.2 D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and Going deeper with convolutions. InProceedings of the IEEEV. Lempitsky. Speeding-up convolutional neural net- Conference on Computer Vision and Pattern Recognition,works using fine-tuned cp-decomposition. arXiv preprint pages 1–9, 2015.1,3,6arXiv:1412.6553, 2014.2 [46] R. Tibshirani. Regression shrinkage and selection via the[29] V. Lebedev and V. Lempitsky. Fast convnets using group- lasso. Journal of the Royal Statistical Society. Series Bwise brain damage.arXiv preprint arXiv:1506.02515, 2015. (Methodological), pages 267–288, 1996.32 [47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Pi-[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient- antino, and Y. LeCun. Fast convolutional nets withbased learning applied to document recognition. Proceed- fbfft: A gpu performance evaluation. arXiv preprintings of the IEEE, 86(11):2278–2324, 1998.2,3 arXiv:1412.7580, 2014.1,2[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. + Graf. Pruning filters for efficient convnets. arXiv preprint [48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning + arXiv:1608.08710, 2016.1,2,4,5,6,7,8 structured sparsity in deep neural networks. InAdvances In + [32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Neural Information Processing Systems, pages 2074–2082, + Sparse convolutional neural networks. InProceedings of the 2016.1,2,3 + IEEE Conference on Computer Vision and Pattern Recogni- [49] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated´ + tion, pages 806–814, 2015.2 residual transformations for deep neural networks. arXiv + [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, preprint arXiv:1611.05431, 2016.7 + C. Fu, and A. C. Berg. SSD: single shot multibox detector. [50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural + CoRR, abs/1512.02325, 2015.6 network acoustic models with singular value decomposition. + [34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint InINTERSPEECH, pages 2365–2369, 2013.2 + arXiv:1511.05077, 2015.2 [51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy- + [35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training efficient convolutional neural networks using energy-aware + of convolutional networks through ffts. 
\ No newline at end of file
diff --git a/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt b/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt
new file mode 100644
index 0000000..a4ec71b
Binary files /dev/null and b/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt differ
diff --git a/Corpus/DEEP DOUBLE DESCENT - Preetum Nakkiran.txt b/Corpus/DEEP DOUBLE DESCENT - Preetum Nakkiran.txt
new file mode 100644
index 0000000..282e671
Binary files /dev/null and b/Corpus/DEEP DOUBLE DESCENT - Preetum Nakkiran.txt differ
diff --git a/Corpus/Deep Residual Learning for Image Recognition.txt b/Corpus/Deep Residual Learning for Image Recognition.txt
new file mode 100644
index 0000000..6cb144d
Binary files /dev/null and b/Corpus/Deep Residual Learning for Image Recognition.txt differ
diff --git a/Corpus/Direct Feedback Alignment Scales toModern Deep Learning Tasks and Architectures.txt b/Corpus/Direct Feedback Alignment Scales toModern Deep Learning Tasks and Architectures.txt
new file mode 100644
index 0000000..8b3ad5c
--- /dev/null
+++ b/Corpus/Direct Feedback Alignment Scales toModern Deep Learning Tasks and Architectures.txt
@@ -0,0 +1,1161 @@

 Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures

 Julien Launay 1,2   Iacopo Poli 1   François Boniface 1   Florent Krzakala 1,2

 1 LightOn   2 École Normale Supérieure

 arXiv:2006.12878v1 [stat.ML] 23 Jun 2020
 {julien, iacopo, francois, florent}@lighton.ai

 Abstract

 Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.

 1 Introduction

 While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements, it is not without pitfalls.
For one, its weight updates are non-local and rely on upstream layers. Thus, + they cannot be easily parallelized [3], incurring important memory and compute costs. Moreover, + its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the + weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward + path: this is implausible in biological brains, and known as the weight transport problem [6]. + Consequently, alternative training algorithms have been developed. Some of these algorithms are + explicitly biologically inspired [7–13], while others focus on making better use of available compute + resources [3,14–19]. Despite these enticing characteristics, none has been widely adopted, as they + are often demonstrated on a limited set of tasks. Moreover, as assessed in [20], their performance on + challenging datasets under the constraint of synaptic asymmetry is disappointing. + We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment + (DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural + view synthesis and recommender systems, to geometric learning with graph convolutions, and natural + language processing with Transformers. Our results define new standards for learning without weight + transport and show that challenging tasks can indeed be tackled under synaptic asymmetry. + All code needed to reproduce our experiments is available athttps://github.com/lightonai/ + dfa-scales-to-modern-deep-learning. + + + + + + + Preprint. Under review. 1.1 Related work + + Training a neural network is a credit assignment problem: an update is derived for each parameter + from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21]. + + Biologically motivated methods Finding a training method applicable under the constraints of + biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur + [22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic + asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10–12] train distinct + feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the + transpose of the forward weights used in the backward pass by a random matrix. Throughout training, + the forward weights learn toalignwith the arbitrary backward weights, eventually approximating BP. + + Beyond biological considerations As deep learning models grow bigger, large-scale distributed + training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer + by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass, + updates must only depend on local quantities. Unsupervised learning is naturally suited for this, + as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly, + synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES) + [16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA + and directly projects a global error to each layer. A shared feedback path is still needed, but it only + depends on a simple random projection operation. + + Performance of alternative methods Local training methods are successful in unsupervised learn- + ing [18]. 
Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet [14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment techniques to perform well on challenging datasets, some form of weight transport is necessary: either by explicitly sharing sign information [24–26], or by introducing dedicated phases of alignment for the forward and backward weights where some information is shared [27]. To the best of our knowledge, no method compatible with the weight transport problem has ever been demonstrated on challenging tasks.

 1.2 Motivations and contributions

 We focus on DFA, a compromise between biological and computational considerations. Notably, DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly preventing learning in demanding settings. Moreover, it allows for asynchronous weight updates, and puts a single operation at the center of the training stage. This enables new classes of training co-processors [28, 29], leveraging dedicated hardware to perform the random projection.

 Extensive survey  We apply DFA in a large variety of settings matching current trends in machine learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly different domains, across eight tasks, and with eleven different architectures. This constitutes a survey of unprecedented scale for an alternative training method, and makes a strong case for the possibility of learning without weight transport in demanding scenarios.

 Challenging settings  We demonstrate the ability of DFA to tackle challenging tasks. We successfully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale (section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, which have only recently been successfully tackled with deep learning.

 Modern architectures  We prove that the previously established failure of DFA to train convolutions does not generalize. By evaluating performance metrics, comparing against a shallow baseline, measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in layers involving graph convolutions and attention. This significantly broadens the applicability of DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.

 2 Methods

 Forward pass  In a fully connected network, at layer i out of N, neglecting its biases, with W_i its weight matrix, f_i its non-linearity, and h_i its activations, the forward pass is:

 \forall i \in [1, \dots, N]: \quad a_i = W_i h_{i-1}, \quad h_i = f_i(a_i). \quad (1)

 h_0 = X is the input data, and h_N = f(a_N) = ŷ are the predictions. A task-specific cost function L(ŷ, y) is computed to quantify the quality of the predictions with respect to the targets y.

 Backward pass with BP  The weight updates are computed by backpropagation of the error vector. Using the chain-rule of derivatives, each neuron is updated based on its contribution to the cost function. Leaving aside the specifics of the optimizer used, the equation for the weight updates is:

 \delta W_i = -\frac{\partial \mathcal{L}}{\partial W_i} = -\left[ \left( W_{i+1}^{T} \, \delta a_{i+1} \right) \odot f_i'(a_i) \right] h_{i-1}^{T}, \qquad \delta a_i = \frac{\partial \mathcal{L}}{\partial a_i}. \quad (2)

 Backward pass with DFA  The gradient signal W_{i+1}^T δa_{i+1} of the (i+1)-th layer violates synaptic asymmetry. DFA replaces it with a random projection of the topmost derivative of the loss, δa_y. For common classification and regression losses such as the mean squared error or the negative log likelihood, this corresponds to a random projection of the global error e = ŷ − y. With B_i a fixed random matrix of appropriate shape drawn at initialization for each layer:

 \delta W_i = -\left[ \left( B_i \, \delta a_y \right) \odot f_i'(a_i) \right] h_{i-1}^{T}, \qquad \delta a_y = \frac{\partial \mathcal{L}}{\partial a_y}. \quad (3)
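 As a concrete illustration of Eqs. (1) to (3), the following is a minimal NumPy sketch of DFA training for a toy fully connected network with tanh hidden units and a mean squared error loss. The layer widths, learning rate, and variable names are purely illustrative and are not taken from the actual implementation, which is the one released in the repository linked in the introduction.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [8, 16, 16, 4]                                   # illustrative layer widths
    # Forward weights W_i and fixed random feedback matrices B_i (Eq. 3), drawn once.
    W = [rng.normal(0.0, 0.1, size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
    B = [rng.normal(0.0, 0.1, size=(n, sizes[-1])) for n in sizes[1:-1]]

    def dfa_step(x, y, lr=0.01):
        # Forward pass (Eq. 1): a_i = W_i h_{i-1}, h_i = f_i(a_i); linear output layer.
        h, a = [x], []
        for i, Wi in enumerate(W):
            a.append(Wi @ h[-1])
            h.append(a[-1] if i == len(W) - 1 else np.tanh(a[-1]))
        e = h[-1] - y                                        # global error, i.e. delta a_y for the MSE loss
        # DFA updates (Eq. 3): each hidden layer gets its own random projection of e.
        for i in range(len(W) - 1):
            delta_a = (B[i] @ e) * (1.0 - np.tanh(a[i]) ** 2)  # (B_i delta a_y) * f_i'(a_i)
            W[i] -= lr * np.outer(delta_a, h[i])
        # The output layer receives the true error directly; no weight transport is needed there.
        W[-1] -= lr * np.outer(e, h[-2])
        return 0.5 * float(e @ e)

    x, y = rng.normal(size=8), rng.normal(size=4)
    for _ in range(200):
        loss = dfa_step(x, y)                                # the loss shrinks on this toy fit

 Note that, contrary to BP, the per-layer updates above depend only on the local activations and on the shared global error, which is what allows them to be computed in parallel.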
3 Experiments

 We study the applicability of DFA to a diverse set of applications requiring state-of-the-art architectures. We start with fully connected networks, where DFA has already been demonstrated, and address new challenging settings. We then investigate geometric learning: we apply DFA to graph neural networks in classification tasks on citation networks, as well as graph autoencoders. These architectures feature graph convolutions and attention layers. Finally, we use DFA to train a transformer-based Natural Language Processing (NLP) model on a dataset of more than 100 million tokens.

 3.1 Fully connected architectures

 DFA has been successful at training fully connected architectures, with performance on-par with backpropagation [19,20]. However, only computer vision tasks have been considered, where fully connected networks considerably underperform their convolutional counterpart. Here, we focus on tasks where fully connected architectures are state-of-the-art. Moreover, the architectures considered are deeper and more complex than those necessary to solve a simple task like MNIST.

 3.1.1 Neural view synthesis with Neural Radiance Fields

 The most recent state-of-the-art neural view synthesis methods are based on large fully connected networks: this is an ideal setting for a first evaluation of DFA on a challenging task.

 Background  There has been growing interest in methods capable of synthesising novel renders of a 3D scene using a dataset of past renders. The network is trained to learn an inner representation of the scene, and a classical rendering system can then query the model to generate novel views. With robust enough methods, real-world scenes can also be learned from a set of pictures.

 Until recently, most successful neural view synthesis methods were based on sampled volumetric representations [30–32]. In this context, Convolutional Neural Networks (CNNs) can be used to smooth out the discrete sampling of 3D space [33,34]. However, these methods scale poorly to higher resolutions, as they still require finer and finer sampling. Conversely, alternative schemes based on a continuous volume representation have succeeded in generating high-quality renders [35], even featuring complex phenomena such as view-dependent scattering [36]. These schemes make point-wise predictions, and use fully connected neural networks to encode the scene.

 Figure 1: Comparisons of NeRF-DFA with state-of-the-art methods trained with BP on the most challenging synthetic and real-world scenes. While NeRF-DFA generates renders of lower quality, they maintain multi-view consistency and exhibit no geometric artefacts. BP results from [36].

 Setting  We employ Neural Radiance Fields (NeRF) [36], the state-of-the-art for neural view synthesis.
NeRF represents scenes as a continuous 5D function of space–three spatial coordinates, + two viewing angles–and outputs a point-wise RGB radiance and opacity. A ray-casting renderer can + then query the network to generate arbitrary views of the scene. The network modeling the continuous + function is 10 layers deep. Two identical networks are trained: thecoarsenetwork predictions inform + the renderer about the spatial coordinates that thefinenetwork should preferentially evaluate to avoid + empty space and occluded regions. + + Results We report quantitative results of training NeRF with DFA in Table 1. Neural view synthesis + methods are often better evaluated qualitatively: we showcase some renders in Figure 1. + On a dataset of renders featuring complex scenes with non-Lambertian materials (NeRF-Synthetic + [36]), NeRF-DFA outperforms two previous fine-tuned state-of-the-art methods–Scene Representation + Networks (SRN) [35] and Local Light Field Fusion (LLFF) [32]–and nearly matches the performance + of Neural Volumes (NV) [34]. While DFA underperforms alternative methods trained with BP on + the real world view dataset (LLFF-Real [32]), its performance remains significant: real world view + synthesis is a challenging tasks, and this level of PSNR indicates that learning is indeed happening. + In particular, we find that NeRF-DFA retains the key characteristics of NeRF-BP: it can render view- + dependant effects, and is multi-view consistent. The last point is an especially important achievement, + and most visible in videos, as it is a challenge for most algorithms [30–32,35]. The main drawback + of NeRF-DFA appears to be a seemingly lower render definition. The NeRF architecture has not + + + Table 1: Peak Signal to Noise Ratio (PSNR, higher is better) of neural view synthesis methods + trained with backpropagation against NeRF trained with DFA. Even when trained with DFA, NeRF + outperforms two state-of-the-art methods on a synthetic dataset (NeRF-Synthetic), and achieves fair + performance on a challenging real world views datasets (LLFF-Real). BP results from [36]. + + NV SRN LLFF NeRF + BP BP BP BP DFA + NeRF-Synthetic 26.05 22.26 24.88 31.01 25.41 + LLFF-Real / 22.84 24.13 26.50 20.77 + + + 4 been fine-tuned to achieve these results: DFA works out-of-the-box on this advanced method. Future + research focusing on architectural changes to NeRF could improve performance with DFA; some + preliminary results are included in the supplementary material. + + 3.1.2 Click-through rate prediction with recommender systems + We have demonstrated that DFA can train large fully connected networks on the difficult task of neural + view synthesis. We now seek to use DFA in more complex heterogeneous architectures, combining + the use of fully connected networks with other machine learning methods.Recommender systemsare + an ideal application for such considerations. + + Background Recommender systems are used to model the behavior of users and predict future + interactions. In particular, in the context of click-through rate (CTR) prediction, these systems model + the probability of a user clicking on a given item. Building recommender systems is hard [37]: their + input is high-dimensional and sparse, and the model must learn to extract high-order combinatorial + features from the data. Moreover, they need to do so efficiently, as they are used to make millions of + predictions and the training data may contain billions of examples. 
+ Factorization Machines (FM) [38] use inner-products of latent vectors between features to extract + pairwise feature interactions. They constitute an excellent baseline for shallow recommender systems, + but fail to efficiently transcribe higher-level features. To avoid extensive feature engineering, it has + been suggested that deep learning can be used in conjunction with wide shallow models to extract + these higher-level features [39]. In production, these systems are regularly retrained on massive + datasets: the speedup allowed by backward unlocking in DFA is thus of particular interest. + + Setting Deep Factorization Machines (DeepFM) [40] combine FM and a deep fully connected + neural network, which we train with DFA. The input embedding is still trained directly via gradient + descent, as weight transport is not necessary to backpropagate through the FM. Deep & Cross + Networks (DCN) [41] replace the FM with a Cross Network, a deep architecture without non- + linearities capable of extracting high-degree interactions across features. We train the fully connected + network, the deep cross network, and the embeddings with DFA. Finally, Adaptative Factorization + Network (AFN) [42] uses Logarithmic Neural Networks [43] to enhance the representational power + of its deep component. We evaluate these methods on the Criteo dataset [44], which features nearly + 46 million samples of one million sparse features. This is a difficult task, where performance + improvements of the AUC on the0.001-levelcan enhance CTR significantly [39]. + + Results Performance metrics are reported in Table 2. To obtain these results, a simple hyperpa- + rameter grid search over optimization and regularization parameters was performed for BP and DFA + independently. DFA successfully trains all methods above the FM baseline, and in fact matches BP + performance in both DeepFM and AFN. Because of their complexity, recommender systems require + intensive tuning and feature engineering to perform at the state-of-the-art level–and reproducing + existing results can be challenging [45]. Hence, it is not surprising that a performance gap exists with + Deep&Cross–further fine-tuning may be necessary for DFA to reach BP performance. + Alignment measurements corroborate that learning is indeed occurring in the special layers of + Deep&Cross and AFN–see supplementary for details. Our results on recommender systems support + that DFA can learn in a large variety of settings, and that weight transport is not necessary to solve a + difficult recommendation task. + + + Table 2: AUC (higher is better) and log loss (lower is better) of recommender systems trained on the + Criteo dataset [44]. Even in complex heterogeneous architectures, DFA performance is in line with + BP. Values inboldindicate DFA AUC within 0.001 from the BP AUC or better. + + FM DeepFM Deep&Cross AFN + BP DFA BP DFA BP DFA + AUC 0.7915 0.7954 0.7956 0.8104 0.8009 0.7933 0.7924 + Loss 0.4687 0.4610 0.4624 0.4414 0.4502 0.4630 0.4621 + + + 5 3.2 Geometric Learning with Graph Convolutional Networks + + The use of sophisticated architectures beyond fully connected layers is necessary for certain tasks, + such asgeometric learning[46], where information lies in a complex structured domain. To address + geometric learning tasks, methods capable of handling graph-based data are commonly needed. + Graph convolutional neural networks (GCNNs) [47–50] have demonstrated the ability to process + large-scale graph data efficiently. 
We study the applicability of DFA to these methods, including recent architectures based on an attention mechanism. Overall, this is an especially interesting setting, as DFA fails to train more classic 2D image convolutional layers [23].

 Background  Complex data like social networks or brain connectomes lie on irregular or non-Euclidean domains. They can be represented as graphs, and efficient processing in the spectral domain is possible. Non-spectral techniques to apply neural networks to graphs have also been developed [51–53], but they exhibit unfavorable scaling properties. The success of CNNs in deep learning can be attributed to their ability to efficiently process structured high-dimensional data by sharing local filters. Thus, a generalization of the convolution operator to the graph domain is desirable: [47] first proposed a spectral convolution operation for graphs, and [48] introduced a form of regularization to enforce spatial locality of the filters. We use DFA to train different such GCNN implementations. We study both spectral and non-spectral convolutions, as well as methods inspired by the attention mechanism. We consider the task of semi-supervised node classification: nodes from a graph are classified using their relationship to other nodes as well as node-wise features.

 Setting  Fast Localized Convolutions (ChebConv) [49] approximate the graph convolution kernel with Chebyshev polynomials, and are one of the first scalable convolution methods on graphs. Graph Convolutions (GraphConv) [50] remove the need for an explicit parametrization of the kernel by enforcing linearity of the convolution operation on the graph Laplacian spectrum. It is often considered as the canonical graph convolution. More recent methods do not operate in the spectral domain. Spline Convolutions (SplineConv) [54] use a spline-based kernel, enabling the inclusion of information about the relative positioning of nodes, enhancing their representational power–for instance in the context of 3D meshes. Graph Attention Networks (GATConv) [55] use self-attention [56] layers to enable predictions at a given node to attend more specifically to certain parts of its neighborhood. Finally, building upon Jumping Knowledge Network [57], Just Jump (DNAConv) [58] uses multi-head attention [59] to enhance the aggregation process in graph convolutions and enable deeper architectures. We use PyTorch Geometric [60] for the reference implementations of all of these methods. We evaluate performance on three citation network datasets: Cora, CiteSeer, and PubMed [61].
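 As context for the results below, here is a minimal NumPy sketch of the propagation rule behind the canonical GraphConv of [50] described above; the toy graph, feature sizes, and names are illustrative only. In the experiments, such layers are updated with the random projection of the global error from Eq. (3) rather than with backpropagated gradients.

    import numpy as np

    def graph_conv(adj, feats, weights):
        # One GraphConv propagation step (Kipf & Welling): H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
        a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
        d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # symmetric degree normalization
        return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weights, 0.0)

    # Toy graph: 4 nodes, 3 input features per node, 2 output channels.
    rng = np.random.default_rng(0)
    adj = np.array([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 1.],
                    [0., 1., 1., 0.]])
    hidden = graph_conv(adj, rng.normal(size=(4, 3)), rng.normal(size=(3, 2)))  # shape (4, 2)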
Results  We report classification accuracy in Table 3. BP and DFA regularization and optimization hyperparameters are fine-tuned separately on the Cora dataset. In general, we find that less regularization and lower learning rates are needed with DFA. DFA successfully trains all graph methods, independent of whether they use the spectral domain or not, and even if they use attention. Furthermore, for GraphConv, SplineConv, and GATConv, DFA performance nearly matches BP. As GCNNs struggle with learning meaningful representations when stacking many layers [62], all architectures but DNAConv are quite shallow (two layers). However, DFA performance is still significantly higher than that of a shallow training method–see supplementary for details. The lower performance on DNAConv is not a failure to learn: alignment measurements show that learning is indeed occurring. It may be explained instead by a need for more in-depth fine-tuning, as this is a deep architecture with 5 successive attention layers.

 Table 3: Classification accuracy (%, higher is better) of graph convolution methods trained with BP and DFA, on citation networks [61]. But for ChebConv and DNAConv, DFA performance nearly matches BP performance. Values in bold when DFA is within 2.5% of BP.

              ChebConv     GraphConv    SplineConv   GATConv      DNAConv
              BP    DFA    BP    DFA    BP    DFA    BP    DFA    BP    DFA
    Cora      79.2  75.4   80.1  79.9   81.0  77.7   82.6  80.6   84.6  82.9
    CiteSeer  69.5  67.6   71.6  69.4   70.0  69.8   72.0  71.2   73.4  70.8
    PubMed    79.5  75.7   78.8  77.8   77.5  77.2   77.7  77.1   87.2  79.9

 Table 4: AUC and Average Precision (AP, higher is better) for a GraphConv GAE trained with BP or DFA on citation networks. DFA reproduces BP performance.

                      GAE
                      BP      DFA
    Cora      AUC     0.918   0.900
              AP      0.918   0.900
    CiteSeer  AUC     0.886   0.879
              AP      0.895   0.889
    PubMed    AUC     0.967   0.945
              AP      0.966   0.945

 Figure 2: t-SNE visualization of the hidden layer activations of a two-layer GraphConv trained on Cora with DFA. Classes form clear clusters, indicating that a useful intermediary representation is learned. Colors represent different classes.

 We further demonstrate that DFA helps graph convolutions learn meaningful representations by applying t-SNE [63,64] to the hidden layer activations in GraphConv (Figure 2). Clusters of classes are well-separated, indicating that a useful intermediary representation is derived by the first layer.

 Graph autoencoders  We consider one last application of graph convolutions, in the context of graph autoencoders (GAE). We train a non-probabilistic GAE [65] based on GraphConv with DFA, and report results in Table 4. DFA performance is always in line with BP.

 3.3 Natural Language Processing with Transformers

 We complete our study by training a Transformer [59] on a language modelling task. Transformers have proved successful in text, image, music generation, machine translation, and many supervised NLP tasks [59,66–69]. Here, we demonstrate that DFA can train them, and we show the influence of tuning the optimizer hyperparameters in narrowing the gap with BP.

 Background  NLP has largely benefited from advances in deep learning. Recurrent Neural Networks were responsible for early breakthroughs, but their sequential nature prevented efficient parallelization of data processing. Transformers are attention-based models that do not rely on recurrence or convolution. Their ability to scale massively has allowed the training of models with several billion parameters [70,71], obtaining state-of-the-art results on all NLP tasks: Transformers now top the prominent SQuAD 2.0 [72,73] and SuperGLUE [74] benchmarks. In parallel, transfer learning in NLP has leaped forward thanks to language modelling, the unsupervised task of predicting the next word. It can leverage virtually unlimited data from web scraping [75]. This enabled the training of universal language models [76] on extremely large and diversified text corpora. These models are useful across a wide range of domains, and can solve most NLP tasks after fine-tuning.

 Setting  The prominence of both language modelling and Transformers gives us the ideal candidate for our NLP experiments: we train a Transformer to predict the next word on the WikiText-103 dataset [77], a large collection of good and featured Wikipedia articles. We use byte-pair-encoding [78] with 32,000 tokens. Our setup is similar to GPT [66]: we adapt the Transformer, originally an encoder-decoder model designed for machine translation, to language modelling. We keep only the encoder and mask the tokens to predict. Our architecture consists of 6 layers, 8 attention heads, a model dimension of 512, and a hidden size of 2048 in the feed-forward blocks. The text is sliced in chunks of 128 tokens and batches of 64 such chunks, resulting in 8192 tokens per batch. Our baseline is trained with BP using the optimization setup of [59]. We found perplexity after 20 epochs to be an excellent indicator of perplexity at convergence; to maximize the number of experiments we could perform, we report the best validation perplexity after 20 epochs. We study two ways of implementing DFA: applying the feedback after every encoder block (macro) or after every layer in those blocks (micro). The input embedding layer receives gradients from the next feedback point through BP. This leaves some amount of weight transport even in the micro case.
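 To make the macro and micro schemes concrete, the short Python sketch below lists where the fixed random feedback matrices are attached in each case; the module names are illustrative only and do not reflect the actual implementation.

    def dfa_feedback_points(n_blocks, mode):
        # List the points of the encoder stack at which the global error is projected
        # by a fixed random matrix. "macro": one projection per encoder block;
        # "micro": one projection after each sub-layer (attention and feed-forward).
        # Modules between two feedback points (e.g. the input embedding) receive
        # gradients backpropagated locally from the nearest downstream point.
        points = []
        for b in range(n_blocks):
            if mode == "macro":
                points.append(f"block_{b}.output")
            elif mode == "micro":
                points.append(f"block_{b}.attention_output")
                points.append(f"block_{b}.feedforward_output")
            else:
                raise ValueError("mode must be 'macro' or 'micro'")
        return points

    print(dfa_feedback_points(6, "macro"))   # 6 feedback points for the 6-layer model above
    print(dfa_feedback_points(6, "micro"))   # 12 feedback points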
Table 5: Best validation perplexity after 20 epochs of a Transformer trained on WikiText-103 (lower is better). The BP and DFA baselines share all hyper-parameters. In Macro the feedback is applied after every transformer layer, while in Micro the feedback is applied after every sub-layer. The learning rate of Adam without the learning rate scheduler is 5·10^-5. With the scheduler, the initial learning rate is 1·10^-4 and it is multiplied by 0.2 when performance plateaus, with a patience of 1.
 * score after 22 epochs to let the learning rate scheduler take effect

              DFA                                                BP
              Baseline  + Adam  + β2 = 0.999  + LR schedule      Baseline  + β2 = 0.999
    Macro     95.0      77.1    55.0          52.0               34.4      29.8
    Micro     182       166     99.9          93.3*

 Results  Our results are summarized in Table 5. Hyper-parameters fine-tuned for BP did not fare well with DFA, but changes in the optimizer narrowed the gap between BP and DFA considerably. The learning rate schedule used on top of Adam [79] in [59] proved detrimental. Using Adam alone required reducing the learning rate between BP and DFA. Increasing β2 from 0.98 [59] to 0.999 improved performance significantly. Finally, a simple scheduler that reduces the learning rate when the validation perplexity plateaus helped reduce it further. Considering that the perplexity of the shallow baseline is over 400, DFA is clearly able to train Transformers. However, our results are not on par with BP, especially in the micro setting. A substantial amount of work remains to make DFA competitive with BP, even more so in a minimal weight transport scenario. The large performance improvements brought by small changes in the optimizer indicate that intensive fine-tuning, common in publications introducing state-of-the-art results, could close the gap between BP and DFA.

 4 Conclusion and outlooks

 We conducted an extensive study demonstrating the ability of DFA to train modern architectures. We considered a broad selection of domains and tasks, with complex models featuring graph convolutions and attention. Our results on large networks like NeRF and Transformers are encouraging, suggesting that with further tuning, such leading architectures can be effectively trained with DFA. Future work on principled training with DFA–in particular regarding the influence of common practices and whether new procedures are required–will help close the gap with BP.
+ More broadly, we verified for the first time that learning under synaptic asymmetry is possible beyond + fully-connected layers, and in tasks significantly more difficult than previously considered. This + addresses a notable concern in biologically-plausible architectures. DFA still requires an implausible + global feedback pathway; however, local training has already been demonstrated at scale. The next + step towards biologically-compatible learning is a local method without weight transport. + While the tasks and architectures we have considered are not biologically inspired, they constitute + a good benchmark forbehavioural realism[20]. Any learning algorithm claiming to approximate + the brain should reproduce its ability to solve complex and unseen task. Furthermore, even though + the current implementation of mechanisms like attention is devoid of biological considerations, they + represent broader concepts applicable to human brains [80]. Understanding how our brain learns is a + gradual process, and future research could incorporate further realistic elements, like spiking neurons. + Finally, unlocking the backward pass in large architectures like Transformers is promising. More opti- + mized implementation of DFA–built at a lower-level of existing ML libraries–could unlock significant + speed-up. Leveraging the use of a single random projection as the cornerstone of training, dedicated + accelerators may employ more exotic hardware architectures. This will open new possibilities in the + asynchronous training of massive models. + + + + + + + + + + + 8 Broader Impact + + Of our survey This study is the first experimental validation of DFA as an effective training method + in a wide range of challenging tasks and neural networks architectures. This significantly broadens the + applications of DFA, and more generally brings new insight on training techniques alternative to back- + propagation. From neural rendering and recommender systems, to natural language processing or + geometric learning, each of these applications has its own potential impact. Our task selection process + was motivated by current trends in deep learning, as well as by technically appealing mechanisms + (graph convolutions, attention). A limit of our survey is that our–arguably biased–selection of tasks + cannot be exhaustive. Our experiments required substantial cloud compute resources, with state-of- + the-art GPU hardware. Nevertheless, as this study provides new perspectives for hardware accelerator + technologies, it may favor the application of neural networks in fields previously inaccessible because + of computational limits. Future research on DFA should continue to demonstrate its use in novel + contexts of interest as they are discovered. + + Of the considered applications Each of the applications considered in our study has a wide + potential impact, consider for example the impact of textual bias in pretrained word embeddings [81]. + We refer to [82] and references therein for a discussion of ethical concerns of AI applications. + + Of DFA as a training method DFA enables parallelization of the backward pass and places a + single operation at the center of the training process, opening the prospect of reducing the power + consumption of training chips by an order of magnitude [28]. Not only is more efficient training a + path to more environmentally responsible machine learning [83], but it may lower the barrier of entry, + supporting equality and sustainable development goals. 
A significant downside of moving from BP to + DFA is a far more limited understanding of how to train models and how the trained models behave. + There is a clear empirical understanding of the impact of techniques such as batch normalization + or skip connections on the performance of BP; new insights need to be obtained for DFA. BP also + enjoys decades of works on topics like adversarial attacks, interpretability, and fairness. Much of + this work has to be cross-checked for alternative training methods, something we encourage further + research to consider as the next step towards safely and responsively scaling up DFA. + + Of biologically motivated method Finally, a key motivation for this study was to demonstrate that + learning challenging tasks was possible without weight transport. Biologically motivated methods + are a more foundational research direction, and as such the possible long-term impact of our findings + is harder to estimate under this light. However, fundamental research of this kind is important to open + new pathways for ML and neuroscience. + + Acknowledgments and Disclosure of Funding + + We thank Igor Carron and Laurent Daudet for the general guidance on the subject of this investigation + and the insightful comments, as well as the larger LightOn team for their support. + + References + [1]P. J. Werbos.Beyond Regression: New Tools for Prediction and Analysis in the Behavioral + Sciences. PhD thesis, Harvard University, 1974. + [2]D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error + propagation. InParallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986. + [3]Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, + David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. + InProceedings of the 34th International Conference on Machine Learning-Volume 70, pages + 1627–1635, 2017. + [4]Francis Crick. The recent excitement about neural networks.Nature, 337(6203):129–132, 1989. + [5]Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep + learning and neuroscience.Frontiers in computational neuroscience, 10:94, 2016. + + 9 [6]Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. + Cognitive science, 11(1):23–63, 1987. + [7]Javier R Movellan. Contrastive hebbian learning in the continuous hopfield model. InConnec- + tionist models, pages 10–17. Elsevier, 1991. + [8]Randall C O’Reilly. Biologically plausible error-driven learning using local activation differ- + ences: The generalized recirculation algorithm.Neural computation, 8(5):895–938, 1996. + [9]Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. InArtificial intelligence + and statistics, pages 448–455, 2009. + [10]Yann Le Cun. Learning process in an asymmetric threshold network. InDisordered systems + and biological organization, pages 233–240. Springer, 1986. + [11]Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target + propagation.arXiv preprint arXiv:1407.7906, 2014. + [12]Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propaga- + tion. InJoint european conference on machine learning and knowledge discovery in databases, + pages 498–515. Springer, 2015. + [13]Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. 
Random synap- + tic feedback weights support error backpropagation for deep learning.Nature communications, + 7(1):1–10, 2016. + [14]Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can + scale to imagenet. InInternational Conference on Machine Learning, pages 583–593, 2019. + [15]Wojciech M Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan + Pascanu. Sobolev training for neural networks. InAdvances in Neural Information Processing + Systems, pages 4278–4287, 2017. + [16]Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In + International Conference on Machine Learning, pages 4839–4850, 2019. + [17]R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, + Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information + estimation and maximization. InInternational Conference on Learning Representations, 2019. + URLhttps://openreview.net/forum?id=Bklr3j0cKX. + [18]Sindy Löwe, Peter O’Connor, and Bastiaan Veeling. Putting an end to end-to-end: Gradient- + isolated learning of representations. InAdvances in Neural Information Processing Systems, + pages 3033–3045, 2019. + [19] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In + Advances in neural information processing systems, pages 1037–1045, 2016. + [20]Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy + Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and + architectures. InAdvances in Neural Information Processing Systems, pages 9368–9378, 2018. + [21]Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. + Backpropagation and the brain.Nature Reviews Neuroscience, pages 1–12, 2020. + [22]Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a hebbian learning rule. + Annu. Rev. Neurosci., 31:25–46, 2008. + [23]Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with + direct feedback alignment.arXiv preprint arXiv:1906.04554, 2019. + [24]Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in back- + propagation? InThirtieth AAAI Conference on Artificial Intelligence, 2016. + + 10 [25]Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep + convolutional networks.arXiv preprint arXiv:1812.06488, 2018. + + [26]Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning + algorithms can scale to large datasets. InInternational Conference on Learning Representations, + 2019. URLhttps://openreview.net/forum?id=SygvZ209F7. + + [27]Mohamed Akrout, Collin Wilson, Peter C Humphreys, Timothy Lillicrap, and Douglas Tweed. + Using weight mirrors to improve feedback alignment.arXiv preprint arXiv:1904.05391, 2019. + + [28]Julien Launay, Iacopo Poli, Kilian Müller, Igor Carron, Laurent Daudet, Florent Krzakala, and + Sylvain Gigan. Light-in-the-loop: using a photonics co-processor for scalable training of neural + networks, 2020. + + [29]Charlotte Frenkel.Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling + Roads to Embedded Cognition. PhD thesis, UCL-Université Catholique de Louvain, 2020. + + [30]Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis.ACM Transactions on + Graphics (TOG), 36(6):1–11, 2017. 
+ + [31]John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, + Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. + InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages + 2367–2376, 2019. + + [32]Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi + Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis + with prescriptive sampling guidelines.ACM Transactions on Graphics (TOG), 38(4):1–14, + 2019. + + [33]Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael + Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. InProceedings of the IEEE + Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019. + + [34]Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and + Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images.ACM + Transactions on Graphics (TOG), 38(4):65, 2019. + + [35]Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: + Continuous 3d-structure-aware neural scene representations. InAdvances in Neural Information + Processing Systems, pages 1119–1130, 2019. + + [36]Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, + and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.arXiv + preprint arXiv:2003.08934, 2020. + + [37]H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, + Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view + from the trenches. InProceedings of the 19th ACM SIGKDD international conference on + Knowledge discovery and data mining, pages 1222–1230, 2013. + + [38]Steffen Rendle. Factorization machines. In2010 IEEE International Conference on Data + Mining, pages 995–1000. IEEE, 2010. + + [39]Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, + Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for + recommender systems. InProceedings of the 1st workshop on deep learning for recommender + systems, pages 7–10, 2016. + + [40]Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a + factorization-machine based neural network for ctr prediction.arXiv preprint arXiv:1703.04247, + 2017. + + 11 [41]Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click + predictions. InProceedings of the ADKDD’17, ADKDD’17, New York, NY, USA, 2017. + Association for Computing Machinery. ISBN 9781450351942. doi: 10.1145/3124749.3124754. + URLhttps://doi.org/10.1145/3124749.3124754. + [42]Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning + adaptive-order feature interactions. InThirty-Fourth AAAI Conference on Artificial Intelligence, + 2020. + [43]J Wesley Hines. A logarithmic neural network architecture for unbounded non-linear function + approximation. InProceedings of International Conference on Neural Networks (ICNN’96), + volume 2, pages 1245–1250. IEEE, 1996. + [44]Criteo. Kaggle contest dataset is now available for academic use!http://labs.criteo.com/ + 2014/09/kaggle-contest-dataset-now-available-academic-use/, 2014. accessed + on the 2020-05-20. + [45]Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much + progress? 
a worrying analysis of recent neural recommendation approaches. InProceedings of + the 13th ACM Conference on Recommender Systems, pages 101–109, 2019. + [46]Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. + Geometric deep learning: going beyond euclidean data.IEEE Signal Processing Magazine, 34 + (4):18–42, 2017. + [47]Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally + connected networks on graphs. InInternational Conference on Learning Representations, pages + http–openreview, 2014. + [48]Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured + data.arXiv preprint arXiv:1506.05163, 2015. + [49]Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks + on graphs with fast localized spectral filtering. InAdvances in neural information processing + systems, pages 3844–3852, 2016. + [50]Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional + networks. InInternational Conference on Learning Representations (ICLR), 2017. + [51]Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph + domains. InProceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., + volume 2, pages 729–734. IEEE, 2005. + [52]Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. + The graph neural network model.IEEE Transactions on Neural Networks, 20(1):61–80, 2008. + [53]Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural + networks. InInternational Conference on Learning Representations, 2016. + [54]Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric + deep learning with continuous b-spline kernels. InProceedings of the IEEE Conference on + Computer Vision and Pattern Recognition, pages 869–877, 2018. + [55]Petar Velickoviˇ c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua´ + Bengio. Graph attention networks. InInternational Conference on Learning Representations, + 2018. URLhttps://openreview.net/forum?id=rJXMpikCZ. + [56] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly + learning to align and translate. In3rd International Conference on Learning Representations, + ICLR 2015, 2015. + [57]Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural + networks? InInternational Conference on Machine Learning, 2018. + + 12 [58]Matthias Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. In + ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. + + [59]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information + processing systems, pages 5998–6008, 2017. + + [60]Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. + InICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. + + [61]Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi- + Rad. Collective classification in network data.AI magazine, 29(3):93–93, 2008. + + [62]Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural + networks? InInternational Conference on Learning Representations, 2019. 
URLhttps: + //openreview.net/forum?id=ryGs6iA5Km. + + [63]Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine + learning research, 9(Nov):2579–2605, 2008. + + [64]David M Chan, Roshan Rao, Forrest Huang, and John F Canny. Gpu accelerated t-distributed + stochastic neighbor embedding.Journal of Parallel and Distributed Computing, 131:1–13, + 2019. + + [65]Thomas N Kipf and Max Welling. Variational graph auto-encoders.NIPS Workshop on Bayesian + Deep Learning, 2016. + + [66]Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. Improving language + understanding with unsupervised learning.Technical report, OpenAI, 2018. + + [67]Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, + and Dustin Tran. Image transformer.ArXiv, abs/1802.05751, 2018. + + [68]Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya + Sutskever. Jukebox: A generative model for music.arXiv preprint arXiv:2005.00341, 2020. + + [69]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of + deep bidirectional transformers for language understanding. InProceedings of the 2019 Confer- + ence of the North American Chapter of the Association for Computational Linguistics: Human + Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, + Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. + URLhttps://www.aclweb.org/anthology/N19-1423. + + [70]Mohammad Shoeybi, Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and + Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model + parallelism.ArXiv, abs/1909.08053, 2019. + + [71]Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, + Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are + few-shot learners.arXiv preprint arXiv:2005.14165, 2020. + + [72]Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ + questions for machine comprehension of text. InProceedings of the 2016 Conference on + Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, Novem- + ber 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL + https://www.aclweb.org/anthology/D16-1264. + + [73]Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable + questions for SQuAD. InProceedings of the 56th Annual Meeting of the Association for + Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia, + July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL + https://www.aclweb.org/anthology/P18-2124. + + 13 [74]Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix + Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose + language understanding systems. InAdvances in Neural Information Processing Systems, pages + 3261–3275, 2019. + [75]The Common Crawl Team. Common Crawl.https://commoncrawl.org, 2020. + [76]Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classifica- + tion. InACL. Association for Computational Linguistics, 2018. URLhttp://arxiv.org/ + abs/1801.06146. + [77]Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture + models.ArXiv, abs/1609.07843, 2017. 
+ [78]Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare + words with subword units. InProceedings of the 54th Annual Meeting of the Association + for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, + August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL + https://www.aclweb.org/anthology/P16-1162. + [79]Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.International + Conference on Learning Representations, 12 2014. + [80]Grace W Lindsay. Attention in psychology, neuroscience, and machine learning.Frontiers in + Computational Neuroscience, 14:29, 2020. + [81]Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. + Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In + Advances in neural information processing systems, pages 4349–4357, 2016. + [82]Alexandra Luccioni and Yoshua Bengio. On the morality of artificial intelligence.arXiv preprint + arXiv:1912.11945, 2019. + [83]Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for + deep learning in nlp.arXiv preprint arXiv:1906.02243, 2019. + [84]Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: + Rethinking self-attention in transformer models.arXiv preprint arXiv:2005.00743, 2020. + [85]Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, + and Jiawei Han. On the variance of the adaptive learning rate and beyond.arXiv preprint + arXiv:1908.03265, 2019. + [86]Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns + in transformer-based machine translation.arXiv preprint arXiv:2002.10260, 2020. + [87]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, + Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas + Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, + Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high- + performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché- + Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32, + pages 8024–8035. Curran Associates, Inc., 2019. URLhttp://papers.neurips.cc/paper/ + 9015-pytorch-an-imperative-style-high-performance-deep-learning-library. + pdf. + + + + + + + + + + + 14 Appendix + + + We first provide additional elements to corroborate our findings: alignment measurement (Section + A), and shallow baselines (Section B). We then discuss the process of adapting the considered + architectures for DFA (Section C), and the issue of weight transport in attention layers (Section D). + We provide some supplementary results for NeRF (Section E), including details of performance on + each scene of each datatset, and a discussion on possible mitigation of DFA shortcomings. Finally, + we outline steps necessary for reproduction of this work (Section F). + + A Alignment + + Alignment measurement In feedback alignment methods, the forward weights learn toalignwith + the random backward weights, making the delivered updates useful. This alignment can be quantified + by measuring the cosine similarity between the gradient signal delivered by DFABi ay and the + gradient signal BP would have deliveredWT ai+1 i+1 . 
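 In concrete terms, this is a per-layer cosine similarity between the DFA feedback B_i δa_y and the BP gradient W_{i+1}^T δa_{i+1}; a minimal NumPy sketch is given below, with illustrative layer widths and variable names that do not come from the actual measurement code.

    import numpy as np

    def alignment(B_i, delta_a_y, W_next, delta_a_next):
        # Cosine similarity between the DFA feedback B_i @ delta_a_y and the gradient
        # signal BP would have delivered, W_{i+1}^T @ delta_a_{i+1}.
        dfa_signal = B_i @ delta_a_y
        bp_signal = W_next.T @ delta_a_next
        return float(dfa_signal @ bp_signal /
                     (np.linalg.norm(dfa_signal) * np.linalg.norm(bp_signal)))

    # Toy shapes: layer of width 16, next layer of width 8, 4 output units.
    rng = np.random.default_rng(0)
    cos = alignment(rng.normal(size=(16, 4)), rng.normal(size=4),
                    rng.normal(size=(8, 16)), rng.normal(size=8))
    # For untrained random weights this value is close to 0; it grows as the forward
    # weights align with the fixed feedback matrices during training.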
For learning to occur and DFA to work as + a training method, there must be alignment. This can be measured numerically [23]. Measuring + alignments allows to check whether or not the layers are effectively being trained by DFA, regardless + of performance metrics. We note that any alignment value superior to 0 signifies that learning is + occuring. Values closer to 1 indicate a better match with BP, but small alignment values are sufficient + to enable learning. We report values measured at the deepest DFA layer. + + Recommender systems We measure alignment on the Criteo dataset, in the two architectures + featuring non-conventional fully-connected layers: Deep & Cross and AFN. Alignment is measured + after 15 epochs of training, and averaged over a random batch of 512 samples. Results are reported in + table A.1. These alignment measurements indicate that learning is indeed occurring in the cross and + logarithmic layers. High-variance of alignment in the cross layers is unique: it may be explained by + the absence of non-linearity, and account for the difference in performance between BP and DFA on + this architecture–which is higher than on the others. + + Table A.1: Alignment cosine similarity (higher is better, standard deviation in parenthesis) of + recommender systems as measured on the Criteo dataset. Learning occurs in both architectures, and + high variance may explain the larger performance gap on Deep & Cross compared to other methods. + + Deep & Cross AFN + Alignment 0.40 (0.91) 0.49 (0.08) + + + Graph convolutions We measure alignment on the Cora dataset, after 250 epochs of training, + averaging values over every sample available–train, validation, and test split included. Results are + reported in Table A.2. We observe high alignment values in all architectures, indicative that learning + is indeed occuring. Slightly lower values in SplineConv and GATConv may be explained by the use + of the Exponential Linear Unit (ELU) instead of the Rectified Linear Unit (ReLU) used as activation + in other architectures. + Table A.2: Alignment cosine similarity (standard deviation in parenthesis) of various graph convolu- + tions architectures as measured on the Cora dataset. These values corroborate that DFA successfully + trains all architectures considered. + + ChebConv GraphConv SplineConv GATConv DNAConv + Alignment 0.87 (0.12) 0.77 (0.25) 0.56 (0.22) 0.63 (0.18) 0.92 (0.30) + + + B Shallow baselines + + Shallow learning We compare DFA to BP, but also to shallow learning–where only the topmost + layer is trained. While DFA may not reach the performance level of BP, it should still vastly + + 15 Figure A.1: Comparisons of Tiny-NeRF trained with BP, DFA, and a shallow approach. Shallow + training is insufficient to learn scene geometry. Lego scene from the NeRF synthetic dataset. + + + outperform shallow learning: failure to do so would mean that the weight updates delivered by DFA + are useless. On a simple task like MNIST, a shallow baseline may be as high as 90%. However, given + the difficulty of the tasks we consider, the shallow baseline is here usually much lower. + + NeRF Because NeRF models are expensive to train–up to 15 hours on a V100–we consider a + simplified setup for the shallow baseline, NeRF-Tiny. This setup operates at half the full resolution + of the training images available, runs for 5000 iterations only, and does away with view-dependant + characteristics. 
Furthermore, the network is cut down to 3 layers of half the width of NeRF, and + no coarse network is used to inform the sampling. We train this network on the Lego scene of the + NeRF-Synthetic dataset, and compare results. + Figure A.1 presents renders generated by NeRF-Tiny trained with BP, DFA, and a shallow approach. + While BP and DFA delivers similar renders, shallow training fails to reproduce even basic scene + geometry, instead outputting a diffuse cloud of colors. This highlights that while DFA may not reach + a level of performance on-par with BP on NeRF, it nonetheless delivers meaningful updates enabling + the learning of complex features. + + Recommender systems Because recommender systems require fine-tuning, we perform the same + hyperparameter search for shallow learning than for DFA and BP. Results are detailed in Table A.3. + Performance of shallow training is always well under BP and DFA–remember that0.001-levelmatter + in recommender systems. In particular, in Deep & Cross, where there was the biggest gap between + BP and DFA, the performance of the shallow method is extremely poor, well below the FM baseline. + Finally, it is expected to see that DeepFM recovers more or less the performance of FM even with a + shallow baseline. + + Table A.3: Shallow baseline for recommender system models on the Criteo dataset. Performance is + always well below BP and DFA, as expected. + + DeepFM Deep&Cross AFN + AUC 0.7920 0.7324 0.7859 + Loss 0.4682 0.5010 0.4685 + + + Graph convolutions We use the same hyperparameters as for DFA to produce the shallow baseline + on graph datasets. Results are reported in Table A.4. Performance is always much worse than BP + and DFA. GATConv recovers the best performance: random attention layers may still deliver useful + features [84], as do random convolutions. + + Transformers In the baseline setting (optimizer and hyper-parameters of [59]), a Transformer + trained in the shallow regime yields a perplexity of 428 on WikiText-103. We do not consider + + 16 Table A.4: Shallow baseline for GCNNs on Cora, CiteSeer, and PubMed [61]. Performance is always + well below BP and DFA. + + ChebConv GraphConv SplineConv GATConv DNAConv + Cora 23.3 37.0 39.6 59.4 30.2 + CiteSeer 27.4 33.8 30.1 49.8 24.0 + PubMed 37.6 44.8 44.2 67.8 42.2 + + + + other settings, as the cost of training a Transformer is high and we do not expect any meaningful + improvements–as with NeRF above. + + + C Adapting architectures to DFA + + NeRF We use an architecture identical to the one used in [36], but based on the effective code + implementation rather than the description in the paper 1 . During our tests, we have found that + lowering the learning rate to1:10 4 rather than5:10 4 works best with DFA. + + + Recommender systems For all training methods (BP, DFA, and shallow), we have conducted + independent hyperparameter searches. We performed a grid search over the learning rate, from + 1:10 4 to1:10 3 in1:10 4 steps, as well as over the dropout probability, from0:1to0:5in0:1steps + (where applicable). On DeepFM, this search leads to reduce the learning rate from3:10 4 with BP + to5:10 5 with DFA, but to keep the 0.5 dropout rate. On Deep & Cross, we reduce learning rate + from2:10 4 to5:10 5 , with no dropout in both cases. In AFN, we reduce dropout from4:10 4 to + 3:10 4 and dropout from 0.3 to 0. + + + Graph convolutions We manually test for a few hyperparameters configuration on the Cora dataset, + focusing on learning rate, weight decay, and dropout. 
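+
+ For concreteness, a minimal sketch of the kind of hyperparameter search used here is given
+ below, with illustrative ranges following the recommender-systems grid described above; the
+ train-and-evaluate routine is a hypothetical placeholder.
+
+     import itertools
+
+     # Learning rate 1e-4 .. 1e-3 in 1e-4 steps, dropout 0.1 .. 0.5 in 0.1 steps.
+     learning_rates = [i * 1e-4 for i in range(1, 11)]
+     dropouts = [i * 0.1 for i in range(1, 6)]
+
+     def evaluate(lr, dropout):
+         """Placeholder: train with these settings and return the validation metric."""
+         raise NotImplementedError
+
+     best = None
+     for lr, p in itertools.product(learning_rates, dropouts):
+         score = evaluate(lr, p)
+         if best is None or score > best[0]:
+             best = (score, lr, p)
+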
We do not consider architectural changes, such + as changing the number of filters or of attention heads. For ChebConv and GraphConv, we reduce + weight decay to1:10 4 instead of5:10 4 , and set the dropout rate to0and0:1respectively, instead + of0:5with BP. For SplineConv, we find that no change in the hyperparameters are necessary. For + GATConv, we reduce weight decay to1:10 4 instead of5:10 4 and reduce dedicated dropout layer + to0:1instead of0:6but keep the0:6dropout rate within the GAT layer. Finally, on DNAConv we + disable weight decay entirely, instead of an original value of5:10 4 , double the learning rate from + 5:10 3 to1:10 2 , and disable dropout entirely. In all cases, we share the backward random matrix + across all nodes in a graph. + + + Transformers The model hyper-parameters were fixed across all of our experiments, except for + the number of attention heads in one case, that we will precise below, and dropout. We tested several + values of dropout probability between 0 and 0.5, but found the original value of 0.1 to perform + best. We manually tested a number of optimizers, optimizer parameters and attention mechanisms. + We tested four combinations of optimizers and schedulers : Adam with the scheduler used in [59], + Adam alone, RAdam [85] alone, and Adam with a scheduler that reduces the learning rate when + the validation perplexity plateaus. We found it necessary to reduce the initial learning rate of Adam + from1:10 4 to5:10 5 , although it could be set back to1:10 4 with a scheduler. We tried two values + of2 : 0.98 and 0.999. We also tried to change1 and observed some small differences that were + not significant enough for the main text. Finally, we tried three attention mechanisms in addition to + the standard multihead scaled dot-product attention: the dense and random (learnable) Synthesizers + of [84], as well as the fixed attention patterns of [86]. The latter needed to be adapted to language + modelling to prevent attending to future tokens, which led us to reduced the number of attention + heads to 4. The backward random matrix is always shared across all tokens and batches. + + + 1 https://github.com/bmild/nerf/issues/11 + + 17 D Weight transport and attention + + We consider an attention layer operating on inputx. The queries, keys, and values are respectively + q=xW Q ;k=xW K ;v=xW V , anddk is the dimension of the queries and keys. The layer + performs: qk T + Attention(q;k;v) =softmax p v (4)dk + + When using DFA on attention, we deliver the random feedback to the top of the layer. Accordingly, + to obtain updates toWQ ;WK ;andWV we still to have to backpropagate through the attention + mechanism itself. This involves weight transport onWV , sacrificing some biological realism for + simplicity. Overall weight transport between layers still does not occur, and updating the layers in + parallel remains possible. + Beside using FA or DFA within the attention layer, alternative mechanisms like the synthesizer + [84]–which uses random attention in place of the query and key system–or fixed attention [86] can + remove the need for weight transport. Implementing these mechanisms in DFA-trained Transformers, + or other attention-powered architectures, will require further research. + + + E Supplementary NeRF results + + Quantitative results We report per-scene scores for each dataset in Table A.5. BP values are taken + from [36]. On three scenes of the synthetic datasets, NeRF-DFA even outperforms past state-of-the-art + methods trained with BP. 
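+
+ The per-scene comparisons in Table A.5 are reported as PSNR; for reference, a minimal
+ NumPy sketch of how PSNR between a render and its ground truth is typically computed
+ (assuming pixel values in [0, 1]):
+
+     import numpy as np
+
+     def psnr(render, ground_truth, max_val=1.0):
+         """Peak signal-to-noise ratio (dB) between two images of equal shape."""
+         diff = render.astype(np.float64) - ground_truth.astype(np.float64)
+         mse = np.mean(diff ** 2)
+         return 10.0 * np.log10(max_val ** 2 / mse)
+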
Note that Neural Volumes (NV) is not applicable to forward-facing view + synthesis–as is required in LLFF-Real–and thus no results are reported. + + Qualitative results We report sample renders from the NeRF-Synthetic dataset (Figure A.2) and + the LLFF-Real dataset (Figure A.2), for every scene available. However, we recommend readers to + consult the supplementary video to make better sense of characteristics like multi-view consistency + and view-dependent effects (most visible on the LLFF-Real Room scene). + + + Table A.5: Per-scene PSNR for NeRF DFA and BP against other state-of-the-art methods on the + Nerf-Synthetic and LLFF-Real. DFA performance is fairly homogeneous across each dataset and in + line with the differences in other methods. + + NV SRN LLFF NeRF + BP BP BP BP DFA + NeRF-Synthetic 26.05 22.26 24.88 31.01 25.41 + Chair 28.33 26.96 28.72 33.00 28.74 + Drums 22.58 17.18 21.13 25.01 22.15 + Ficus 24.79 20.73 21.79 30.13 25.61 + Hotdog 30.71 26.81 31.41 36.18 28.03 + Lego 26.08 20.85 24.54 32.54 24.93 + Materials 24.22 18.09 20.72 29.62 25.15 + Mic 27.78 26.85 27.48 32.91 25.43 + Ship 23.93 20.60 23.22 28.65 23.25 + LLFF-Real 22.84 24.13 26.50 20.77 + Room 27.29 28.42 32.70 24.20 + Fern 21.37 22.95 25.17 21.82 + Leaves 18.24 19.52 20.92 16.50 + Fortress 26.63 29.40 31.16 25.16 + Orchids 17.37 18.52 20.36 16.73 + Flower 26.63 25.46 27.40 21.55 + T-Rex 22.87 24.15 26.80 19.43 + Horns 24.33 24.70 27.45 20.75 + + + 18 Possible future directions Despite retranscribing scene geometry in a multi-view consistent way, + NeRF produces renders of a lower quality when trained with DFA instead of BP. In particular, it + struggles to transcribe small-scale details, resulting in "blurry" renders. Moreover, it displays high- + frequency artefacts: not in the scene geometry, but in individual pixels taking values very distant from + their neighborhood. Interestingly, this noise phenomenon is unique to NeRF-DFA: it is not observed + on NeRF-BP with similar PSNR values (achieved during training) or on other methods with similar + or lower PSNR. This leads us to hypothesize this is an aspect unique to DFA, possibly due to the + alignment process. Indeed, DFA creates a bias on the weights, by encouraging them to be "aligned" + with an arbitrary values dependant on the random matrix used. It is possible this could introduce + random noise in the final renders–though we leave a more principled experiment to future research. + To attempt to alleviate this issue, we first consider NeRF-Dual. In NeRF-Dual, we average the + pixel-wise prediction between the fine and coarse network, to attempt to remove some of the noise. + To do so, we first still use the coarse network to create a probability distribution for the hierarchical + sampling. Then, we evaluate again both the coarse and fine networks at the locations informed by + this probability distribution. Compared to vanilla NeRF, this requires an extra batch of evaluation of + the coarse network for all rays–rougly speaking, this increases inference time by 30-50% depending + on the coarse network architecture considered. We note that this is not applied during training, so that + training times remain identical. + Figure A.2 and Figure A.3 showcase comparisons between NeRF and NeRF-Dual trained with DFA + on all scenes. When viewed at high resolution–such as in our supplementary video–the NeRF-Dual + renders are more pleasing, especially for the full scenes. They remove most of the high-frequency + noise, leading to smoother renders. 
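+
+ A minimal sketch of the NeRF-Dual inference step described above is given below; all
+ callables are hypothetical stand-ins for the usual NeRF components (uniform sampling,
+ hierarchical sampling, network evaluation and compositing), not the actual implementation.
+
+     def render_ray_dual(ray, coarse_model, fine_model,
+                         sample_uniform, sample_from_pdf, composite):
+         """One ray of NeRF-Dual; composite() is assumed to return (weights, rgb)."""
+         # Coarse pass, as in vanilla NeRF, to build the sampling distribution.
+         coarse_pts = sample_uniform(ray)
+         weights, _ = composite(coarse_model(coarse_pts))
+         # Hierarchical sampling informed by the coarse weights.
+         fine_pts = sample_from_pdf(ray, weights)
+         # Evaluate BOTH networks at the fine locations (the extra evaluation
+         # mentioned above) and average their pixel-wise predictions.
+         _, rgb_fine = composite(fine_model(fine_pts))
+         _, rgb_coarse = composite(coarse_model(fine_pts))
+         return 0.5 * (rgb_fine + rgb_coarse)
+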
However, this averaging process further blurs small-scale details in + the render. This is especially visible in the NeRF-Synthetic dataset, on scenes like Ficus. Furthermore, + NeRF-Dual introduces novel artefacts in the Mic and Ship scenes, with areas improperly colored + with a violet tint. The cause for these artefacts is unknown, but they show that NeRF-Dual is far from + a silver bullet. The PSNR is also minimally increased, by less than 0.5 per scene. Nevertheless, this + shows some promise in possibilities to allievate the shortcomings of NeRF-DFA. It is possible that + changes to the overall rendering process, or the use of classic image processing techniques, may help + enhance the NeRF-DFA images. + Finally, we also experimented with increasing the capacity of the fine network, by widening its layers + to 512 neurons. We call this architecture NeRF-XL. However, we have not succeeded in getting + PSNR values higher than with vanilla NeRF on DFA. In particular, the training process becomes + much more cumbersome, as multi-GPU parallelism is needed to fit the model. It is possible that + higher network capacity may help learning both the task at hand and to align simultaneously, but + further work is required. + + + F Reproducibility + + Hardware used All main experiments require at most a single NVIDIA V100 GPU with 16GB + of memory to reproduce. Alignment measurement on large architectures (NeRF and Transformers) + require a second identical GPU to keep a copy of the network to evaluate BP gradients. + We estimate that a total of around 10,000 GPU-hours on V100s were necessary for this paper. + Accordingly, we estimate the cloud-computing carbon impact of this paper to be of 1700 kgCO 2 eq 2 . + However, without hyperparameter searches, our results can be reproduced with less than 500 GPU- + hours on V100s, with most of that budget going to NeRF and Transformers. + + Implementation We use the shared random matrix trick from [23] to reduce memory use in DFA + and enable its scaling to large networks. We use PyTorch [87] for all experiments. For reference + implementation of the methods considered, we relied on various sources. Our NeRF implementation + is based on the PyTorch implementation by Krishna Murthy 3 , with modifications to allow for proper + test and validation, as well as DFA and multi-GPU support. For recommender systems, we use + + 2 https://mlco2.github.io/impact#compute + 3 https://github.com/krrish94/nerf-pytorch + + 19 thetorchfmpackage 4 . Finally, we use PyTorch Geometric [60] for all graph operations. Our + Transformer implementation is our own. Our code is available as supplementary material. + + NeRF We provide training, testing, and rendering code along with the configurations used to obtain + our results. An example to reproduce our results is given in the supplementary code repository. Given + the computing cost associated with training a NeRF, we also provide our trained models. + + Recommender systems We provide bash scripts to reproduce the results in Table 2 and A.3, with + the results of our hyperparameter search. We provide code to reproduce the results in Table A.1. + + Graph convolutions We provide the code to reproduce all of our results. Note that the t-SNE + results are not exactly reproducible, as the CUDA implementation used is non-deterministic. + + Transformers We provide bash scripts to reproduce Table 5 and the shallow results. 
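+
+ To illustrate the shared random matrix trick from [23] mentioned above, a minimal NumPy
+ sketch follows; the layer widths and output dimension are hypothetical.
+
+     import numpy as np
+
+     rng = np.random.default_rng(0)
+     layer_widths = [1024, 512, 256]      # hypothetical hidden-layer widths
+     output_dim = 10
+
+     # One random matrix is drawn once; each layer's feedback matrix B_i is a
+     # slice of it, so memory no longer grows with the number of layers.
+     B_shared = rng.standard_normal((max(layer_widths), output_dim))
+
+     def dfa_feedback(layer_index, error):
+         """Random feedback delivered to a layer for the output error vector."""
+         width = layer_widths[layer_index]
+         return B_shared[:width] @ error  # shape: (width,)
+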
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 4 https://github.com/rixwew/pytorch-fm + + 20 Figure A.2: Sample renders for every scene of the NeRF-Synthetic dataset, for NeRF and NeRF-Dual + trained with DFA. + + + + + + + + + + + 21 Figure A.3: Sample renders for every scene of the LLFF-Real dataset, for NeRF and NeRF-Dual + trained with DFA. + + + + + + + + + + + + 22 \ No newline at end of file diff --git a/Corpus/Efficient Behavior of Small-World Networks.txt b/Corpus/Efficient Behavior of Small-World Networks.txt new file mode 100644 index 0000000..18b01f0 Binary files /dev/null and b/Corpus/Efficient Behavior of Small-World Networks.txt differ diff --git a/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt b/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt new file mode 100644 index 0000000..319bda1 Binary files /dev/null and b/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt differ diff --git a/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt b/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt new file mode 100644 index 0000000..64f926a Binary files /dev/null and b/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt differ diff --git a/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt b/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt new file mode 100644 index 0000000..2c16ab6 --- /dev/null +++ b/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt @@ -0,0 +1,261 @@ + Energy and Policy Considerations for Deep Learning in NLP + + + Emma Strubell Ananya Ganesh Andrew McCallum + College of Information and Computer Sciences + University of Massachusetts Amherst + {strubell, aganesh, mccallum}@cs.umass.edu + + + + + + Abstract Consumption CO 2 e (lbs) + Air travel, 1 passenger, NY↔SF 1984 Recent progress in hardware and methodol- + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + arXiv:1906.02243v1 [cs.CL] 5 Jun 2019 Human life, avg, 1 year 11,023 ogy for training neural networks has ushered + in a new generation of large networks trained American life, avg, 1 year 36,156 + on abundant data. These models have ob- Car, avg incl. fuel, 1 lifetime 126,000 + tained notable gains in accuracy across many + NLP tasks. However, these accuracy improve- Training one model (GPU) + ments depend on the availability of exception- NLP pipeline (parsing, SRL) 39 ally large computational resources that neces- w/ tuning & experimentation 78,468 sitate similarly substantial energy consump- Transformer (big) 192 tion. As a result these models are costly to + train and develop, both financially, due to the w/ neural architecture search 626,155 + cost of hardware and electricity or cloud com- Table 1: Estimated COpute time, and environmentally,due to the car- 2 emissions from training com- + mon NLP models, compared to familiar consumption. 1 bon footprint required to fuel modern tensor + processing hardware. In this paper we bring + this issue to the attention of NLP researchers NLP models could be trained and developed on by quantifying the approximate financial and a commodity laptop or server, many now require environmental costs of training a variety of re- + cently successful neural network models for multiple instances of specialized hardware such as + NLP. 
Based on these findings, we propose ac- GPUs or TPUs, therefore limiting access to these + tionable recommendations to reduce costs and highly accurate models on the basis of finances. + improve equity in NLP research and practice. Even when these expensive computational re- + 1 Introduction sources are available, model training also incurs a + substantial cost to the environment due to the en- + Advances in techniques and hardware for train- ergy required to power this hardware for weeks or + ing deep neural networks have recently en- months at a time. Though some of this energy may + abled impressive accuracy improvements across come from renewable or carbon credit-offset re- + many fundamental NLP tasks ( Bahdanau et al., sources, the high energy demands of these models + 2015; Luong et al., 2015; Dozat and Man- are still a concern since (1) energy is not currently + ning, 2017; Vaswani et al., 2017), with the derived from carbon-neural sources in many loca- + most computationally-hungry models obtaining tions, and (2) when renewable energy is available, + the highest scores (Peters et al.,2018;Devlin et al., it is still limited to the equipment we have to pro- + 2019;Radford et al.,2019;So et al.,2019). As duce and store it, and energy spent training a neu- + a result, training a state-of-the-art model now re- ral network might better be allocated to heating a + quires substantial computational resources which family’s home. It is estimated that we must cut + demand considerable energy, along with the as- carbon emissions by half over the next decade to + sociated financial and environmental costs. Re- deter escalating rates of natural disaster, and based + search and development of new models multiplies on the estimated CO 2 emissions listed in Table 1, + these costs by thousands of times by requiring re- + training to experiment with model architectures 1 Sources: (1) Air travel and per-capita consump- + tion: https://bit.ly/2Hw0xWc; (2) car lifetime: and hyperparameters. Whereas a decade ago most https://bit.ly/2Qbr0w1. model training and development likely make up Consumer Renew. Gas Coal Nuc. + a substantial portion of the greenhouse gas emis- China 22% 3% 65% 4% + sions attributed to many NLP researchers. Germany 40% 7% 38% 13% + To heighten the awareness of the NLP commu- United States 17% 35% 27% 19% + nity to this issue and promote mindful practice and Amazon-AWS 17% 24% 30% 26% + policy, we characterize the dollar cost and carbon Google 56% 14% 15% 10% + emissions that result from training the neural net- Microsoft 32% 23% 31% 10% + works at the core of many state-of-the-art NLP + models. We do this by estimating the kilowatts Table 2: Percent energy sourced from: Renewable (e.g. + of energy required to train a variety of popular hydro, solar, wind), natural gas, coal and nuclear for + off-the-shelf NLP models, which can be converted the top 3 cloud compute providers (Cook et al.,2017), + to approximate carbon emissions and electricity compared to the United States, 4 China 5 and Germany + costs. To estimate the even greater resources re- (Burger,2019). + quired to transfer an existing model to a new task + or develop new models, we perform a case study We estimate the total time expected for mod- + of the full computational resources required for the els to train to completion using training times and + development and tuning of a recent state-of-the-art hardware reported in the original papers. We then + NLP pipeline (Strubell et al.,2018). 
We conclude calculate the power consumption in kilowatt-hours + with recommendations to the community based on (kWh) as follows. Letpc be the average power + our findings, namely: (1) Time to retrain and sen- draw (in watts) from all CPU sockets during train- + sitivity to hyperparameters should be reported for ing, letpr be the average power draw from all + NLP machine learning models; (2) academic re- DRAM (main memory) sockets, letpg be the aver- + searchers need equitable access to computational age power draw of a GPU during training, and let + resources; and (3) researchers should prioritize de- gbe the number of GPUs used to train. We esti- + veloping efficient models and hardware. mate total power consumption as combined GPU, + CPU and DRAM consumption, then multiply this + 2 Methods by Power Usage Effectiveness (PUE), which ac- + counts for the additional energy required to sup-To quantify the computational and environmen- port the compute infrastructure (mainly cooling).tal cost of training deep neural network mod- We use a PUE coefficient of 1.58, the 2018 globalels for NLP, we perform an analysis of the en- average for data centers (Ascierto,2018). Then theergy required to train a variety of popular off- total powerpthe-shelf NLP models, as well as a case study of t required at a given instance during + training is given by:the complete sum of resources required to develop + LISA (Strubell et al.,2018), a state-of-the-art NLP 1.58t(pp c +pr +gp g ) + model from EMNLP 2018, including all tuning t = (1)1000 + and experimentation. The U.S. Environmental Protection Agency (EPA)We measure energy use as follows. We train the provides average COmodels described in§2.1using the default settings 2 produced (in pounds per + kilowatt-hour) for power consumed in the U.S.provided, and sample GPU and CPU power con- (EPA,2018), which we use to convert power tosumption during training. Each model was trained estimated COfor a maximum of 1 day. We train all models on 2 emissions: + + a single NVIDIA Titan X GPU, with the excep- CO 2 e = 0.954pt (2) + tion of ELMo which was trained on 3 NVIDIA This conversion takes into account the relative pro-GTX 1080 Ti GPUs. While training, we repeat- portions of different energy sources (primarily nat-edly query the NVIDIA System Management In- ural gas, coal, nuclear and renewable) consumedterface 2 to sample the GPU power consumption to produce energy in the United States. Table2and report the average over all samples. To sample lists the relative energy sources for China, Ger-CPU power consumption, we use Intel’s Running many and the United States compared to the topAverage Power Limit interface. 3 + 5 U.S. Dept. of Energy:https://bit.ly/2JTbGnI + 2 nvidia-smi:https://bit.ly/30sGEbi 5 China Electricity Council; trans. China Energy Portal: + 3 RAPL power meter:https://bit.ly/2LObQhV https://bit.ly/2QHE5O3 three cloud service providers. The U.S. break- ence. Devlin et al.(2019) report that the BERT + down of energy is comparable to that of the most base model (110M parameters) was trained on 16 + popular cloud compute service, Amazon Web Ser- TPU chips for 4 days (96 hours). NVIDIA reports + vices, so we believe this conversion to provide a that they can train a BERT model in 3.3 days (79.2 + reasonable estimate of CO 2 emissions per kilowatt hours) using 4 DGX-2H servers, totaling 64 Tesla + hour of compute energy used. V100 GPUs (Forster et al.,2019). + GPT-2. 
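+
+ Putting equations (1) and (2) together, a minimal Python sketch of this estimate is shown
+ below; the power-draw values in the usage lines are purely illustrative, not measurements.
+
+     def estimate_kwh(hours, cpu_watts, dram_watts, gpu_watts, num_gpus, pue=1.58):
+         """Total energy (kWh) as in equation (1): PUE times summed average draw."""
+         return pue * hours * (cpu_watts + dram_watts + num_gpus * gpu_watts) / 1000.0
+
+     def estimate_co2e_lbs(kwh, lbs_per_kwh=0.954):
+         """CO2 emissions (lbs) as in equation (2), using the EPA U.S. average."""
+         return lbs_per_kwh * kwh
+
+     # Illustrative 8-GPU job lasting 84 hours with hypothetical power draws.
+     kwh = estimate_kwh(hours=84, cpu_watts=150, dram_watts=50,
+                        gpu_watts=160, num_gpus=8)
+     co2e = estimate_co2e_lbs(kwh)
+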
This model is the latest edition of + 2.1 Models OpenAI’s GPT general-purpose token encoder, + We analyze four models, the computational re- also based on Transformer-style self-attention and + quirements of which we describe below. All mod- trained with a language modeling objective (Rad- + els have code freely available online, which we ford et al.,2019). By training a very large model + used out-of-the-box. For more details on the mod- on massive data,Radford et al.(2019) show high + els themselves, please refer to the original papers. zero-shot performance on question answering and + language modeling benchmarks. The large modelTransformer. The Transformer model (Vaswani described inRadford et al.(2019) has 1542M pa-et al.,2017) is an encoder-decoder architecture rameters and is reported to require 1 week (168primarily recognized for efficient and accurate ma- hours) of training on 32 TPUv3 chips. 6 chine translation. The encoder and decoder each + consist of 6 stacked layers of multi-head self- + attention. Vaswani et al.(2017) report that the 3 Related work + Transformerbasemodel (65M parameters) was + trained on 8 NVIDIA P100 GPUs for 12 hours, There is some precedent for work characterizing + and the Transformerbigmodel (213M parame- the computational requirements of training and in- + ters) was trained for 3.5 days (84 hours; 300k ference in modern neural network architectures in + steps). This model is also the basis for recent the computer vision community.Li et al.(2016) + work on neural architecture search (NAS) for ma- present a detailed study of the energy use required + chine translation and language modeling (So et al., for training and inference in popular convolutional + 2019), and the NLP pipeline that we study in more models for image classification in computer vi- + detail in§4.2(Strubell et al.,2018). So et al. sion, including fine-grained analysis comparing + (2019) report that their full architecture search ran different neural network layer types. Canziani + for a total of 979M training steps, and that their et al.(2016) assess image classification model ac- + base model requires 10 hours to train for 300k curacy as a function of model size and gigaflops + steps on one TPUv2 core. This equates to 32,623 required during inference. They also measure av- + hours of TPU or 274,120 hours on 8 P100 GPUs. erage power draw required during inference on + GPUs as a function of batch size. Neither work an-ELMo. The ELMo model (Peters et al.,2018) alyzes the recurrent and self-attention models thatis based on stacked LSTMs and provides rich have become commonplace in NLP, nor do theyword representations in context by pre-training on extrapolate power to estimates of carbon and dol-a large amount of data using a language model- lar cost of training.ing objective. Replacing context-independent pre- + trained word embeddings with ELMo has been Analysis of hyperparameter tuning has been + shown to increase performance on downstream performed in the context of improved algorithms + tasks such as named entity recognition, semantic for hyperparameter search (Bergstra et al.,2011; + role labeling, and coreference.Peters et al.(2018) Bergstra and Bengio,2012;Snoek et al.,2012). To + report that ELMo was trained on 3 NVIDIA GTX our knowledge there exists to date no analysis of + 1080 GPUs for 2 weeks (336 hours). 
the computation required for R&D and hyperpa- + rameter tuning of neural network models in NLP.BERT.The BERT model (Devlin et al.,2019) pro- + vides a Transformer-based architecture for build- + ing contextual representations similar to ELMo, 6 Via the authorson Reddit. + 7 GPU lower bound computed using pre-emptible but trained with a different language modeling ob- P100/V100 U.S. resources priced at $0.43–$0.74/hr, upper + jective. BERT substantially improves accuracy on bound uses on-demand U.S. resources priced at $1.46– + tasks requiring sentence-level representations such $2.48/hr. We similarly use pre-emptible ($1.46/hr–$2.40/hr) + and on-demand ($4.50/hr–$8/hr) pricing as lower and upper as question answering and natural language infer- bounds for TPU v2/3; cheaper bulk contracts are available. Model Hardware Power (W) Hours kWh·PUE CO 2 e Cloud compute cost + Transformer base P100x8 1415.78 12 27 26 $41–$140 + Transformer big P100x8 1515.43 84 201 192 $289–$981 + ELMo P100x3 517.66 336 275 262 $433–$1472 + BERT base V100x64 12,041.51 79 1507 1438 $3751–$12,571 + BERT base TPUv2x16 — 96 — — $2074–$6912 + NAS P100x8 1515.43 274,120 656,347 626,155 $942,973–$3,201,722 + NAS TPUv2x1 — 32,623 — — $44,055–$146,848 + GPT-2 TPUv3x32 — 168 — — $12,902–$43,008 + + Table 3: Estimated cost of training a model in terms of CO 2 emissions (lbs) and cloud compute cost (USD). 7 Power + and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware. + + + 4 Experimental results Estimated cost (USD) + Models Hours Cloud compute Electricity4.1 Cost of training 1 120 $52–$175 $5Table3lists CO 2 emissions and estimated cost of 24 2880 $1238–$4205 $118training the models described in§2.1. Of note is 4789 239,942 $103k–$350k $9870that TPUs are more cost-efficient than GPUs on + workloads that make sense for that hardware (e.g. Table 4: Estimated cost in terms of cloud compute and + BERT). We also see that models emit substan- electricity for training: (1) a single model (2) a single + tial carbon emissions; training BERT on GPU is tune and (3) all models trained during R&D. + roughly equivalent to a trans-American flight.So + et al.(2019) report that NAS achieves a new state- about 60 GPUs running constantly throughout theof-the-art BLEU score of 29.7 for English to Ger- 6 month duration of the project. Table4lists upperman machine translation, an increase of just 0.1 and lower bounds of the estimated cost in termsBLEU at the cost of at least $150k in on-demand of Google Cloud compute and raw electricity re-compute time and non-trivial carbon emissions. quired to develop and deploy this model. 9 We see + that while training a single model is relatively in-4.2 Cost of development: Case study expensive, the cost of tuning a model for a newTo quantify the computational requirements of dataset, which we estimate here to require 24 jobs,R&D for a new model we study the logs of or performing the full R&D required to developall training required to develop Linguistically- this model, quickly becomes extremely expensive.Informed Self-Attention (Strubell et al.,2018), a + multi-task model that performs part-of-speech tag- 5 Conclusions + ging, labeled dependency parsing, predicate detec- + tion and semantic role labeling. This model makes Authors should report training time and + for an interesting case study as a representative sensitivity to hyperparameters. + NLP pipeline and as a Best Long Paper at EMNLP. 
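+
+ For reference, the cloud compute ranges in Table 3 follow from multiplying total
+ accelerator-hours by the hourly rates in footnote 7; a small sketch of that arithmetic for
+ the Transformer (big) row:
+
+     def cloud_cost_bounds(gpu_hours, low_rate, high_rate):
+         """Lower/upper cloud cost: accelerator-hours times the cheapest
+         (pre-emptible) and most expensive (on-demand) hourly rates."""
+         return gpu_hours * low_rate, gpu_hours * high_rate
+
+     # 8 P100s for 84 hours at $0.43/hr (pre-emptible) and $1.46/hr (on-demand)
+     # gives roughly (288.96, 981.12), i.e. the $289-$981 entry in Table 3.
+     low, high = cloud_cost_bounds(8 * 84, 0.43, 1.46)
+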
Our experiments suggest that it would be benefi- + Model training associated with the project cial to directly compare different models to per- + spanned a period of 172 days (approx. 6 months). form a cost-benefit (accuracy) analysis. To ad- + During that time 123 small hyperparameter grid dress this, when proposing a model that is meant + searches were performed, resulting in 4789 jobs to be re-trained for downstream use, such as re- + in total. Jobs varied in length ranging from a min- training on a new domain or fine-tuning on a new + imum of 3 minutes, indicating a crash, to a maxi- task, authors should report training time and com- + mum of 9 days, with an average job length of 52 putational resources required, as well as model + hours. All training was done on a combination of sensitivity to hyperparameters. This will enable + NVIDIA Titan X (72%) and M40 (28%) GPUs. 8 direct comparison across models, allowing subse- + The sum GPU time required for the project quent consumers of these models to accurately as- + totaled 9998 days (27 years). This averages to sess whether the required computational resources + 8 We approximate cloud compute cost using P100 pricing. 9 Based on average U.S cost of electricity of $0.12/kWh. are compatible with their setting. More explicit half the estimated cost to use on-demand cloud + characterization of tuning time could also reveal GPUs. Unlike money spent on cloud compute, + inconsistencies in time spent tuning baseline mod- however, that invested in centralized resources + els compared to proposed contributions. Realiz- would continue to pay off as resources are shared + ing this will require: (1) a standard, hardware- across many projects. A government-funded aca- + independent measurement of training time, such demic compute cloud would provide equitable ac- + as gigaflops required to convergence, and (2) a cess to all researchers. + standard measurement of model sensitivity to data + and hyperparameters, such as variance with re- Researchers should prioritize computationally + spect to hyperparameters searched. efficient hardware and algorithms. + We recommend a concerted effort by industry and + Academic researchers need equitable access to academia to promote research of more computa- + computation resources. tionally efficient algorithms, as well as hardware + that requires less energy. An effort can also beRecent advances in available compute come at a made in terms of software. There is already ahigh price not attainable to all who desire access. precedent for NLP software packages prioritizingMost of the models studied in this paper were de- efficient models. An additional avenue throughveloped outside academia; recent improvements in which NLP and machine learning software de-state-of-the-art accuracy are possible thanks to in- velopers could aid in reducing the energy asso-dustry access to large-scale compute. ciated with model tuning is by providing easy-Limiting this style of research to industry labs to-use APIs implementing more efficient alterna-hurts the NLP research community in many ways. tives to brute-force grid search for hyperparameterFirst, it stifles creativity. Researchers with good tuning, e.g. random or Bayesian hyperparameterideas but without access to large-scale compute search techniques (Bergstra et al.,2011;Bergstrawill simply not be able to execute their ideas, and Bengio,2012;Snoek et al.,2012). Whileinstead constrained to focus on different prob- software packages implementing these techniqueslems. 
Second, it prohibits certain types of re- do exist, 10 they are rarely employed in practicesearch on the basis of access to financial resources. for tuning NLP models. This is likely becauseThis even more deeply promotes the already prob- their interoperability with popular deep learninglematic “rich get richer” cycle of research fund- frameworks such as PyTorch and TensorFlow ising, where groups that are already successful and not optimized, i.e. there are not simple exam-thus well-funded tend to receive more funding ples of how to tune TensorFlow Estimators usingdue to their existing accomplishments. Third, the Bayesian search. Integrating these tools into theprohibitive start-up cost of building in-house re- workflows with which NLP researchers and practi-sources forces resource-poor groups to rely on tioners are already familiar could have notable im-cloud compute services such as AWS, Google pact on the cost of developing and tuning in NLP.Cloud and Microsoft Azure. + While these services provide valuable, flexi- Acknowledgements + ble, and often relatively environmentally friendly We are grateful to Sherief Farouk and the anony- compute resources, it is more cost effective for mous reviewers for helpful feedback on earlieracademic researchers, who often work for non- drafts. This work was supported in part by theprofit educational institutions and whose research Centers for Data Science and Intelligent Infor-is funded by government entities, to pool resources mation Retrieval, the Chan Zuckerberg Initiativeto build shared compute centers at the level of under the Scientific Knowledge Base Construc-funding agencies, such as the U.S. National Sci- tion project, the IBM Cognitive Horizons Networkence Foundation. For example, an off-the-shelf agreement no. W1668553, and National ScienceGPU server containing 8 NVIDIA 1080 Ti GPUs Foundation grant no. IIS-1514053. Any opinions,and supporting hardware can be purchased for findings and conclusions or recommendations ex-approximately $20,000 USD. At that cost, the pressed in this material are those of the authors andhardware required to develop the model in our do not necessarily reflect those of the sponsor.case study (approximately 58 GPUs for 172 days) + would cost $145,000 USD plus electricity, about 10 For example, theHyperopt Python library. References Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt + Gardner, Christopher Clark, Kenton Lee, and LukeRhonda Ascierto. 2018.Uptime Institute Global Data Zettlemoyer. 2018. Deep contextualized word rep-Center Survey. Technical report, Uptime Institute. resentations. InNAACL. + Dzmitry Bahdanau, KyunghyunCho, and Yoshua Ben- + gio. 2015. Neural Machine Translation by Jointly Alec Radford, Jeffrey Wu, Rewon Child, David Luan, + Learning to Align and Translate. In3rd Inter- Dario Amodei, and Ilya Sutskever. 2019.Language + national Conference for Learning Representations models are unsupervised multitask learners. + (ICLR), San Diego, California, USA. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. + James Bergstra and Yoshua Bengio. 2012. Random 2012. Practical bayesian optimization of machine + search for hyper-parameter optimization.Journal of learning algorithms. InAdvances in neural informa- + Machine Learning Research, 13(Feb):281–305. tion processing systems, pages 2951–2959. + + James S Bergstra, R´emi Bardenet, Yoshua Bengio, and David R. So, Chen Liang, and Quoc V. Le. 2019. + Bal´azs K´egl. 2011. Algorithms for hyper-parameter The evolved transformer. 
InProceedings of the + optimization. InAdvances in neural information 36th InternationalConference on Machine Learning + processing systems, pages 2546–2554. (ICML). + + Bruno Burger. 2019.Net Public Electricity Generation Emma Strubell, Patrick Verga, Daniel Andor, + in Germany in 2018. Technical report, Fraunhofer David Weiss, and Andrew McCallum. 2018. + Institute for Solar Energy Systems ISE. Linguistically-Informed Self-Attention for Se- + mantic Role Labeling. InConference on Empir-Alfredo Canziani, Adam Paszke, and Eugenio Culur- ical Methods in Natural Language Processingciello. 2016. An analysis of deep neural network (EMNLP), Brussels, Belgium. models for practical applications . + Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobGary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Uszkoreit, Llion Jones, Aidan N Gomez, LukaszDeans, Brian Johnson, Elizabeth Jardim, and Brian Kaiser, and Illia Polosukhin. 2017. Attention is allJohnson. 2017. Clicking Clean: Who is winning you need. In31st Conference on Neural Informationthe race to build a green internet?Technical report, Processing Systems (NIPS).Greenpeace. + Jacob Devlin, Ming-Wei Chang, Kenton Lee, and + Kristina Toutanova. 2019. BERT: Pre-training of + Deep Bidirectional Transformers for Language Un- + derstanding. InNAACL. + Timothy Dozat and Christopher D. Manning. 2017. + Deep biaffine attention for neural dependency pars- + ing. InICLR. + EPA. 2018. Emissions & Generation Resource Inte- + grated Database (eGRID). Technical report, U.S. + Environmental Protection Agency. + Christopher Forster, Thor Johnsen, Swetha Man- + dava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie + Bernauer, Allison Gray, Sharan Chetlur, and Raul + Puri. 2019. BERT Meets GPUs. Technical report, + NVIDIA AI. + Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. + 2016. Evaluating the energy efficiency of deep con- + volutional neural networks on cpus and gpus.2016 + IEEE International Conferences on Big Data and + Cloud Computing (BDCloud), Social Computing + and Networking (SocialCom), Sustainable Comput- + ing and Communications (SustainCom) (BDCloud- + SocialCom-SustainCom), pages 477–484. + Thang Luong, Hieu Pham, and Christopher D. Man- + ning. 2015.Effective approaches to attention-based + neural machine translation. InProceedings of the + 2015 Conference on Empirical Methods in Natural + Language Processing, pages 1412–1421. Associa- + tion for Computational Linguistics. \ No newline at end of file diff --git a/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt b/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt new file mode 100644 index 0000000..e2f2323 --- /dev/null +++ b/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt @@ -0,0 +1,793 @@ + IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 1381 + Finite-Element Neural Networks for Solving + Differential Equations + Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE + + Abstract—The solution of partial differential equations (PDE) + arises in a wide variety of engineering problems. Solutions to most + practical problems use numerical analysis techniques such as fi- + nite-element or finite-difference methods. The drawbacks of these + approaches include computational costs associated with the mod- + eling of complex geometries. 
This paper proposes a finite-element + neural network (FENN) obtained by embedding a finite-element + model in a neural network architecture that enables fast and ac- + curate solution of the forward problem. Results of applying the + FENN to severalsimpleelectromagnetic forward and inverseprob- + lems are presented. Initial results indicate that the FENN perfor- + mance as a forward model is comparable to that of the conven- + tional finite-element method (FEM). The FENN can also be used + in an iterative approach to solve inverse problems associated with Fig. 1. Iterative inversion method for solving inverse problems. the PDE. Results showing the ability of the FENN to solve the in- + verse problem given the measured signal are also presented. The + parallel nature of the FENN also makes it an attractive solution resulting in the corresponding solution to the forward problem + for parallel implementation in hardware and software. . The model output is compared to the measurement , + Index Terms—Finite-element method (FEM), finite-element using a cost function .If is less than a toler- + neural network (FENN), inverse problems. ance, the estimateis used as the desired solution. If not, + is updated to minimize the cost function. + S I. I Although finite-element methods (FEMs) [3], [4] are ex- NTRODUCTION tremely popular for solving differential equations, their majorOLUTIONS of differential equations arise in a widedrawback is computational complexity. This problem becomesvariety of engineering applications in electromagnetics,more acute when three-dimensional (3-D) finite-elementsignal processing, computational fluid dynamics, etc. Thesemodels are used in an iterative algorithm for solving the inverseequations are typically solved using either analytical or numer-problem. Recently, several authors have suggested the use ofical methods. Analytical solution methods are however feasibleneural networks (MLP or RBF networks [5]) for solving differ-only for simple geometries, which limits their applicability. Inential equations [6]–[9]. In these techniques, a neural networkmost practical problems with complex boundary conditions,is trained using a large database containing the input data andnumerical analysis methods are required in order to obtain athe solution of the differential equation. The neural networkreasonable solution. An example is the solution of Maxwell’sduring generalization learns the mapping corresponding toequations in electromagnetics. Solutions to Maxwell’s equa-the PDE. Alternatively, in [10], the solution to a differentialtions are used in a variety of applications for calculating theequation is written as a constant term, and an adjustable term interaction of electromagnetic (EM) fields with different typeswith parameters that need to be determined. A neural networkof media. is used to determine the optimal values of the parameters.Very often, the solution to differential equations is necessaryThis approach is applicable only to problems with regularfor solving the corresponding inverse problems. Inverse prob-boundaries. An extension of the approach to problems withlems in general are ill-posed, lacking continuous dependence ofirregular boundaries is given in [11]. Other neural networkthe measurements on the input. 
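+
+ A minimal sketch of the iterative inversion scheme of Fig. 1 is given below; forward_model,
+ cost and update are hypothetical stand-ins for the physics model, the cost function and the
+ update rule, and are supplied by the application.
+
+     def iterative_inversion(forward_model, measurement, cost, update,
+                             initial_estimate, tolerance, max_iterations=100):
+         """Refine the estimate until the forward-model output matches the
+         measurement to within the given tolerance (cf. Fig. 1)."""
+         estimate = initial_estimate
+         for _ in range(max_iterations):
+             prediction = forward_model(estimate)       # solve the forward problem
+             if cost(prediction, measurement) < tolerance:
+                 break                                  # accept the estimate
+             estimate = update(estimate, prediction, measurement)  # reduce the cost
+         return estimate
+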
This has resulted in the devel-based differential equation solvers use multilayer perceptronopment of a variety of solution techniques ranging from simplenetworks or variations on the MLP to approximate the unknowncalibration procedures to other direct (analytical) and iterativefunction in a PDE [12]–[14]. A combination of the PDE andapproaches [1]. Iterative methods typically employ a forwardboundary conditions is used to construct an objective functionmodel that simulates the underlying physical process (Fig. 1)that is minimized during the training process.[2]. An initial estimate of the solution of the inverse problem A major limitation of these approaches is that the network ar- (represented byin Fig. 1) is applied to the forward model,chitecture is selected somewhat arbitrarily. A second drawback + is that the performance of the neural networks depends on the + Manuscript received January 17, 2004; revised April 2, 2005. data used in training and testing. As long the test data is sim- + The authors are with the Department of Electrical and Computer Engi- ilar to the training data, the network can interpolate between the neering, Michigan State University, East Lansing, MI 48824 USA (e-mail: training data points to obtain a reasonable prediction. However, rpradeep@egr.msu.edu; udpal@egr.msu.edu; udpa@egr.msu.edu). + Digital Object Identifier 10.1109/TNN.2005.857945 when the test signal is no longer similar to the training data, the + 1045-9227/$20.00 © 2005 IEEE 1382 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + network is forced to extrapolate and the performance degrades. Section V draws conclusions from the results and presents + One way around this difficulty is to ensure that the training data- ideas for future work. + base has a diverse set of signals. However, this is difficult to + ensure in practice. Alternatively, we have to design neural net- II. T HE FENN + works that are capable of extrapolation. Extrapolation methods This section briefly describes the FEM and proposes its refor-are discussed extensively in literature [15]–[18], but the design mulation into a parallel neural network structure. Details aboutof an extrapolation neural network involves several issues par- the FEM can be found in [3] and [4].ticularly for ensuring that the error in the network prediction + stays within reasonable bounds during the extrapolation proce- A. The FEMdure. Consider a typical boundary value problem with the gov-An ideal solution to this problem would be to combine the erning differential equationpower of numerical models with the computational speed of + neural networks, i.e., to embed a numerical model in a neural (1)network structure. One suchfinite-element neural network + (FENN) formulation has been reported by Takeuchi and Kosugi where is a differential operator, is the applied source or + [19]. This approach, based on error minimization, derives the forcing function, and is the unknown quantity. This differen- + neural network using the energy functional resulting from the tial equation can be solved in conjunction with boundary condi- + finite-element formulation. Other reports of FENN combina- tionson theboundary enclosingthedomain .Thevariational + tions are either similar to the Takeuchi method [20], [21] or use formulation used infinite-element analysis determines the un- + Hopfield neural networks to solve the forward problem [22], known by minimizing the functional [3], [4] + [23]. 
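+
+ To make the objective used by these MLP-based solvers concrete, a minimal PyTorch sketch
+ is shown below; the example equation u''(x) = -pi^2 sin(pi x) with u(0) = u(1) = 0, the
+ network size and the optimizer settings are illustrative choices, not those of [12]-[14].
+
+     import math
+     import torch
+
+     net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
+                               torch.nn.Linear(32, 1))
+     opt = torch.optim.Adam(net.parameters(), lr=1e-3)
+
+     for _ in range(2000):
+         x = torch.rand(128, 1, requires_grad=True)
+         u = net(x)
+         du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
+         d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
+         # Objective = PDE residual plus a penalty enforcing the boundary conditions.
+         residual = d2u + math.pi ** 2 * torch.sin(math.pi * x)
+         boundary = net(torch.zeros(1, 1)) ** 2 + net(torch.ones(1, 1)) ** 2
+         loss = residual.pow(2).mean() + boundary.sum()
+         opt.zero_grad()
+         loss.backward()
+         opt.step()
+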
Kalkkuhlet al.[24] provide a description of a FEM-based + approach to NARX modeling that may be interpreted both as (2) + a local model network, as well as a single layer feedforward + network. A slightly different approach to merging numerical with respect to the trial function . The minimization procedure + methods and neural networks is given in [25], where thefi- starts by dividing into small subdomains called elements + nite-difference time domain (FDTD) method is cast in a neural (Fig. 2) and representing in each element by means of basis + network framework for the purpose of solving electromagnetic functions defined over the element + forward problems. The related problem of mesh generation + infinite-element models has also been tackled using neural (3)networks (for instance, [26]). Generally, these networks are + designed to solve the forward problem, and must be modified + to solve inverse problems. where is the unknown solution in element , is the basis + This paper proposes a new approach that embeds afinite-ele- function associated with node in element , is the value + ment model commonly used in the solution of differential equa- of the unknown quantity at node and is the total number of + tions in a neural network. The network, called the FENN, can nodes associated with element . In general, the basis functions + solve the forward problem and can also be used in an itera- (also referred to as interpolation functions or shape functions) + tive algorithm to solve inverse problems. The primary advan- can be linear, quadratic, or of higher order. Typically,finite-el- + tage of this approach is that the FEM is represented in a parallel ement models use either linear or polynomial spline basis func- + form. Thus, it has the potential to alleviate the computational tions. + cost associated with using the FEM in an iterative algorithm The functional within an element is expressed as + for solving inverse problems. More importantly, the FENN does + not need any training, and the computation of the weights is (4) + a one-time process. The proposed approach is also different in + that the neural network architecture developed can be used to + solve the forward and inverse problems. The structure of the By substituting (3) in (4), we obtain the discrete version of the + neural network is also simpler than those reported in the litera- functional within each element + ture, making it easier to implement in parallel in both hardware (5)and software. + The rest of this paper is organized as follows. Section II where is the transpose of a matrix, is the ele-briefly describes the FEM, and derives the proposed FENN. In mental matrix with elements this paper, we focus on the problem of solving typical equa- + tions encountered in electromagnetic nondestructive evaluation (6)(NDE). However, the same concepts can be easily applied + to solve differential equations encountered in otherfields. + Sections III, IV and V present the application of the FENN and is an vector with elements + to solving forward and inverse problems, along with initial + results. A discussion of the advantages and disadvantages of (7) + the proposed FENN architecture is given in Section IV. Finally, RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1383 + + + Combining the values in (5) for each of the elements + + (8) + + where is the global matrix derived from the terms + of the elemental matrices for different elements, and is the + total number of nodes. 
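+
+ As a preview of the assembly step described next, a minimal NumPy sketch shows how element
+ matrices are added into a global matrix according to the nodes that form each element; the
+ two-element 1-D mesh and its element matrix below are hypothetical.
+
+     import numpy as np
+
+     def assemble_global(element_matrices, connectivity, num_nodes):
+         """Add each element matrix entry into the rows/columns of the global
+         nodes of that element (the combination leading to the global matrix)."""
+         K = np.zeros((num_nodes, num_nodes))
+         for K_e, nodes in zip(element_matrices, connectivity):
+             for a, i in enumerate(nodes):
+                 for b, j in enumerate(nodes):
+                     K[i, j] += K_e[a, b]
+         return K
+
+     # Two 1-D elements sharing node 1, each with the same 2x2 element matrix.
+     K_e = np.array([[1.0, -1.0], [-1.0, 1.0]])
+     K = assemble_global([K_e, K_e], [(0, 1), (1, 2)], num_nodes=3)
+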
, also called the stiffness matrix, is a + sparse, banded matrix. Equation (8) is the discrete version of + the functional and can be minimized with respect to the nodal + parameters by taking the derivative of with respect to and + setting it equal to zero, which results in the matrix equation Fig.2. (a)Schematicrepresentationofdomainandboundary. (b)SampleFEM + mesh for the domain. + (9) + + Boundary conditions for these problems are usually of two + types: natural boundary conditions and essential boundary + conditions. Essential boundary conditions (also referred to as + Dirichlet boundary conditions) impose constraints on the value + of the unknown at several nodes. Natural boundary condi- + tions (of which Neumann boundary conditions are a special + case) impose constraints on the change in across a boundary. + Dirichlet boundary conditions are imposed on the functional + minimization (9), by deleting the rows and columns of the + matrix corresponding to the nodes on the Dirichlet boundary + and modifying in (9). Fig. 3. FEM domain discretization using two elements and four nodes. + Natural boundary conditions are applied in the FEM by + adding an additional term to the functional. These boundary This process ensures that natural boundary conditions are im-conditions are then incorporated into the functional and are plicitlyandautomatically satisfiedduring theFEMsolutionpro-satisfied automatically during the solution procedure. As an cedure.example, consider the natural boundary condition represented + by the following equation [3] B. The FENN + on (10) This section describes how thefinite-element model can be + converted intoa parallel network form. Wefocus on solving typ- + where represents the Neumann boundary, is its outward ical inverse problems arising in electromagnetic NDE, but the + normal unit vector, is some constant, and , , and are basicideaisapplicabletootherareas aswell.NDEinverseprob- + known parameters associated with the boundary. Assuming that lems can be formulated as the problem offinding the material + the boundary is made up of segments, we can define properties (such as the conductivity or the permeability) within + boundary matrices and with elements the domain of the problem. Since the domain is discretized in + the FEM method by a large number of elements, the problem + can be posed as one offinding the material properties in each + of these elements. These properties are usually embedded in the + differential operator , or equivalently, in the global matrix . + Thus, in order to be able to iteratively estimate these properties + from the measurements, the material properties need to be sep- + arated out from . This separation is easier to achieve at the + element matrix level. For nodes and in element + (11) + + where are basis functions defined over segment and is + the length of the segment. The elements of are added to the + elementsof that correspond tothe nodeson the boundary . + Similarly, the elements of are added to the corresponding + elements of . The global matrix (9) is thus modified as follows + before solving for (13) + + where is the parameter representing the material property(12) in element and represents the differential operator at the 1384 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 4. FENN. + + + element level without embedded in it. Substituting (13) into neurons, corresponding to the members of the global ma- + the functional, we get trix . 
The output of each group of hidden layer neurons is the + corresponding row vector of . The weights from the input to + the hidden layer are set to the appropriate values of . Each(14) neuron in the hidden layer acts as a summation unit, (equivalent + toasummationfollowedbyalinearactivationfunction[5]).The + If we define outputs of the hidden layer neurons are the elements of the + global matrix as given in (15). + (15) Each group of hidden neurons is connected to one output + neuron (giving a total of output neurons) by a set of weights + , with each element of representing the nodal values .where Note that the set of weights between thefirst group of hidden + neurons and thefirst output neuron are the same as the set of(16)else weights between the second group of hidden neurons and the + second output neuron (as well as between successive groups + of hidden neurons and the corresponding output neuron). Each + output neuron is also a summation unit followed by a linear ac- + tivation function, and the output of each neuron is equal to : + + + (18) + (17) + + where the second part of (18) is obtained by using (15). As an + Equation (17) expresses the functional explicitly in terms of . example, the FENN architecture for a two-element, four-node + The assumption that is constant within each element is im- FEM mesh (Fig. 3) is shown in Fig. 4. In this + plicit in this expression. This assumption is usually satisfied in case, the FENN has two input neurons, 16 hidden layer neurons + problems in NDE where each element in the FEM mesh is de- and four output neurons. Thefigure illustrates the grouping of + fined within the confines of a domain, and at no time does a the hidden layer neurons, as well as the similarity inherent in + single element cross domain boundaries. Furthermore, each el- the weights that connect each group of hidden layer neurons + ement is small enough that minor variations in within an el- to the corresponding output neuron. To simplify thefigure, the + ement may be ignored. Equation (17) can be easily converted weights between the network input and hidden layer neurons + into a parallel network form. The neural network comprises an are depicted by means of vectors (for + input, output and hidden layer. In the general case with el- , 2, 3, 4 and , 2), where the individual weight values + ements and nodes in the FEM mesh, the input layer with are defined as in (16). + network inputs takes the values in each element as input. 1) Boundary Conditions in the FENN: Note that the ele- + The hidden layer has neurons 1 arranged in groups of ments of and in (11) do not depend on the material prop- + 1 erties . and need to be added appropriately to the global In this paper, we use the term“neurons”in the FENN (in the hidden and + output layers) to avoid confusion with the nodes in afinite-element mesh. matrix and the source vector as shown in (12). Equation RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1385 + + + + + + + + + + + + + + + + + + + + + Fig. 5. Geometry of mesh for 1-D FEM. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 6. Flowchart (with example) for designing the FENN for a general PDE. + + + (12) thus implies that natural boundary conditions can be ap- layer neurons. 
These weights will be referred to as the clamped + plied in the FENN as bias inputs to the hidden layer neurons weights, while the remaining weights will be referred to as the + that are a part of the boundary, and the corresponding output free weights. An example of these weights is presented later. + neurons. Dirichlet boundary conditions are applied by clamping The FENN architecture was derived without consideration of + the corresponding weights between the hidden layer and output the dimensionality of the problem at hand, and thus can be used 1386 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + for 1-, 2-, 3-, or higher dimensional problems. The number of + nodes and elements in the FEM mesh dictates the number of + neurons in the different layers. The weights between the input + and hidden layer change depending on node-element connec- + tivity information. + The major drawback of the FENN is the number of neurons + and weights necessary. However, the memory requirements can + be reduced considerably, since most of the weights between the + input and hidden layer are zero. These weights, and the corre- + sponding connections, can be discarded. Similarly, most of the Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b) + elements of the matrix are also zero ( is a banded ma- Problem description using symmetry considerations. + trix). The corresponding neurons in the hidden layer can also + be discarded, reducing memory and computation requirements The network implementation of (23) can be derived as fol- + considerably. Furthermore, the weights between each group of lows. If and values at each element are the inputs to the + hidden layer neurons and the output layer are the same . network, , , , and form the weights + Weight-sharing approaches can be used here to further reduce between the input and hidden layers. The network thus uses + the storage requirements. inputneuronsand hiddenneurons.Thevaluesof ateachof + thenodesareassigned asweightsbetweenthehidden andoutput + C. A 1-D Example layers, and the source is the desired output of this network + Consider the 1-D equation (corresponding to the output neurons). Dirichlet boundary + conditions on are applied as explained earlier. + + (19) D. General Case + Fig. 6 shows aflowchart of the general scheme for convertingboundary conditions on the boundary defined by . a differential equation into the FENN structure. An exampleand are constants depending on the material and is the in two dimensions is also provided next to theflowchart. Weapplied source. Laplace’s equation and Poisson’s equation are start with the differential equation and the boundary conditionsspecial cases of this equation. The FENN formulation for this and formulate the FEM using the variational method. This in-problem starts by discretizing the domain of interest with el- volves discretizing the domain of interest with elements andements and nodes. In one dimension, each element is defined nodes, selecting basis functions, writing the functional forby two nodes (Fig. 5). Define basis functions and over each element and obtaining the element matrices and the sourceeach element and let is the value of on node in element vector. The example presented uses the FEM mesh shown in. An example of the basis functions is shown in Fig. 5. Fig. 3, with elements, and nodes, and linearFor these basis functions, i.e., basis functions. 
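For concreteness, here is a minimal numerical sketch of the 1-D discretization just described. The exact expressions of (19)-(23) were lost in this copy, so the element matrices below use the standard linear-element form for -d/dx(alpha dphi/dx) + beta*phi = f from the FEM literature (e.g., [3]); that form, and the row/column elimination used for the Dirichlet condition, are stated here as assumptions rather than quoted from the paper, and all function names are illustrative. The final solve corresponds to the conventional matrix equation (9).

```python
import numpy as np

def element_matrices(alpha_e, beta_e, f_e, h_e):
    """Element matrix and source vector for one two-node linear element of
    length h_e, for -d/dx(alpha dphi/dx) + beta*phi = f (assumed form of
    (19)-(22); alpha, beta, f constant within the element)."""
    Ke = (alpha_e / h_e) * np.array([[1.0, -1.0], [-1.0, 1.0]]) \
         + (beta_e * h_e / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])
    be = (f_e * h_e / 2.0) * np.array([1.0, 1.0])
    return Ke, be

def assemble(alpha, beta, f, x):
    """Add each element matrix into the rows/columns of its two nodes,
    giving the sparse tridiagonal global matrix of (23) and the source vector."""
    n = len(x)
    K, b = np.zeros((n, n)), np.zeros(n)
    for e in range(n - 1):                    # element e joins nodes e and e+1
        Ke, be = element_matrices(alpha[e], beta[e], f[e], x[e + 1] - x[e])
        K[np.ix_([e, e + 1], [e, e + 1])] += Ke
        b[[e, e + 1]] += be
    return K, b

def apply_dirichlet(K, b, node, value):
    """Essential (Dirichlet) condition phi[node] = value, imposed by
    eliminating the corresponding row and column as described earlier."""
    b -= K[:, node] * value
    K[node, :], K[:, node] = 0.0, 0.0
    K[node, node], b[node] = 1.0, value
    return K, b

# 10 elements on [0, 1]; unit alpha, zero beta and source, phi(0)=0, phi(1)=1.
x = np.linspace(0.0, 1.0, 11)
K, b = assemble(np.ones(10), np.zeros(10), np.zeros(10), x)
for node, value in [(0, 0.0), (10, 1.0)]:
    K, b = apply_dirichlet(K, b, node, value)
phi = np.linalg.solve(K, b)                   # conventional FEM solve of (9)
```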
The unknown solution to the differential equa- + tion is represented by its values at each of the nodes in the(20) finite-element mesh . The element matrices are then + separated into two parts, with one part dependent on the mate-the element matrices are given by [3] rial properties and while the other is independent of them. + The FENN is then designed to have input neurons, + hidden neurons, and output neurons, where is the number + of material property parameters. In the example under consid- + eration, , since we have two material property parameters(21) ( and ). Thefirst group of input neurons takes in the + values while the second group takes in the values in each ele- + ment. The weights from the input to the hidden layer are set to + the appropriate values of . In the example, since nodes 1, 2, + (22) and 3 are part of element 1 (see Fig. 3), the weights from thefirst + input node to thefirst group of four neurons in the hidden + Here, is the length of element . The global matrix is then layer are given by + constructed by selectively adding the element matrices based + on the nodes that form an element. Specifically, is a sparse + tridiagonal matrix, and its nonzero elements are given by (24) + + The last weight is zero since node 4 is not a part of element 1. + Each group of hidden neurons is connected to one output + neuron (giving a total of output neurons) by a set of weights + , with each element of representing the nodal values . The + (23) output of each neuron in the output layer is equal to . RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1387 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 8. Forward problem solutions for shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error + between (a) and (b). Thex- andy-axes show the nodes in the FEM discretization of the domain, and thez-axis in (c) shows the error at each of these nodes in volts. + + + + III. F ORWARD AND INVERSE PROBLEM FORMULATION USING where is the output of the FENN. Then, for a gradient- + FENN based approach, the gradients of the error with respect to the + free hidden layer weights is given by + + The FENN architecture and algorithm lends itself to solving (27)both the forward and inverse problems. The forward problem + involves determining the weights given the material parame- Equation (27) can be used to solve the forward problem. Sim-ters and and the applied source while the inverse problem ilarly, to solve the inverse problem, the gradients of the errorinvolves determining and given and . Any optimization with respect to and (input of the FENN) are necessary, andapproach can be used to solve both these problems. Suppose we are given bydefine the error at the output of the FENN as + + + + + (28) + + + + + (26) (29) 1388 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + + TABLE I + SUMMARY OF PERFORMANCE OF THE FENN A LGORITHM FOR VARIOUS PDE S + + + + + + + + + + + + + + + + + + + + + + + + + + + For the forward problem, such an approach is equivalent to the Dirichlet boundary, with on the microstrip and on + iterative approaches used to solve for the unknown nodal values the outer boundary [Fig. 7(b)]. Finally, there is no source term + in the FEM [4]. in this example (the source term would correspond to a charge + distribution in the domain of interest), i.e., . In this ex- + IV. R ESULTS ample, we assume that volts and . 
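The gradient expressions in (26)-(29) did not survive extraction, so the following sketch only illustrates the idea of Section III for the 1-D case above: the input-to-hidden weights are the alpha-independent element matrices of (13), the FENN output is K(alpha)*phi as in (18), and a squared output error is reduced by plain gradient descent, over the free nodal values phi for the forward problem or over the material inputs alpha for the inverse problem, with clamped (Dirichlet) entries held fixed. The squared-error choice, step sizes, and helper names are assumptions made for illustration, not the paper's notation.

```python
import numpy as np

def input_hidden_weights(x):
    """Alpha-independent element matrices W of (13), so K(alpha) = sum_e alpha[e]*W[e].
    (1-D linear elements; the beta term is omitted for brevity.)"""
    n = len(x)
    W = np.zeros((n - 1, n, n))
    for e in range(n - 1):
        h_e = x[e + 1] - x[e]
        W[e][np.ix_([e, e + 1], [e, e + 1])] = (1.0 / h_e) * np.array([[1.0, -1.0],
                                                                       [-1.0, 1.0]])
    return W

def fenn_output(W, alpha, phi):
    """FENN forward pass: the hidden layer forms K = sum_e alpha[e]*W[e] (the global
    matrix entries), the output layer multiplies by the nodal values phi, cf. (18)."""
    K = np.tensordot(alpha, W, axes=1)
    return K @ phi, K

def solve_forward(W, alpha, b, phi0, free, lr=1e-3, iters=20000):
    """Forward problem: gradient descent on the free nodal weights phi so that the
    FENN output matches the source vector b (error E = 0.5*||K phi - b||^2)."""
    phi = phi0.copy()
    for _ in range(iters):
        out, K = fenn_output(W, alpha, phi)
        phi[free] -= lr * (K.T @ (out - b))[free]   # clamped nodes stay fixed
    return phi

def solve_inverse(W, alpha0, phi, b, free, lr=1e-3, iters=20000):
    """Inverse problem: gradient descent on the material inputs alpha for given
    nodal values phi and source b; dE/dalpha_e = (W[e] @ phi) . (out - b)."""
    alpha = alpha0.copy()
    for _ in range(iters):
        out, _ = fenn_output(W, alpha, phi)
        alpha[free] -= lr * np.tensordot(W @ phi, out - b, axes=([1], [0]))[free]
    return alpha

# Step sizes are illustrative only; the paper notes that better optimizers
# (e.g., conjugate gradients) are expected to speed up convergence.
```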
Further, we + assume that the domain of interest is .A. Forward Model Results The solution to the forward problem is presented in Fig. 8, + The FENN was tested using both 1- and 2-D versions of with the FEM solution using 11 nodes in each direction shown + Poisson’s equation in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b). + + (30) Thesefigures show contours of constant potential. The error be- + tween the FEM and FENN solutions is presented in Fig. 8(c). As + where represents the material property, and is the applied seen from thefigure, the FENN is seen to match the FEM solu- + source. For instance, in electromagnetics may represent the tion accurately, with the peak error at any node on the order of + permittivity while represents the charge density. . + As thefirst example, consider the following 2-D equation Several other examples were also used to test the FENN and + the results are summarized in Table I. Column 1 shows the + (31) PDE used to evaluate the FENN performance, while column 2 + shows the boundary conditions used. The analytic solution to + with boundary conditions the problem is indicated in Column 3. The FENN structure and + + on (32) the number of iterations for convergence using a gradient de- + scent approach are indicated in Columns 4 and 5, respectively. + and The FENN structure, as explained earlier, has inputs, + hidden neurons and output neurons, where and are the + on (33) number of elements and nodes in the FEM mesh, respectively, + and is the number of hidden neurons, and corresponds to the + This is the governing equation for the shielded microstrip trans- number of nonzero elements in the FEM global matrix . Fi- + mission line problem shown in Fig. 7. The forward problem nally, Columns 6 and 7 present the sum-squared error (SSE) and + computes the electric potential due to the shielded microstrip the maximum error in the solution, respectively, where the er- + shown in Fig. 7(a). The potentials are zero on the shielding con- rors are computed with respect to the analytical solution. These + ductor.Sincethegeometryissymmetric,wecansolvetheequiv- results indicate that the FENN is capable of accurately deter- + alent problem shown in Fig. 7(b), by applying the homogeneous mining the potential . One advantage of the FENN approach + Neumann condition on the plane of symmetry. The inner con- is that the computation of the input-hidden layer weights is a + ductor (microstrip) is held at a constant potential of volts. one-time process, as long as the differential equation does not + Finally, we also assume that the material inside the shielding change. The only changes necessary to solve the different prob- + conductor has a permittivity , where K is a constant. The lems are changes in the input and the desired output . + permittivity in this case corresponds to the material property . + Specifically, and . The homogeneous Neu- B. Inverse Model Results + mann boundary condition is equivalent to setting . TheFENNwasalsousedtosolveseveralsimpleinverseprob- + The microstrip and the shielding conductor correspond to the lems based on (30). In all cases, the objective was to determine RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1389 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 9. FENN inversion results for Poisson’s equation with initial solutions (a) = x . (b) =1+ x . + + + the value of and for given values of and . 
Thefirst ex- In order to obtain a unique solution, we need to constrain the + ample is a 1-D problem that involves determining given value of at the boundary as well. Consider the same differen- + and , for the differential equation tial equation as (34), but with and specified as follows: + + (34) and + + with boundary conditions and . The analyt- (36) + ical solution to this inverse problem is The analytical solution for this equation is .To + and (35) solve this problem, we set and clamp the value of at + As seen from (35), the problem has an infinite number of solu- and as follows: , . + tions and we expect the solution procedure to converge to one The results of the constrained inversion obtained using 11 + of these solutions depending on the initial value. nodes and 10 elements in the correspondingfinite-element mesh + Fig. 9(a) and (b) shows two solutions to this inverse problem are shown in Fig. 10. Fig. 10(a) shows the comparison between + for two different initializations (shown using triangles). In both the analytical solution (solid line with squares) and the FENN + cases, the FENN solution (in stars) is seen to match the analyt- result (solid line with stars). The initial value of is shown in + ical solution (squares). The SSE in both cases was on the order thefigure as a dashed line. Fig. 10(b) shows the comparison + of . between the actual and desired forcing function at the FENN 1390 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 10. Constrained inversion result with eleven nodes. (a) Comparison of analytic and simulation results for . (b) Comparison of actual and desired NN outputs. + + + output. This result indicates that the SSE in the forcing function, weight structure that allows both the forward and inverse prob- + as well as the SSE in the inversion result, is fairly large (0.0148 lemstobesolvedusingsimplegradient-basedalgorithms.Initial + and 0.0197, respectively). The reason for this was traced back results indicate that the proposed FENN algorithm is capable of + to the mesh discretization. Fig. 11 shows the SSE in the output accurately solving both the forward and inverse problems. In + of the FENN and the SSE in the inverse problem solution as a addition, the forward problem solution from the FENN is seen + function of FEM discretization. It is seen that increasing the dis- to exactly match the FEM solution, indicating that the FENN + cretization significantly improves the solution. Similar results represents thefinite-element model exactly in a parallel config- + were observed for other problems. uration. + The major advantage of the FENN is that it represents the + finite-element model in a parallel form, enabling parallel imple- + V. D ISCUSSION AND CONCLUSION mentation in either hardware or software. Further, computing + gradients in the FENN is very simple. This is an advantage in + The FENN is closely related to thefinite-element model used solving bothforward and inverse problems using gradient-based + to solve differential equations. The FENN architecture has a methods. The gradients can also be computed in parallel and RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1391 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 11. SSE in FENN output and inversion results as a function of discretization. 
the lack of nonlinearities in the neuron activation functions makes the computation of gradients simpler. A major advantage of this approach for solving inverse problems is that it avoids inverting the global matrix in each iteration. The FENN also does not require any training, since most of its weights can be computed in advance and stored. The weights depend on the governing differential equation and its associated boundary conditions, and as long as these two factors do not change, the weights do not change. This is especially an advantage in solving inverse problems in electromagnetic NDE. This approach also reduces the computational effort associated with the network.

Future work will concentrate on applying the FENN to 3-D electromagnetic NDE problems. The robustness of the approach will also be tested, since the ability of these approaches to invert practical noisy measurements is important. Furthermore, the use of better optimization algorithms, like conjugate gradient methods, is expected to improve the solution speed. In addition, parallel implementation of the FENN in both hardware and software is under investigation. The approach described in this paper is very general in that it can be applied to a variety of inverse problems in fields other than electromagnetic NDE. Some of these other applications will also be investigated to show the general nature of the proposed method.

REFERENCES

[1] L. Udpa and S. S. Udpa, "Application of signal processing and pattern recognition techniques to inverse problems in NDE," Int. J. Appl. Electromagn. Mechan., vol. 8, pp. 99-117, 1997.
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. Sacks, "Iterative algorithms for electromagnetic NDE signal inversion," in ENDE '97, Reggio Calabria, Italy, Sep. 14-16, 1997.
[3] J. Jin, The Finite Element Method in Electromagnetics. New York: Wiley, 1993.
[4] P. Zhou, Numerical Analysis of Electromagnetic Fields. Berlin, Germany: Springer-Verlag, 1993.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1994.
[6] C. A. Jensen et al., "Inversion of feedforward neural networks: algorithms and applications," Proc. IEEE, vol. 87, no. 9, pp. 1536-1549, 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa, "Neural network algorithm for electromagnetic NDE signal inversion," in ENDE 2000, Budapest, Hungary, Jun. 2000.
[8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr., and A. P. Ewing, "Automation of SQUID nondestructive evaluation of steel plates by neural networks," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3475-3478, 1999.
[9] W. Qing, S. Xueqin, Y. Qingxin, and Y. Weili, "Using wavelet neural networks for the optimal design of electromagnetic devices," IEEE Trans. Magn., vol. 33, no. 2, pp. 1928-1930, 1997.
[10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, "Artificial neural networks for solving ordinary and partial differential equations," IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 987-1000, 1998.
[11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, "Neural-network methods for boundary value problems with irregular boundaries," IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1041-1049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez, "Neural network differential equation and plasma equilibrium solver," Phys. Rev. Lett., vol. 75, no. 20, pp. 3594-3597, 1995.
[13] M. W. M. G. Dissanayake and N. Phan-Thien, "Neural-network-based approximations for solving partial differential equations," Commun. Numer. Meth. Eng., vol. 10, pp. 195-201, 1994.
[14] R. Masuoka, "Neural networks learning differential data," IEICE Trans. Inform. Syst., vol. E83-D, no. 6, pp. 1291-1300, 2000.
[15] D. C. Youla, "Generalized image restoration by the method of alternating orthogonal projections," IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 694-702, 1978.
[16] D. C. Youla and H. Webb, "Image restoration by the method of convex projections: part I - theory," IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 81-94, 1982.
[17] A. Lent and H. Tuy, "An iterative method for the extrapolation of band-limited functions," J. Math. Analysis and Applicat., vol. 83, pp. 554-565, 1981.
[18] W. Chen, "A new extrapolation algorithm for band-limited signals using the regularization method," IEEE Trans. Signal Process., vol. 41, no. 3, pp. 1048-1060, 1993.
[19] J. Takeuchi and Y. Kosugi, "Neural network representation of the finite element method," Neural Netw., vol. 7, no. 2, pp. 389-395, 1994.
[20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady, "Artificial neural network application for material evaluation by electromagnetic methods," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4027-4032.
[21] G. Xu, G. Littlefair, R. Penson, and R. Callan, "Application of FE-based neural networks to dynamic problems," in Proc. Int. Conf. Neural Information Processing, vol. 3, 1999, pp. 1039-1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu, "Finite element analysis-based Hopfield neural network model for solving nonlinear electromagnetic field problems," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4399-4403.
[23] H. Lee and I. S. Kang, "Neural algorithm for solving differential equations," J. Computat. Phys., vol. 91, pp. 110-131, 1990.
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, "FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control," IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885-897, 1999.
[25] R. K. Mishra and P. S. Hall, "NFDTD concept," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484-490, 2005.
[26] D. G. Triantafyllidis and D. P. Labridis, "A finite-element mesh generator based on growing neural networks," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482-1496, 2002.

Lalita Udpa (S'84-M'86-SM'96) received the Ph.D. degree in electrical engineering from Colorado State University, Fort Collins, in 1986. She is currently a Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing. She works primarily in the broad areas of nondestructive evaluation, signal processing, and biomedical applications. Her research interests include various aspects of NDE, such as development of computational models for the forward problem in NDE, signal and image processing, pattern recognition and neural networks, and development of solution techniques for inverse problems. Her current projects include finite-element modeling of electromagnetic NDE phenomena, application of neural network and signal processing algorithms to NDE data, and development of image processing techniques for the analysis of NDE and biomedical images. Dr. Udpa is a Member of Eta Kappa Nu and Sigma Xi.

Satish S. Udpa (S'82-M'82-SM'91-F'03) received the B.Tech. degree in 1975 and the Post Graduate Diploma in electrical engineering in 1977 from J.N.T. University, Hyderabad, India. He received the M.S. degree in 1980 and the Ph.D. degree in electrical engineering in 1983, both from Colorado State University, Fort Collins. He has been with Michigan State University, East Lansing, since 2001 and is currently Acting Dean for the College of Engineering and a Professor with the Electrical and Computer Engineering Department. Prior to joining Michigan State, he was a Professor with Iowa State University, Ames, from 1990 to 2001 and was associated with the Materials Assessment Research Group. Prior to joining Iowa State, he was an Associate Professor with the Department of Electrical Engineering at Colorado State University. His research interests span the broad area of materials characterization and nondestructive evaluation (NDE). Work done by him to date in the area includes an extensive repertoire of forward models for simulating physical processes underlying several inspection techniques. Coupled with careful experimental work, such forward models can be used for designing new sensors, optimizing test conditions, estimating the probability of detection, assessing designs for inspectability, and training inverse models for characterizing defects. He has also been involved in the development of system-, as well as model-based, inverse solutions for defect and material property characterization. His interests have expanded in recent years to include the development of noninvasive tools for clinical applications. Work done to date in this field includes the development of new electromagnetic-acoustic (EMAT) methods for detecting single leg separation failures in artificial heart valves and microwave imaging and ablation therapy systems. He and his research group have been engaged in the design and development of high-performance instrumentation including acoustic microscopes and single- and multifrequency eddy current NDE instruments. These systems, as well as software packages embodying algorithms developed by Udpa for defect classification and characterization, have been licensed to industry. He is a Fellow of the American Society for Nondestructive Testing (ASNT) and a Fellow of the Indian Society of Nondestructive Testing.

Pradeep Ramuhalli (S'92-M'02) received the B.Tech. degree from J.N.T. University, Hyderabad, India, in electronics and communications engineering in 1995, and the M.S. and Ph.D. degrees in electrical engineering from Iowa State University, Ames, in 1998 and 2002, respectively. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing. His research is in the general area of nondestructive evaluation and materials characterization. His research interests include the application of signal and image processing methods, pattern recognition and neural networks for nondestructive evaluation applications, development of model-based solutions for inverse problems in NDE, and the development of information fusion algorithms for multimodal data fusion. Dr. Ramuhalli is a Member of Phi Kappa Phi.
\ No newline at end of file diff --git a/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt b/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt new file mode 100644 index 0000000..2c6c299 Binary files /dev/null and b/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt differ diff --git a/Corpus/Green AI - Roy Schwartz.txt b/Corpus/Green AI - Roy Schwartz.txt new file mode 100644 index 0000000..299197d Binary files /dev/null and b/Corpus/Green AI - Roy Schwartz.txt differ diff --git a/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt b/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt new file mode 100644 index 0000000..73d70e5 Binary files /dev/null and b/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt differ diff --git a/Corpus/Identity Mappings in Deep Residual Networks.txt b/Corpus/Identity Mappings in Deep Residual Networks.txt new file mode 100644 index 0000000..85ba774 Binary files /dev/null and b/Corpus/Identity Mappings in Deep Residual Networks.txt differ diff --git a/Corpus/Language Models are Few-Shot Learners.txt b/Corpus/Language Models are Few-Shot Learners.txt new file mode 100644 index 0000000..2b3bb92 Binary files /dev/null and b/Corpus/Language Models are Few-Shot Learners.txt differ diff --git a/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt b/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt new file mode 100644 index 0000000..a98b373 --- /dev/null +++ b/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt @@ -0,0 +1,399 @@ + Learning Efficient Convolutional Networks through Network Slimming + + + Zhuang Liu 1∗ Jianguo Li 2 Zhiqiang Shen 3 Gao Huang 4 Shoumeng Yan 2 Changshui Zhang 1 + 1 CSAI, TNList, Tsinghua University 2 Intel Labs China 3 Fudan University 4 Cornell University + {liuzhuangthu, zhiqiangshen0214}@gmail.com,{jianguo.li, shoumeng.yan}@intel.com, + gh349@cornell.edu, zcs@mail.tsinghua.edu.cn + + + + Abstract However, larger CNNs, although with stronger represen- + tation power, are more resource-hungry. For instance, a + The deployment of deep convolutional neural networks 152-layer ResNet [14] has more than 60 million parame- + (CNNs) in many real world applications is largely hindered ters and requires more than 20 Giga float-point-operations + by their high computational cost. In this paper, we propose (FLOPs) when inferencing an image with resolution 224× + a novel learning scheme for CNNs to simultaneously 1) re- 224. This is unlikely to be affordable on resource con- + duce the model size; 2) decrease the run-time memory foot- strained platforms such as mobile devices, wearables or In- + print; and 3) lower the number of computing operations, ternet of Things (IoT) devices. + without compromising accuracy. This is achieved by en- The deployment of CNNs in real world applications areforcing channel-level sparsity in the network in a simple but mostly constrained by1) Model size: CNNs’ strong repre-effective way. Different from many existing approaches, the sentation power comes from their millions of trainable pa-proposed method directly applies to modern CNN architec- rameters. 
Those parameters, along with network structuretures, introduces minimum overhead to the training process, information, need to be stored on disk and loaded into mem-and requires no special software/hardware accelerators for ory during inference time. As an example, storing a typi-the resulting models. We call our approachnetwork slim- cal CNN trained on ImageNet consumes more than 300MBming, which takes wide and large networks as input mod- space, which is a big resource burden to embedded devices.els, but during training insignificant channels are automat- 2) Run-time memory: During inference time, the interme-ically identified and pruned afterwards, yielding thin and diate activations/responses of CNNs could even take morecompact models with comparable accuracy. We empirically memory space than storing the model parameters, even withdemonstrate the effectiveness of our approach with several batch size 1. This is not a problem for high-end GPUs, butstate-of-the-art CNN models, including VGGNet, ResNet unaffordable for many applications with low computationaland DenseNet, on various image classification datasets. For power.3) Number of computing operations:The convolu-VGGNet, a multi-pass version of network slimming gives a tion operations are computationally intensive on high reso-20×reduction in model size and a 5×reduction in comput- lution images. A large CNN may take several minutes toing operations. process one single image on a mobile device, making it un- + realistic to be adopted for real applications. + 1. Introduction Many works have been proposed to compress large + CNNs or directly learn more efficient CNN models for fast + In recent years, convolutional neural networks (CNNs) inference. These include low-rank approximation [7], net- + have become the dominant approach for a variety of com- work quantization [3, 12] and binarization [28, 6], weight + puter vision tasks, e.g., image classification [22], object pruning [12], dynamic inference [16], etc. However, most + detection [8], semantic segmentation [26]. Large-scale of these methods can only address one or two challenges + datasets, high-end modern GPUs and new network architec- mentioned above. Moreover, some of the techniques require + tures allow the development of unprecedented large CNN specially designed software/hardware accelerators for exe- + models. For instance, from AlexNet [22], VGGNet [31] and cution speedup [28, 6, 12]. + GoogleNet [34] to ResNets [14], the ImageNet Classifica- Another direction to reduce the resource consumption of + tion Challenge winner models have evolved from 8 layers large CNNs is to sparsify the network. Sparsity can be im- + to more than 100 layers. posed on different level of structures [2, 37, 35, 29, 25], + ∗ This work was done when Zhuang Liu and Zhiqiang Shen were interns which yields considerable model-size compression and in- + at Intel Labs China. Jianguo Li is the corresponding author. ference speedup. However, these approaches generally re- + + + + 2736 channel scaling channel scaling i-thconv-layer factors (i+1)=j-th i-thconv-layer factors (i+1)=j-th + conv-layer conv-layer Ci1 1.170 C 1.170 + C C i1 + i2 0.001 j1 Cj1 + Ci3 0.290 pruning Ci3 0.290 + C 0.003 Ci4 j2 Cj2 + … … … + … … + … + + C Cin 0.820 in 0.820 + initial network compact network + Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. 
Sparsity + regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small + scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then + fine-tuned to achieve comparable (or even higher) accuracy as normally trained full network. + + quire special software/hardware accelerators to harvest the Low-rank Decompositionapproximates weight matrix in + gain in memory or time savings, though it is easier than neural networks with low-rank matrix using techniques like + non-structured sparse weight matrix as in [12]. Singular Value Decomposition (SVD) [7]. This method + In this paper, we proposenetwork slimming, a simple works especially well on fully-connected layers, yield- + yet effective network training scheme, which addresses all ing∼3x model-size compression however without notable + the aforementioned challenges when deploying large CNNs speed acceleration, since computing operations in CNN + under limited resources. Our approach imposes L1 regular- mainly come from convolutional layers. + ization on the scaling factors in batch normalization (BN) Weight Quantization. HashNet [3] proposes to quantizelayers, thus it is easy to implement without introducing any the network weights. Before training, network weights arechange to existing CNN architectures. Pushing the val- hashed to different groups and within each group weightues of BN scaling factors towards zero with L1 regulariza- the value is shared. In this way only the shared weights andtion enables us to identify insignificant channels (or neu- hash indices need to be stored, thus a large amount of stor-rons), as each scaling factor corresponds to a specific con- age space could be saved. [12] uses a improved quantizationvolutional channel (or a neuron in a fully-connected layer). technique in a deep compression pipeline and achieves 35xThis facilitates the channel-level pruning at the followed to 49x compression rates on AlexNet and VGGNet. How-step. The additional regularization term rarely hurt the per- ever, these techniques can neither save run-time memoryformance. In fact, in some cases it leads to higher gen- nor inference time, since during inference shared weightseralization accuracy. Pruning unimportant channels may need to be restored to their original positions.sometimes temporarily degrade the performance, but this [28, 6] quantize real-valued weights into binary/ternaryeffect can be compensated by the followed fine-tuning of weights (weight values restricted to{−1,1}or{−1,0,1}).the pruned network. After pruning, the resulting narrower This yields a large amount of model-size saving, and signifi-network is much more compact in terms of model size, run- cant speedup could also be obtained given bitwise operationtime memory, and computing operations compared to the libraries. However, this aggressive low-bit approximationinitial wide network. The above process can be repeated method usually comes with a moderate accuracy loss. for several times, yielding a multi-pass network slimming + scheme which leads to even more compact network. Weight Pruning / Sparsifying.[12] proposes to prune the + Experiments on several benchmark datasets and different unimportant connections with small weights in trained neu- + network architectures show that we can obtain CNN models ral networks. 
The resulting network’s weights are mostly + with up to 20x mode-size compression and 5x reduction in zeros thus the storage space can be reduced by storing the + computing operations of the original ones, while achieving model in a sparse format. However, these methods can only + the same or even higher accuracy. Moreover, our method achieve speedup with dedicated sparse matrix operation li- + achieves model compression and inference speedup with braries and/or hardware. The run-time memory saving is + conventional hardware and deep learning software pack- also very limited since most memory space is consumed by + ages, since the resulting narrower model is free of any the activation maps (still dense) instead of the weights. + sparse storing format or computing operations. In [12], there is no guidance for sparsity during training. + [32] overcomes this limitation by explicitly imposing sparse + 2. Related Work constraint over each weight with additional gate variables, + and achieve high compression rates by pruning connections + In this section, we discuss related work from five aspects. with zero gate values. This method achieves better com- + + + + 2737 pression rate than [12], but suffers from the same drawback. Advantages of Channel-level Sparsity. As discussed in + prior works [35, 23, 11], sparsity can be realized at differ-Structured Pruning / Sparsifying. Recently, [23] pro- ent levels, e.g., weight-level, kernel-level, channel-level orposes to prune channels with small incoming weights in layer-level. Fine-grained level (e.g., weight-level) sparsitytrained CNNs, and then fine-tune the network to regain gives the highest flexibility and generality leads to higheraccuracy. [2] introduces sparsity by random deactivat- compression rate, but it usually requires special software oring input-output channel-wise connections in convolutional hardware accelerators to do fast inference on the sparsifiedlayers before training, which also yields smaller networks model [11]. On the contrary, the coarsest layer-level spar-with moderate accuracy loss. Compared with these works, sity does not require special packages to harvest the infer-we explicitly impose channel-wise sparsity in the optimiza- ence speedup, while it is less flexible as some whole layerstion objective during training, leading to smoother channel need to be pruned. In fact, removing layers is only effec-pruning process and little accuracy loss. tive when the depth is sufficiently large, e.g., more than 50[37] imposes neuron-level sparsity during training thus layers [35, 18]. In comparison, channel-level sparsity pro-some neurons could be pruned to obtain compact networks. vides a nice tradeoff between flexibility and ease of imple-[35] proposes a Structured Sparsity Learning (SSL) method mentation. It can be applied to any typical CNNs or fully-to sparsify different level of structures (e.g. filters, channels connected networks (treat each neuron as a channel), andor layers) in CNNs. Both methods utilize group sparsity the resulting network is essentially a “thinned” version ofregualarization during training to obtain structured spar- the unpruned network, which can be efficiently inferenced sity. Instead of resorting to group sparsity on convolu- on conventional CNN platforms.tional weights, our approach imposes simple L1 sparsity on + channel-wise scaling factors, thus the optimization objec- Challenges. Achieving channel-level sparsity requires + tive is much simpler. 
pruning all the incoming and outgoing connections asso- + Since these methods prune or sparsify part of the net- ciated with a channel. This renders the method of directly + work structures (e.g., neurons, channels) instead of individ- pruning weights on a pre-trained model ineffective, as it is + ual weights, they usually require less specialized libraries unlikely that all the weights at the input or output end of + (e.g. for sparse computing operation) to achieve inference a channel happen to have near zero values. As reported in + speedup and run-time memory saving. Our network slim- [23], pruning channels on pre-trained ResNets can only lead + ming also falls into this category, with absolutely no special to a reduction of∼10% in the number of parameters without + libraries needed to obtain the benefits. suffering from accuracy loss. [35] addresses this problem + by enforcing sparsity regularization into the training objec-Neural Architecture Learning. While state-of-the-art tive. Specifically, they adoptgroup LASSOto push all theCNNs are typically designed by experts [22, 31, 14], there filter weights corresponds to the same channel towards zeroare also some explorations on automatically learning net- simultaneously during training. However, this approach re-work architectures. [20] introduces sub-modular/super- quires computing the gradients of the additional regulariza-modular optimization for network architecture search with tion term with respect to all the filter weights, which is non-a given resource budget. Some recent works [38, 1] propose trivial. We introduce a simple idea to address the aboveto learn neural architecture automatically with reinforce- challenges, and the details are presented below.ment learning. The searching space of these methods are + extremely large, thus one needs to train hundreds of mod- Scaling Factors and Sparsity-induced Penalty.Our idea + els to distinguish good from bad ones. Network slimming is introducing a scaling factorγfor each channel, which is + can also be treated as an approach for architecture learning, multiplied to the output of that channel. Then we jointly + despite the choices are limited to the width of each layer. train the network weights and these scaling factors, with + However, in contrast to the aforementioned methods, net- sparsity regularization imposed on the latter. Finally we + work slimming learns network architecture through only a prune those channels with small factors, and fine-tune the + single training process, which is in line with our goal of pruned network. Specifically, the training objective of our + efficiency. approach is given by + + 3. Network slimming L= l(f(x,W),y) +λ g(γ) (1) + (x,y) γ∈Γ We aim to provide a simple scheme to achieve channel- + level sparsity in deep CNNs. In this section, we first dis- where(x,y)denote the train input and target,Wdenotes + cuss the advantages and challenges of channel-level spar- the trainable weights, the first sum-term corresponds to the + sity, and introduce how we leverage the scaling layers in normal training loss of a CNN,g(·)is a sparsity-induced + batch normalization to effectively identify and prune unim- penalty on the scaling factors, andλbalances the two terms. + portant channels in the network. In our experiment, we chooseg(s) =|s|, which is known as + + + + 2738 convolution layers. 
2), if we insert a scaling layer before + a BN layer, the scaling effect of the scaling layer will be + Train with Prune channels Initial Fine-tune the Compact completely canceled by the normalization process in BN. channel sparsity with small network pruned network networkregularization scaling factors 3), if we insert scaling layer after BN layer, there are two + consecutive scaling factors for each channel. Figure 2: Flow-chart of network slimming procedure. The dotted- + line is for the multi-pass/iterative scheme. Channel Pruning and Fine-tuning.After training under + channel-level sparsity-induced regularization, we obtain a + L1-norm and widely used to achieve sparsity. Subgradient model in which many scaling factors are near zero (see Fig- + descent is adopted as the optimization method for the non- ure 1). Then we can prune channels with near-zero scaling + smooth L1 penalty term. An alternative option is to replace factors, by removing all their incoming and outgoing con- + the L1 penalty with the smooth-L1 penalty [30] to avoid nections and corresponding weights. We prune channels + using sub-gradient at non-smooth point. with a global threshold across all layers, which is defined + As pruning a channel essentially corresponds to remov- as a certain percentile of all the scaling factor values. For + ing all the incoming and outgoing connections of that chan- instance, we prune 70% channels with lower scaling factors + nel, we can directly obtain a narrow network (see Figure 1) by choosing the percentile threshold as 70%. By doing so, + without resorting to any special sparse computation pack- we obtain a more compact network with less parameters and + ages. The scaling factors act as the agents for channel se- run-time memory, as well as less computing operations. + lection. As they are jointly optimized with the network Pruning may temporarily lead to some accuracy loss, + weights, the network can automatically identity insignifi- when the pruning ratio is high. But this can be largely com- + cant channels, which can be safely removed without greatly pensated by the followed fine-tuning process on the pruned + affecting the generalization performance. network. In our experiments, the fine-tuned narrow network + Leveraging the Scaling Factors in BN Layers.Batch nor- can even achieve higher accuracy than the original unpruned + malization [19] has been adopted by most modern CNNs network in many cases. + as a standard approach to achieve fast convergence and bet- Multi-pass Scheme. We can also extend the proposedter generalization performance. The way BN normalizes method from single-pass learning scheme (training withthe activations motivates us to design a simple and effi- sparsity regularization, pruning, and fine-tuning) to a multi-cient method to incorporates the channel-wise scaling fac- pass scheme. Specifically, a network slimming proceduretors. Particularly, BN layer normalizes the internal activa- results in a narrow network, on which we could again applytions using mini-batch statistics. Letzin andzout be the the whole training procedure to learn an even more compactinput and output of a BN layer,Bdenotes the current mini- model. This is illustrated by the dotted-line in Figure 2. Ex-batch, BN layer performs the following transformation: perimental results show that this multi-pass scheme can lead + to even better results in terms of compression rate.zzˆ= in −µ B ; zσ2 +ǫ out =γzˆ+β (2) Handling Cross Layer Connections and Pre-activation B Structure. 
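Before turning to cross-layer connections, a minimal PyTorch-style sketch of the two steps just described may help: the subgradient of the L1 term λ·Σ|γ| in (1) is added to the gradients of the BN scaling factors after the usual backward pass, and pruning then compares every |γ| in the network against a single global percentile threshold. This is a sketch under stated assumptions, not the authors' released implementation; the training-loop comments name hypothetical objects (model, criterion, optimizer).

```python
import torch
import torch.nn as nn

def add_l1_subgradient(model, lam=1e-4):
    """After loss.backward(), add the subgradient of lam * sum(|gamma|) to the
    gradients of all BN scaling factors (gamma = the BN weight)."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(lam * torch.sign(m.weight.data))

def global_threshold_masks(model, prune_ratio=0.7):
    """Collect |gamma| from every BN layer, take a single global percentile as the
    threshold, and return a boolean keep-mask per BN layer (channels to retain)."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.sort(gammas)[0][int(prune_ratio * gammas.numel())]
    return [m.weight.data.abs() > threshold
            for m in model.modules()
            if isinstance(m, nn.BatchNorm2d)]

# Typical training step (model, criterion, optimizer, data assumed to exist):
#   loss = criterion(model(x), y)
#   loss.backward()
#   add_l1_subgradient(model, lam=1e-4)   # sparsity on the scaling factors
#   optimizer.step()
# After training, masks = global_threshold_masks(model, 0.7) identify the channels
# to remove; a narrower network is then built from the kept channels and fine-tuned.
```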
The network slimming process introduced + whereµB andσB are the mean and standard deviation val- above can be directly applied to most plain CNN architec- + ues of input activations overB,γandβare trainable affine tures such as AlexNet [22] and VGGNet [31]. While some + transformation parameters (scale and shift) which provides adaptations are required when it is applied to modern net- + the possibility of linearly transforming normalized activa- works withcross layer connectionsand thepre-activation + tions back to any scales. design such as ResNet [15] and DenseNet [17]. For these + It is common practice to insert a BN layer after a convo- networks, the output of a layer may be treated as the input + lutional layer, with channel-wise scaling/shifting parame- of multiple subsequent layers, in which a BN layer is placed + ters. Therefore, we can directly leverage theγparameters in before the convolutional layer. In this case, the sparsity is + BN layers as the scaling factors we need for network slim- achieved at the incoming end of a layer, i.e., the layer selec- + ming. It has the great advantage of introducing no overhead tively uses a subset of channels it received. To harvest the + to the network. In fact, this is perhaps also the most effec- parameter and computation savings at test time, we need + tive way we can learn meaningful scaling factors for chan- to place achannel selectionlayer to mask out insignificant + nel pruning.1), if we add scaling layers to a CNN without channels we have identified. + BN layer, the value of the scaling factors are not meaning- + ful for evaluating the importance of a channel, because both 4. Experiments convolution layers and scaling layers are linear transforma- + tions. One can obtain the same results by decreasing the We empirically demonstrate the effectiveness of network + scaling factor values while amplifying the weights in the slimming on several benchmark datasets. 
We implement + + + + 2739 (a) Test Errors on CIFAR-10 + Model Test error (%) Parameters Pruned FLOPs Pruned + VGGNet (Baseline) 6.34 20.04M - 7.97×10 8 - + VGGNet (70% Pruned) 6.20 2.30M 88.5% 3.91×10 8 51.0% + DenseNet-40 (Baseline) 6.11 1.02M - 5.33×10 8 - + DenseNet-40 (40% Pruned) 5.19 0.66M 35.7% 3.81×10 8 28.4% + DenseNet-40 (70% Pruned) 5.65 0.35M 65.2% 2.40×10 8 55.0% + ResNet-164 (Baseline) 5.42 1.70M - 4.99×10 8 - + ResNet-164 (40% Pruned) 5.08 1.44M 14.9% 3.81×10 8 23.7% + ResNet-164 (60% Pruned) 5.27 1.10M 35.2% 2.75×10 8 44.9% + + (b) Test Errors on CIFAR-100 + Model Test error (%) Parameters Pruned FLOPs Pruned + VGGNet (Baseline) 26.74 20.08M - 7.97×10 8 - + VGGNet (50% Pruned) 26.52 5.00M 75.1% 5.01×10 8 37.1% + DenseNet-40 (Baseline) 25.36 1.06M - 5.33×10 8 - + DenseNet-40 (40% Pruned) 25.28 0.66M 37.5% 3.71×10 8 30.3% + DenseNet-40 (60% Pruned) 25.72 0.46M 54.6% 2.81×10 8 47.1% + ResNet-164 (Baseline) 23.37 1.73M - 5.00×10 8 - + ResNet-164 (40% Pruned) 22.87 1.46M 15.5% 3.33×10 8 33.3% + ResNet-164 (60% Pruned) 23.91 1.21M 29.7% 2.47×10 8 50.6% + (c) Test Errors on SVHN + Model Test Error (%) Parameters Pruned FLOPs Pruned + VGGNet (Baseline) 2.17 20.04M - 7.97×10 8 - + VGGNet (60% Pruned) 2.06 3.04M 84.8% 3.98×10 8 50.1% + DenseNet-40 (Baseline) 1.89 1.02M - 5.33×10 8 - + DenseNet-40 (40% Pruned) 1.79 0.65M 36.3% 3.69×10 8 30.8% + DenseNet-40 (60% Pruned) 1.81 0.44M 56.6% 2.67×10 8 49.8% + ResNet-164 (Baseline) 1.78 1.70M - 4.99×10 8 - + ResNet-164 (40% Pruned) 1.85 1.46M 14.5% 3.44×10 8 31.1% + ResNet-164 (60% Pruned) 1.81 1.12M 34.3% 2.25×10 8 54.9% + Table 1: Results on CIFAR and SVHN datasets. “Baseline” denotes normal training without sparsity regularization. In column-1, “60% + pruned” denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratio of parameters + and FLOPs are also shown in column-4&6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy + could typically be maintained with≥60% channels pruned. + + our method based on the publicly available Torch [5] im- images, from which we split a validation set of 6,000 im- + plementation for ResNets by [10]. The code is available at ages for model selection during training. The test set con- + https://github.com/liuzhuang13/slimming. tains 26,032 images. During training, we select the model + with the lowest validation error as the model to be pruned + 4.1. Datasets (or the baseline model). We also report the test errors of the + models with lowest validation errors during fine-tuning.CIFAR.The two CIFAR datasets [21] consist of natural im- + ages with resolution 32×32. CIFAR-10 is drawn from 10 + and CIFAR-100 from 100 classes. The train and test sets ImageNet. The ImageNet dataset contains 1.2 millioncontain 50,000 and 10,000 images respectively. On CIFAR- training images and 50,000 validation images of 100010, a validation set of 5,000 images is split from the training classes. We adopt the data augmentation scheme as in [10].set for the search ofλ(in Equation 1) on each model. We We report the single-center-crop validation error of the finalreport the final test errors after training or fine-tuning on model.all training images. A standard data augmentation scheme + (shifting/mirroring) [14, 18, 24] is adopted. The input data + is normalized using channel means and standard deviations. MNIST.MNIST is a handwritten digit dataset containingWe also compare our method with [23] on CIFAR datasets. 
60,000 training images and 10,000 test images. To test the + SVHN.The Street View House Number (SVHN) dataset effectiveness of our method on a fully-connected network + [27] consists of 32x32 colored digit images. Following (treating each neuron as a channel with 1×1 spatial size), + common practice [9, 18, 24] we use all the 604,388 training we compare our method with [35] on this dataset. + + + + 2740 4.2. Network Models Model Parameter and FLOP Savings + On CIFAR and SVHN dataset, we evaluate our method 100 100.0% 100.0% 100.0% Original + Parameter Ratio + on three popular network architectures: VGGNet[31], 80 FLOPs Ratio + ResNet [14] and DenseNet [17]. The VGGNet is originally + + Ratio (%) 64.8% + 60 + designed for ImageNet classification. For our experiment a 55.1% + 49.0% 45.0% + variation of the original VGGNet for CIFAR dataset is taken 40 34.8% + from [36]. For ResNet, a 164-layer pre-activation ResNet 20 11.5% + with bottleneck structure (ResNet-164) [15] is used. For 0 + DenseNet, we use a 40-layer DenseNet with growth rate 12 VGGNet DenseNet-40 ResNet-164 + (DenseNet-40). Figure 3: Comparison of pruned models withlowertest errors on On ImageNet dataset, we adopt the 11-layer (8-conv + CIFAR-10 than the original models. The blue and green bars are 3 FC) “VGG-A” network [31] model with batch normaliza- parameter and FLOP ratios between pruned and original models. + tion from [4]. We remove the dropout layers since we use + relatively heavy data augmentation. To prune the neurons mented by building a new narrower model and copying the + in fully-connected layers, we treat them as convolutional corresponding weights from the model trained with sparsity. + channels with 1×1 spatial size. + On MNIST dataset, we evaluate our method on the same Fine-tuning.After the pruning we obtain a narrower and + 3-layer fully-connected network as in [35]. more compact model, which is then fine-tuned. On CIFAR, + SVHN and MNIST datasets, the fine-tuning uses the same + 4.3. Training, Pruning and Fine­tuning optimization setting as in training. For ImageNet dataset, + due to time constraint, we fine-tune the pruned VGG-A withNormal Training.We train all the networks normally from a learning rate of 10 −3 for only 5 epochs.scratch as baselines. All the networks are trained using + SGD. On CIFAR and SVHN datasets we train using mini- 4.4. Results batch size 64 for 160 and 20 epochs, respectively. The ini- + tial learning rate is set to 0.1, and is divided by 10 at 50% CIFAR and SVHNThe results on CIFAR and SVHN are + and 75% of the total number of training epochs. On Im- shown in Table 1. We mark all lowest test errors of a model + ageNet and MNIST datasets, we train our models for 60 inboldface. + and 30 epochs respectively, with a batch size of 256, and an Parameter and FLOP reductions. The purpose of net-initial learning rate of 0.1 which is divided by 10 after 1/3 work slimming is to reduce the amount of computing re-and 2/3 fraction of training epochs. We use a weight de- sources needed. The last row of each model has≥60%cay of10 −4 and a Nesterov momentum [33] of 0.9 without channels pruned while still maintaining similar accuracy todampening. The weight initialization introduced by [13] is the baseline. The parameter saving can be up to 10×. Theadopted. Our optimization settings closely follow the orig- FLOP reductions are typically around50%. To highlightinal implementation at [10]. 
In all our experiments, we ini- network slimming’s efficiency, we plot the resource sav-tialize all channel scaling factors to be 0.5, since this gives ings in Figure 3. It can be observed that VGGNet has ahigher accuracy for the baseline models compared with de- large amount of redundant parameters that can be pruned.fault setting (all initialized to be 1) from [10]. On ResNet-164 the parameter and FLOP savings are rel- + Training with Sparsity.For CIFAR and SVHN datasets, atively insignificant, we conjecture this is due to its “bot- + when training with channel sparse regularization, the hyper- tleneck” structure has already functioned as selecting chan- + parameteerλ, which controls the tradeoff between empiri- nels. Also, on CIFAR-100 the reduction rate is typically + cal loss and sparsity, is determined by a grid search over slightly lower than CIFAR-10 and SVHN, which is possi- + 10 −3 , 10 −4 , 10 −5 on CIFAR-10 validation set. For VG- bly due to the fact that CIFAR-100 contains more classes. + GNet we chooseλ=10 −4 and for ResNet and DenseNet Regularization Effect.From Table 1, we can observe that,λ=10 −5 . For VGG-A on ImageNet, we setλ=10 −5 . All on ResNet and DenseNet, typically when40%channels areother settings are kept the same as in normal training. pruned, the fine-tuned network can achieve a lower test er- + Pruning.When we prune the channels of models trained ror than the original models. For example, DenseNet-40 + with sparsity, a pruning threshold on the scaling factors with 40% channels pruned achieve a test error of 5.19% + needs to be determined. Unlike in [23] where different lay- on CIFAR-10, which is almost 1% lower than the original + ers are pruned by different ratios, we use a global pruning model. We hypothesize this is due to the regularization ef- + threshold for simplicity. The pruning threshold is deter- fect of L1 sparsity on channels, which naturally provides + mined by a percentile among all scaling factors , e.g., 40% feature selection in intermediate layers of a network. We + or 60% channels are pruned. The pruning process is imple- will analyze this effect in the next section. + + + + 2741 VGG-A Baseline 50% Pruned (a) Multi-pass Scheme on CIFAR-10 + Params 132.9M 23.2M IterTrained Fine-tunedParams PrunedFLOPs Pruned + Params Pruned - 82.5% 1 6.38 6.51 66.7% 38.6% + FLOPs 4.57×10 10 3.18×10 10 2 6.23 6.11 84.7% 52.7% + FLOPs Pruned - 30.4% 3 5.87 6.10 91.4% 63.1% + Validation Error (%) 36.69 36.66 4 6.19 6.59 95.6% 77.2% + 5 5.96 7.73 98.3% 88.7% + Table 2: Results on ImageNet. 6 7.79 9.70 99.4% 95.7% + + Model Test Error (%)Params Pruned #Neurons (b) Multi-pass Scheme on CIFAR-100 + Baseline 1.43 - 784-500-300-10 IterTrained Fine-tunedParams PrunedFLOPs Pruned + Pruned [35] 1.53 83.5% 434-174-78-10 1 27.72 26.52 59.1% 30.9% + Pruned (ours) 1.49 84.4% 784-100-60-10 2 26.03 26.52 79.2% 46.1% + 3 26.49 29.08 89.8% 67.3% + Table 3: Results on MNIST. 4 28.17 30.59 95.3% 83.0% + 5 30.04 36.35 98.3% 93.5% + 6 35.91 46.73 99.4% 97.7% + ImageNet. The results for ImageNet dataset are summa- + rized in Table 2. When 50% channels are pruned, the pa- Table 4: Results for multi-pass scheme on CIFAR-10 and CIFAR- + rameter saving is more than 5×, while the FLOP saving 100 datasets, using VGGNet. The baseline model has test errors of + is only 30.4%. This is due to the fact that only 378 (out 6.34% and 26.74%. 
4.4. Results

CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark the lowest test error of each model in boldface.

Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥60% of its channels pruned while still maintaining accuracy similar to the baseline. The parameter saving can be up to 10×, and the FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as a form of channel selection. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.

Regularization Effect. From Table 1, we can observe that on ResNet and DenseNet, typically when 40% of the channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.

ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of the channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve the savings with no accuracy loss on the 1000-class ImageNet dataset, where other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.

Table 2: Results on ImageNet.
VGG-A                 Baseline      50% Pruned
Params                132.9M        23.2M
Params Pruned         -             82.5%
FLOPs                 4.57×10^10    3.18×10^10
FLOPs Pruned          -             30.4%
Validation Error (%)  36.69         36.66

MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well for pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, thus we prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.

Table 3: Results on MNIST.
Model          Test Error (%)   Params Pruned   #Neurons
Baseline       1.43             -               784-500-300-10
Pruned [35]    1.53             83.5%           434-174-78-10
Pruned (ours)  1.49             84.4%           784-100-60-10

We provide some additional experimental results in the supplementary materials, including (1) the detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; and (3) a comparison with a previous channel pruning method [23].

4.5. Results for Multi-pass Scheme

We employ the multi-pass scheme on the CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer will completely destroy the model. Thus, besides setting the percentile threshold to 50%, we also put a constraint that at each layer at most 50% of the channels can be pruned.

The test errors of the models in each iteration are shown in Table 4.

Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity and the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.

(a) Multi-pass scheme on CIFAR-10
Iter   Trained   Fine-tuned   Params Pruned   FLOPs Pruned
1      6.38      6.51         66.7%           38.6%
2      6.23      6.11         84.7%           52.7%
3      5.87      6.10         91.4%           63.1%
4      6.19      6.59         95.6%           77.2%
5      5.96      7.73         98.3%           88.7%
6      7.79      9.70         99.4%           95.7%

(b) Multi-pass scheme on CIFAR-100
Iter   Trained   Fine-tuned   Params Pruned   FLOPs Pruned
1      27.72     26.52        59.1%           30.9%
2      26.03     26.52        79.2%           46.1%
3      26.49     29.08        89.8%           67.3%
4      28.17     30.59        95.3%           83.0%
5      30.04     36.35        98.3%           93.5%
6      35.91     46.73        99.4%           97.7%

As the pruning process goes on, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves 20× parameter reduction and 5× FLOP reduction, while still achieving a lower test error. On CIFAR-100, after iteration 3 the test error begins to increase. This is possibly because it contains more classes than CIFAR-10, so pruning channels too aggressively will inevitably hurt performance. However, we can still prune nearly 90% of the parameters and nearly 70% of the FLOPs without notable accuracy loss.
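The multi-pass scheme alternates the three steps described above, feeding each pruned and fine-tuned model into the next iteration. The sketch below is schematic only: the step functions are placeholders for the procedures of Sections 4.3 and are passed in as callables, so nothing here should be read as the authors' implementation.

```python
# A schematic sketch of the multi-pass scheme: alternate sparsity training,
# global channel pruning and fine-tuning, reusing each pruned model as the
# starting point of the next iteration. All step functions are placeholders.
from typing import Any, Callable, List, Tuple

def multi_pass(model: Any,
               train_with_sparsity: Callable[[Any], Any],
               prune_channels: Callable[[Any, float], Any],
               fine_tune: Callable[[Any], Any],
               evaluate: Callable[[Any], float],
               n_iters: int = 6,
               prune_ratio: float = 0.5) -> Tuple[Any, List[Tuple[int, float]]]:
    history = []
    for it in range(1, n_iters + 1):
        model = train_with_sparsity(model)           # L1 on BN scaling factors
        model = prune_channels(model, prune_ratio)   # global threshold, capped per layer
        model = fine_tune(model)
        history.append((it, evaluate(model)))        # test error after fine-tuning
    return model, history
```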
5. Analysis

There are two crucial hyper-parameters in network slimming: the pruned percentage t and the coefficient λ of the sparsity regularization term (see Equation 1). In this section, we analyze their effects in more detail.

Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ = 10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.

[Figure 5: test error (%) versus percentage of pruned channels, with curves for the baseline, the model trained with sparsity, the pruned model, and the fine-tuned model.]
Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ = 10^-5.

From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate for the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on channel scaling factors.

Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter λ in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network for different λ values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.

[Figure 4: histograms of scaling factor values for λ = 0, λ = 10^-5 and λ = 10^-4.]
Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). With the increase of λ, scaling factors become sparser.

It can be observed that with the increase of λ, the scaling factors become more and more concentrated near zero. When λ = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as a feature selection happening in the intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process with a heatmap. Figure 6 shows the magnitude of the scaling factors from one layer in VGGNet along the training process. Each channel starts with equal weights; as the training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker).

[Figure 6: heatmap of channel scaling factor magnitudes, channel index versus training epoch.]
Figure 6: Visualization of the change of channel scaling factors along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the "selected" channels, the dark lines indicate channels that can be pruned.

6. Conclusion

We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20×) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory and computing operations, while introducing minimal overhead to the training process, and the resulting models require no special libraries or hardware for efficient inference.

Acknowledgements. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015).
Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008 / DFG TRR-169.

References
[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar.torch.
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.
[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and supermodularity. arXiv preprint arXiv:1609.00074, 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.
[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286–297, 2007.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch.
[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
\ No newline at end of file
diff --git a/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt b/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt
new file mode 100644
index 0000000..643bfe2
Binary files /dev/null and b/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt differ
diff --git a/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt b/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt
new file mode 100644
index 0000000..4c089fe
Binary files /dev/null and b/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt differ
diff --git a/Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt b/Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt
new file mode 100644
index 0000000..bdcb2b8
Binary files /dev/null and b/Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt differ
diff --git a/Corpus/Learning to Generalize.txt b/Corpus/Learning to Generalize.txt
new file mode 100644
index 0000000..dac9877
--- /dev/null
+++ b/Corpus/Learning to Generalize.txt
@@ -0,0 +1,933 @@

SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING

MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.

Learning to Generalize

Introduction

Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer.
A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data. How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: The network may not be sufficiently complex to learn the rule completely or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data.
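As a small numerical illustration of this definition (not taken from the article), the generalization error of a trained classifier can be estimated by drawing fresh random inputs and averaging its disagreement with the rule. For concreteness the sketch below takes both the rule ("teacher") and the trained network ("student") to be perceptrons, a picture the article introduces later; the Gaussian input distribution and all names are assumptions made for this example.

```python
# Estimate the generalization error as the probability of misclassifying a
# fresh input: draw new random patterns and compare the student's output with
# the teacher's rule. Purely illustrative; distributions and names are assumed.
import numpy as np

def perceptron_output(w, x):
    return np.sign(x @ w)          # +1 / -1 classification

def estimate_generalization_error(w_student, w_teacher, n_test=100_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_test, w_teacher.size))   # random test inputs
    disagree = perceptron_output(w_student, X) != perceptron_output(w_teacher, X)
    return disagree.mean()
```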
Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit, where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data.

A second important approach, which was introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, they are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide too pessimistic results which are also too crude to reveal interesting behavior in the intermediate region of the learning curve.

In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables the exact calculations of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures.

At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well-known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feedforward neural networks.
Artificial Neural Networks

Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers. Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multiplied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, Σ_i w_i x_i, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between -1 and +1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.

[Figure 1a shows a unit with inputs 0.6, -0.9, 0.8, synaptic weights 1.6, -1.4, -0.1, and weighted sum 1.6 × 0.6 + (-1.4) × (-0.9) + (-0.1) × 0.8 = 2.14, together with sigmoid, linear, and step activation functions; Figure 1b shows a feedforward network.]
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.

The Perceptron

The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings w_i, and the output is simply

    a = Σ_{i=1}^{N} w_i x_i    [1]

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines.

[Figure 2: (a) the perceptron with inputs x_1, ..., x_N and weights w_1, ..., w_N; (b) classification of inputs by a perceptron with two inputs.]
FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.

To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights w_i, for i = 1, ..., N). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign but we decrease them for the opposite sign.
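A minimal sketch of this learning rule is given below; variable names and the stopping criterion are illustrative assumptions. For ±1-valued inputs, each update changes every weight by the same fixed amount, in the direction given by the product of the input sign and the desired output, as described above.

```python
# A minimal sketch of Rosenblatt's learning rule: cycle through the patterns
# and, whenever one is misclassified, move every weight in the direction given
# by the input and the desired output.
import numpy as np

def rosenblatt_train(X, y, eta=1.0, max_cycles=1000):
    """X: (m, N) input patterns; y: (m,) desired labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_cycles):
        mistakes = 0
        for x, label in zip(X, y):
            if np.sign(x @ w) != label:     # pattern not classified correctly
                w += eta * label * x        # increase/decrease each coupling
                mistakes += 1
        if mistakes == 0:                   # all training examples learned
            break
    return w
```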
This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of learning processes in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classify correctly all of the examples (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant.

It is often useful to obtain an intuition of a perceptron's classification performance by thinking in terms of a geometric picture. We may view the numerical values of the inputs as the coordinates of a point in some (usually) high-dimensional space. The case of two dimensions is shown in Fig. 2b. A corresponding point is also constructed for the couplings w_i. The arrow which points from the origin of the coordinate system to this latter point is called the weight vector or coupling vector. An application of linear algebra to the computation of the network shows that the line which is perpendicular to the coupling vector is the boundary between inputs belonging to the two different classes. Input points which are on the same side as the coupling vector are classified as +1 (the green region in Fig. 2b) and those on the other side as -1 (the red region in Fig. 2b).

Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line of the previous two-dimensional example. We can still obtain an intuitive picture by projecting onto two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled) in a 200-dimensional input space are projected on the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm) we obtain the view shown in Fig. 3b, in which the two classes of points are clearly separated and there is even a gap between the two clouds.

[Figure 3: (a) scatter plot of the points projected onto the coordinate axes x_1 and x_2; (b) the same points projected onto a plane containing the direction of the coupling vector.]
FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x_1 and x_2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron.

It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron.

...tions for small changes of the couplings). Hence, in general, in addition to the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is usually a difficult task which could take a large amount of computer time. However, in practice, iterative approaches, which are based on the minimization of other smooth cost functions, are used to train a neural network (Bishop, 1995).
If this happens, we must at- ◗ + + tempt to determine the choice of the coupling which mini- Capacity, VC Dimension, + mizesthenumberoferrorsonagivensetofexamples.Here, and Worst-Case Generalization + Rosenblatt’s algorithm does not work and the problem of + finding the minimum is much more difficult from the algo- As previously shown, perceptrons are only able to realize a + rithmic point. The training error, which is the number of very restricted type of classification rules, the so-called lin- + errorsmadeonthetrainingset,isusuallyanonsmoothfunc- early separable ones. Hence, independently from the issue + tion of the network couplings (i.e., it may have large varia- of finding the best algorithm to learn the rule, one may ask + + + + + + 766 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 767 + + + + + + + + LEARNING TO GENERALIZE + + + the following question: In how many cases will the percep- exp[Nf(m/N)], where the function f(a) vanishes for + tron be able to learn a given set of training examples per- a2 and it is positive for a2. Such a threshold phe- + fectly if the output labels are chosen arbitrarily? In order to nomenon is an example of a phase transition (i.e., a sharp + answer this question in a quantitative way, it is convenient change of behavior) which can occur in the thermodynamic + tointroducesomeconceptssuchascapacity,VCdimension, limit of a large network size. + andworst-casegeneralization,whichcanbeusedinthecase Generally, the point at which such a transition takesof the perceptron and have a more general meaning. place defines the so-called capacity of the neural network.In the case of perceptrons, this question was answered in Although the capacity measures the ability of a network tothe 1960s by Cover (1965). He calculated for any set of in- learn random mappings of the inputs, it is also related to itsput patterns, e.g., m,the fraction of all the 2 m possible map- ability to learn a rule (i.e., to generalize from examples).pings that can be linearly separated and are thus learnable The question now is, how does the network perform on aby perceptrons. This fraction is shown in Fig. 4 as a func- new example after having been trained to learn mexampletion of the number of examples per coupling for different on the training set?numbers of input nodes (couplings) N.Three regions can To obtain an intuitive idea of the connection betweenbe distinguished: capacity and ability to generalize, we assume a training set + Region in which m/N1: Simple linear algebra shows of size mand a single pattern for test. Suppose we define + that it is always possible to learn all mappings when the a possible rule by an arbitrary learnable mapping from + number mof input patterns is less than or equal to the inputs to outputs. If m1 is much larger than the capac- + number Nof couplings (there are simply enough adjustable ity, then for most rules the labels on the mtraining pat- + parameters). terns which the perceptron is able to recognize will nearly + Region in which m/N1: For this region, there are ex- uniquely determine the couplings (and consequently the + amples of rules that cannot be learned. However, when the answer of the learning algorithm on the test pattern), and + number of examples is less than twice the number of cou- therulecanbeperfectlyunderstoodfromtheexamples.Be- + plings (m/N2), if the network is large enough almost all low capacity, in most cases there are two different choices + mappings can be learned. 
If the output labels for each of of couplings which give opposite answers for the test pat- + the minputs are chosen randomly 1 or 1 with equal tern. Hence, a correct classification will occur with proba- + probability, the probability of finding a nonrealizable cou- bility 0.5 assuming all rules to be equally probable. Figure 5 + pling goes to zero exponentially when Ngoes to infinity at displays the two types of situations form3andN2. + fixed ratio m/N. This intuitive connection can be sharpened. Vapnik and + Region in which m/N2: For m/N2 the probabil- Chervonenkis established a relation between a capacity + ity for a mapping to be realizable by perceptrons decreases such as quantity and the generalization ability that is valid + to zero rapidly and it goes to zero exponentially when N for general classifiers (Vapnik, 1982, 1995). The VC dimen- + goes to infinity at fixed ratio m/N(it is proportional to sion is defined as the size of the largest set of inputs for + which all mappings can be learned by the type of classi- + fier. It equals Nfor the perceptron. Vapnik and Chervo- + 1.0 nenkis were able to show that for any training set of size m + + + + + + + + + + + + + + + fraction of realizable mappings 0.8 + + + 0.6 + + + 0.4 ? ? + + + 0.2 + + + 0.0 a b + 01234 FIGURE 5 Classification rules for four patterns based on a m/N perceptron. The patterns colored in red represent the training + FIGURE 4 Fraction of all mappings of minput patterns examples, and triangles and circles represent different class la- + which are learnable by perceptrons as a function of m/Nfor bels. The question mark is a test pattern. (a) There are two + different numbers of couplings N: N10 (in green), N20 possible ways of classifying the test point consistent with the + (in blue), and N100 (in red). examples; (b) only one classification is possible. + + + + + + PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 767 262-A1677 7/24/01 11:12 AM Page 768 + + + + + + + + MANFRED OPPER + + + larger than the VC dimension D , the growth of the num- blue curve in Fig. 6, the minimal training error will decrease VC + ber of realizable mappings is bounded by an expression for increasing complexity of the nets. On the other hand, + which grows much slower than 2 m (in fact, only like a poly- the VC dimension and the complexity of the networks in- + nomial in m). crease with the increasing number of hidden units, leading + They proved that a large difference between training er- to an increasing expected difference (confidence interval) + ror (i.e., the minimum percentage of errors that is done on between training error and generalization error as indi- + the training set) and generalization error (i.e., the proba- cated by the red curve. The sum of both (green curve) will + bility of producing an error on the test pattern after having have a minimum, giving the smallest bound on the general- + learned the examples) of classifiers is highly improbable if ization error. As discussed later, this procedure will in some + the number of examples is well above D . This theorem cases lead to not very realistic estimates by the rather pes- VC + implies a small expected generalization error for perfect simistic bounds of the theory. In other words, the rigorous + learning of the training set results. 
The expected general- bounds, which are obtained from an arbitrary network and + ization error is bounded by a quantity which increases pro- rule, are much larger than those determined from the re- + portionally to D and decreases (neglecting logarithmic sults for most of the networks and rules. VC + corrections in m) inversely proportional to m. ................................................Conversely, one can construct a worst-case distribution ◗ + + of input patterns, for which a size of the training set larger Typical Scenario: The Approach + than D is also necessary for good generalization. The VC of Statistical Physics VC + results should, in practice, enable us to select the network + with the proper complexity which guarantees the smallest When the number of examples is comparable to the size of + bound on the generalization error. For example, in order the network, which for a perceptron equals the VC dimen- + tofind the proper size of the hidden layer of a network with sion, the VC theory states that one can construct malicious + twolayers,onecouldtrainnetworksofdifferentsizesonthe situations which prevent generalizations. However, in gen- + same data. eral, we would not expect that the world acts as an adver- + The relation among these concepts can be better under- sary. Therefore, how should one model a typical situation? + stood if we consider a family of networks of increasing com- As a first step, one may construct rules and pattern dis- + plexity which have to learn the same rule. A qualitative pic- tributions which act together in a nonadversarial way. The + ture of the results is shown in Fig. 6. As indicated by the teacher–student paradigm has proven to be useful in such a + situation. Here, the rule to be learned is modeled by a sec- + ondnetwork,theteachernetwork;inthiscase,iftheteacher + and the student have the same architecture and the same + upper bound on numberofunits,theruleisevidentlyrealizable.Thecorrect generalization error class labels for any inputs are given by the outputs of the + teacher. Within this framework, it is often possible to ob- + tain simple expressions for the generalization error. For a + upper bound on perceptron, we can use the geometric picture to visualize confidence interval the generalization error. A misclassification of a new in- + put vector by a student perceptron with coupling vector ST + occurs only if the input pattern is between the separating + planes (dashed region in Fig. 7) defined by ST and the vec- + tor of teacher couplings TE. If the inputs are drawn ran- training error domlyfromauniformdistribution,thegeneralizationerror + is directly proportional to the angle between ST and TE. + network complexity Hence, the generalization error is small when teacher and + student vectors are close together and decreases to zero + when both coincide. + In the limit, when the number of examples is very large + all the students which learn the training examples perfectly + will not differ very much from and their couplings will be FIGURE 6 As the complexity of the network varies (i.e., close to those of the teacher. Such cases with a small gen- of the number of hidden units, as shown schematically below), + the generalization error (in red), calculated from the sum of eralization error have been successfully treated by asymp- + the training error (in green) and the confidence interval (in totic methods of statistics. 
On the other hand, when the + blue) according to the theory of Vapnik–Chervonenkis, shows number of examples is relatively small, there are many dif- + a minimum; this corresponds to the network with the best gen- ferent students which are consistent with the teacher re- + eralization ability. garding the training examples, and the uncertainty about + + + + 768 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 769 + + + + + + + + LEARNING TO GENERALIZE + + + with the number of couplings N(like typical volumes in + N-dimensional spaces) and Bdecreases exponentially with + m(because it becomes more improbable to be correct ST mtimes for any e0), both factors can balance each other + when mincreases like maN.ais an effective measure for TE the size of the training set when Ngoes to infinity. In order + to have quantities which remain finite as NSq, it is also + useful to take the logarithm of V(e) and divide by N, which + transforms the product into a sum of two terms. The first + one (which is often called the entropic term) increases with + increasing generalization error (green curve in Fig. 8). This + FIGURE 7 For a uniform distribution of patterns, the gen- is true because there are many networks which are not + eralization error of a perceptron equals the area of the similar to the teacher, but there is only one network equal + shaded region divided by the area of the entire circle. ST and to the teacher. For almost all networks (remember, the + TE represent the coupling vectors of the student and teacher, entropic term does not include the effect of the training ex- + respectively. amples) e0.5, i.e., they are correct half of the time by + random guessing. On the other hand, the second term (red + curve in Fig. 8) decreases with increasing generalization er- + the true couplings of the teacher is large. Possible general- ror because the probability of being correct on an input + ization errors may range from zero (if, by chance, a learn- pattern increases when the student network becomes more + ing algorithm converges to the teacher) to some worst-case similar to the teacher. It is often called the energetic contri- + value. We may say that the constraint which specifies the butionbecauseitfavorshighlyordered(towardtheteacher) + macrostateofthenetwork(itstrainingerror)doesnotspec- network states, reminiscent of the states of physical systems + ify the microstate uniquely. Nevertheless, it makes sense to at low energies. Hence, there will be a maximum (Fig. 8, ar- + speak of a typical value for the generalization error, which row) of V(e) at some value of ewhich by definition is the + is defined as the value which is realized by the majority of typical generalization error. + the students. In the thermodynamic limit known from sta- The development of the learning process as the number + tistical physics, in which the number of parameters of the of examples aNincreases can be understood as a compe- + network is taken to be large, we expect that in fact almost tition between the entropic term, which favors disordered + all students belong to this majority, provided the quantity network configurations that are not similar to the teacher, + of interest is a cooperative effect of all components of the andtheenergeticterm.Thelattertermdominateswhenthe + system. As the geometric visualization for the generaliza- number of examples is large. It will later be shown that such + tion error of the perceptron shows, this is actually the case. 
a competition can lead to a rich and interesting behavior as + The following approach, which was pioneered by Elizabeth the number of examples is varied. The result for the learn- + Gardner (Gardner, 1988; Gardner and Derrida, 1989), is ing curve (Györgyi and Tishby, 1990; Sompolinsky et al., + based on the calculation of V(e), the volume of the space + of couplings which both perfectly implement mtraining + examples and have a given generalization error e. For an + intuitive picture, consider that only discrete values for the entropic contribution + couplings are allowed; then V(e) would be proportional to + the number of students. The typical value of the general- + ization error is the value of e, which maximizes V(e). It + should be kept in mind that V(e) is a random number and energetic contribution + fluctuates from one training set to another. A correct treat- 1/N logfV(ε)g + ment of this randomness requires involved mathematical + techniques (Mézard et al.,1987). To obtain a picture which + is quite often qualitatively correct, we may replace it by its + average over many realizations of training sets. From ele- + mentary probability theory we see that this average num- maximum + ber can be found by calculating the volume Aof the space 0 0.1 0.2 0.3 0.4 0.5 + of all students with generalization error e, irrespective of ε + their behavior on the training set, and multiplying it by FIGURE 8 Logarithm of the average volume of students that + the probability Bthat a student with generalization error e havelearnedmexamplesandgiveegeneralizationerror(green + gives mtimes the correct answers on independent draw- curve). The blue and red curves represent the energetic and + ings of the input patterns. Since Aincreases exponentially entropic contributions, respectively. + + + + PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 769 262-A1677 7/24/01 11:12 AM Page 770 + + + + + + + + MANFRED OPPER + + + 0.5 student is free to ask the teacher questions, i.e., if the stu- + ε dent can choose highly informative input patterns. For the + simple perceptron a fruitful query strategy is to select a new 0.4 input vector which is perpendicular to the current coupling + vector of the student (Kinzel and Ruján, 1990). Such an + 0.3 input is a highly ambiguous pattern because small changes + continuous couplings in the student couplings produce different classification an- + swers. For more complicated networks it may be difficult 0.2 to obtain similar ambiguous inputs by an explicit construc- + tion. A general algorithm has been proposed (Seung et al., + 0.1 1992a) which uses the principle of maximal disagreement discrete couplings in a committee of several students as a selection process for + training patterns. Using an appropriate randomized train- 0.00.0 0.1 0.2 0.3 0.4 0.5 0. 6 ingstrategy,differentstudentsaregeneratedwhichalllearn α the same set of examples. Next, any new input vector is only + FIGURE 9 Learning curves for typical student perceptrons. accepted for training when the disagreement of its classi- + am/Nis the ratio between the number of examples and the fication between the students is maximal. For a committee + coupling number. of two students it can be shown that when the number of + examples is large, the information gain does not decrease + but reaches a positive constant. This results in a much faster + 1990) of a perceptron obtained by the statistical physics ap- decrease of the generalization error. 
Instead of being in- + proach (treating the random sampling the proper way) is versely proportional to the number of examples, the de- + shown by the red curve of Fig. 9. In contrast to the worst- crease is now exponentially fast. + casepredictionsoftheVCtheory,itispossibletohavesome ................................................generalization ability below VC dimension or capacity. As ◗ + + we might have expected, the generalization error decreases Bad Students and Good Students + monotonically, showing that the more that is learned, the + more that is understood. Asymptotically, the error is pro- Although the typical student perceptron has a smooth, + portional to Nand inversely proportional to m, in agree- monotonically decreasing learning curve, the possibility + ment with the VC predictions. This may not be true for that some concrete learning algorithm may result in a set + more complicated networks. of student couplings which are untypical in the sense of + our theory cannot be ruled out. For bad students, even non-................................................ ◗ monotic generalization behavior is possible. The problem + Query Learning of a concrete learning algorithm can be made to fit into the + statistical physics framework if the algorithm minimizes a + Soon after Gardner’s pioneering work, it was realized that certain cost function. Treating the achieved values of the + the approach of statistical physics is closely related to ideas new cost function as a macroscopic constraint, the tools of + in information theory and Bayesian statistics (Levin et al., statistical physics apply again. + 1989;GyörgyiandTishby,1990;OpperandHaussler,1991), As an example, it is convenient to consider a case in + for which the reduction of an initial uncertainty about the which the teacher and the student have a different archi- + true state of a system (teacher) by observing data is a cen- tecture: In one of the simplest examples one tries to learn + tral topic of interest. The logarithm of the volume of rele- a classification problem by interpreting it as a regression + vant microstates as defined in the previous section is a di- problem, i.e., a problem of fitting a continuous function + rect measure for such uncertainty. The moderate progress through data points. To be specific, we study the situation + in generalization ability displayed by the red learning curve in which the teacher network is still given by a percep- + of Fig. 9 can be understood by the fact that as learning pro- tron which computes binary valued outputs of the form + gresses less information about the teacher is gained from a ywx, 1, but as the student we choose a network i i i + newrandomexample.Here,theinformationgainisdefined with a linear transfer function (the yellow curve in Fig. 1a) + as the reduction of the uncertainty when a new example is + learned. The decrease in information gain is due to the in- Y awxi i + crease in the generalization performance. This is plausible i + because inputs for which the majority of student networks and try to fit this linear expression to the binary labels of + give the correct answer are less informative than those for the teacher. If the number of couplings is sufficiently large + which a mistake is more likely. 
The situation changes if the (larger than the number of examples) the linear function + + + + + 770 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 771 + + + + + + + + LEARNING TO GENERALIZE + + + (unlike the sign) is perfectly able to fit arbitrary continuous the student learns all examples perfectly. Although it may + output values. This linear fit is an attempt to explain the not be easy to construct a learning algorithm which per- + data in a more complicated way than necessary, and the forms such a maximization in practice, the resulting gener- + couplings have to be finely tuned in order to achieve this alization error can be calculated using the statistical phys- + goal. We find that the student trained in such a way does ics approach (Engel and Van den Broeck, 1993). The result + not generalize well (Opper and Kinzel, 1995). In order to is in agreement with the VC theory: There is no prediction + compare the classifications of teacher and student on a new better than random guessing below the capacity. + random input after training, we have finally converted the Although the previous algorithms led to a behavior + student’s output into a classification label by taking the sign whichisworsethanthetypicalone,wenowexaminetheop- + of its output. As shown in the red curve of Fig. 10, after positecaseofanalgorithmwhichdoesbetter.Sincethegen- + an initial improvement of performance the generalization eralization ability of a neural network is related to the fact + error increases again to the random guessing value e0.5 that similar input vectors are mapped onto the same out- + at a1 (Fig. 10, red curve). This phenomenon is called put, one can assume that such a property can be enhanced + overfitting.For a1 (i.e., for more data than parameters), if the separating gap between the two classes is maximized, + it is no longer possible to have a perfect linear fit through which defines a new cost function for an algorithm. This + the data, but a fit with a minimal deviation from a linear optimal margin perceptron can be practically realized and + function leads to the second part of the learning curve.ede- when applied to a set of data leads to the projection of + creases again and approaches 0 asymptotically for aSq. Fig. 11. As a remarkable result, it can be seen that there is a + This shows that when enough data are available, the details relatively large fraction of patterns which are located at the + of the training algorithm are less important. gap. These points are called support vectors(SVs). In order + The dependence of the generalization performance on to understand their importance for the generalization abil- + the complexity of the assumed data model is well-known. If ity, we make the following gedankenexperimentand assume + function class is used that is too complex, data values can be that all the points which lie outside the gap (the nonsupport + perfectly fitted but the predicted function will be very sen- vectors) are eliminated from the training set of examples. + sitive to the variations of the data sample, leading to very From the two-dimensional projection of Fig. 11, we may + unreliable predictions on novel inputs. On the other hand, conjecture that by running the maximal margin algorithm + functions that are too simple make the best fit almost insen- on the remaining examples (the SVs) we cannot create a + sitive to the data, which prevents us from learning enough larger gap between the points. Hence, the algorithm will + from them. 
converge to the same separating hyperplane as before. This + It is also possible to calculate the worst-case generaliza- intuitive picture is actually correct. If the SVs of a training + tion ability of perceptron students learning from a percep- set were known beforehand (unfortunately, they are only + tron teacher. The largest generalization error is obtained identified after running the algorithm), the margin classi- + (Fig. 7) when the angle between the coupling vectors of fier would have to be trained only on the SVs. It would au- + teacher and student is maximized under the constraint that tomatically classify the rest of the training inputs correctly. + + + + + + 0.50 + ε + 0.40 + + + 0.30 linear student + + + 0.20 + margin classifier + + 0.10 + + + 0.000123456 α + FIGURE 10 Learning curves for a linear student and for a FIGURE 11 Learning with a margin classifier and m300 + margin classifier. am/N. examples in an N150-dimensional space. + + + + + PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 771 262-A1677 7/24/01 11:12 AM Page 772 + + + + + + + + MANFRED OPPER + + + Hence, if in an actual classification experiment the number ber of consistent students is small; nevertheless, the few re- + of SVs is small compared to the number of non-SVs, we maining ones must still differ in a finite fraction of bits from + may expect a good generalization ability. each other and from the teacher so that perfect generaliza- + The learning curve for a margin classifier (Opper and tion is still impossible. For aslightly above a only the cou- c + Kinzel, 1995) learning from a perceptron teacher (calcu- plings of the teacher survive. + lated by the statistical physics approach) is shown in Fig. 10 + (blue curve). The concept of a margin classifier has recently ................................................ + been generalized to the so-called support vector machines ◗ + + Learning with Errors (Vapnik, 1995), for which the inputs of a perceptron are re- + placed by suitable features which are cleverly chosen non- + linear functions of the original inputs. In this way, nonlin- The example of the Ising perceptron teaches us that it will + ear separable rules can be learned, providing an interesting not always be simple to obtain zero training error. More- + alternative to multilayer networks. over, an algorithm trying to achieve this goal may get stuck + in local minima. Hence, the idea of allowing errors explic- + itly in the learning procedure, by introducing an appropri-................................................ ◗ ate noise, can make sense. An early analysis of such a sto- + The Ising Perceptron chastic training procedure and its generalization ability for + the learning in so-called Boolean networks (with elemen- + The approach of statistical physics can develop a specific tary computing units different from the ones used in neural + predictivepowerinsituationsinwhichonewouldliketoun- networks) can be found in Carnevali and Patarnello (1987). + derstand novel network models or architectures for which A stochastic algorithm can be useful to escape local min- + currently no efficient learning algorithm is known. As the ima of the training error, enabling a better learning of the + simplest example, we consider a perceptron for which the training set. 
The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and −1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression about the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures (T) we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples.
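The Metropolis-style procedure just described is simple to state in code. The sketch below is not from the original article; it trains an Ising perceptron with couplings in {−1, +1} on teacher-generated examples, always accepting a single-coupling flip that lowers the training error and accepting a flip that raises it by Δ with probability exp(−Δ/T). The system size, the temperature, and the number of proposed flips are arbitrary illustrative choices.

    # Hedged sketch: Metropolis training of a binary-coupling (Ising) perceptron.
    import numpy as np

    rng = np.random.default_rng(2)
    N, alpha, T = 101, 2.0, 0.5              # odd N avoids ties in the sign
    m = int(alpha * N)

    w_teacher = rng.choice([-1, 1], size=N)
    X = rng.choice([-1, 1], size=(m, N))
    y = np.sign(X @ w_teacher)

    def training_errors(w):
        return int(np.sum(np.sign(X @ w) != y))

    w = rng.choice([-1, 1], size=N)          # random initial student
    E = training_errors(w)
    for _ in range(20000):
        j = rng.integers(N)
        w[j] = -w[j]                         # propose flipping one coupling
        E_new = training_errors(w)
        if E_new <= E or rng.random() < np.exp(-(E_new - E) / T):
            E = E_new                        # accept the move
        else:
            w[j] = -w[j]                     # reject: undo the flip
    print("training errors:", E, "of", m, "  overlap with teacher:", float(w @ w_teacher) / N)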
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with ε = 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will converge always to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α (horizontal axis: ε; curves for α4 > α3 > α2 > α1).

More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

FIGURE 13 A two-layer network with tree architecture. The arrow indicates the direction of propagation of the information.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units—that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs—that is, a minus results from an odd number of negative hidden units and a plus from an even number.
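The two fixed output functions can be written down directly. The following small sketch (not from the original article) implements both machines for a tree architecture in which only the first-layer couplings would be learned; the number of hidden units and the block size are arbitrary illustrative values.

    # Hedged sketch: committee machine vs. parity machine with tree architecture.
    import numpy as np

    rng = np.random.default_rng(3)
    K, n_branch = 3, 50                       # K hidden units, disjoint input blocks
    W = rng.standard_normal((K, n_branch))    # first-layer couplings (the adaptive ones)

    def hidden_signs(x):
        blocks = x.reshape(K, n_branch)       # each hidden unit sees its own block
        return np.sign(np.sum(W * blocks, axis=1))

    def committee(x):
        return np.sign(np.sum(hidden_signs(x)))   # majority vote of the hidden units

    def parity(x):
        return np.prod(hidden_signs(x))           # parity of the hidden units

    x = rng.standard_normal(K * n_branch)
    print("committee output:", committee(x), "  parity output:", parity(x))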
For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction. Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter. In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

FIGURE 14 Learning curves for the parity machine with tree architecture. Each curve represents the generalization error ε as a function of α and is distinguished by the number of hidden units of the network.

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to −1 acts exactly as the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.
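The sign-reversal symmetry invoked above is easy to verify directly. The tiny check below (not from the original article; sizes are arbitrary) builds a two-hidden-unit parity machine with tree architecture and confirms that flipping the sign of all couplings of both hidden units leaves every output unchanged, so the "reversed" student classifies exactly like the teacher on any input.

    # Hedged check: a global sign flip of a two-hidden-unit parity machine changes nothing.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 40
    W = rng.standard_normal((2, n))           # couplings of the two hidden units

    def parity_output(W, X):
        # X has shape (n_samples, 2, n): each hidden unit sees its own input block
        h = np.sign(np.einsum("kj,skj->sk", W, X))
        return np.prod(h, axis=1)

    X = rng.standard_normal((1000, 2, n))
    print("identical outputs after sign reversal:",
          bool(np.all(parity_output(W, X) == parity_output(-W, X))))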
Outlook

The worst-case approach of the VC theory and the typical case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research.
Acknowledgments

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.

References Cited

AMARI, S., and MURATA, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
BARKAI, E., HANSEL, D., and KANTER, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
CARNEVALI, P., and PATARNELLO, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
ENGEL, A., and VAN DEN BROECK, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
GARDNER, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
GARDNER, E., and DERRIDA, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
GYÖRGYI, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
GYÖRGYI, G., and TISHBY, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
HANSEL, D., MATO, G., and MEUNIER, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
KINZEL, W., and RUJÀN, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
LEVIN, E., TISHBY, N., and SOLLA, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
MÉZARD, M., PARISI, G., and VIRASORO, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
MONASSON, R., and ZECCHINA, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
OPPER, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
OPPER, M., and HAUSSLER, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
OPPER, M., and KINZEL, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
SAAD, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
SCHWARZE, H., and HERTZ, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
SCHWARZE, H., and HERTZ, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
SEUNG, H. S., SOMPOLINSKY, H., and TISHBY, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
SEUNG, H. S., OPPER, M., and SOMPOLINSKY, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
SOMPOLINSKY, H., TISHBY, N., and SEUNG, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
URBANCZIK, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
VALLET, F., CAILTON, J., and REFREGIER, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
VAPNIK, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
VAPNIK, V. N., and CHERVONENKIS, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

ARBIB, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
HERTZ, J. A., KROGH, A., and PALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
WATKIN, T. L. H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.
\ No newline at end of file
diff --git a/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt b/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt
new file mode 100644
index 0000000..2be843a
Binary files /dev/null and b/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt differ
diff --git a/Corpus/MOGRIFIER LSTM.txt b/Corpus/MOGRIFIER LSTM.txt
new file mode 100644
index 0000000..c75f02e
Binary files /dev/null and b/Corpus/MOGRIFIER LSTM.txt differ
diff --git a/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt b/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt
new file mode 100644
index 0000000..5741d6c
--- /dev/null
+++ b/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt
@@ -0,0 +1,1145 @@

 Deep Learning for Visual Understanding: Part 2

 Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang

 Model Compression and Acceleration for Deep Neural Networks

 The principles, progress, and challenges

 In recent years, deep neural networks (DNNs) have received increased attention, have been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of graphics processing units (GPUs) with very high computation capability plays a key role in their success. For example, Krizhevsky et al.
[1] + achieved breakthrough results in the 2012 ImageNet Challenge + using a network containing 60 million parameters with five + convolutional layers and three fully connected layers. Usu- + ally, it takes two to three days to train the whole model on the + ImagetNet data set with an NVIDIA K40 machine. In another + example, the top face-verification results from the Labeled + Faces in the Wild (LFW) data set were obtained with networks + containing hundreds of millions of parameters, using a mix + of convolutional, locally connected, and fully connected layers + [2], [3]. It is also very time-consuming to train such a model + to obtain a reasonable performance. In architectures that only + rely on fully connected layers, the number of parameters can + grow to billions [4]. + + + Introduction + As larger neural networks with more layers and nodes are + considered, reducing their storage and computational cost + becomes critical, especially for some real-time applications ©Istockphoto.com/zapp2photo + such as online learning and incremental learning. In addition, + recent years witnessed significant progress in virtual real- + ity, augmented reality, and smart wearable devices, creating + unprecedented opportunities for researchers to tackle fun- + damental challenges in deploying deep-learning systems to + portable devices with limited resources [e.g., memory, central + processing units (CPUs), energy, bandwidth]. Efficient deep- + learning methods can have a significant impact on distributed + systems, embedded devices, and field-programmable gate ar- + ray (FPGA) for artificial intelligence (AI). For example, the + residual network-50 (ResNet-50) [5], which has 50 convolu- + tional layers, needs more than 95 megabytes of memory for Digital Object Identifier 10.1109/MSP.2017.2765695 + Date of publication: 9 January 2018 storage, and numerous floating number multiplications for + + + 126 IEEE SIgnal ProcESSIng MagazInE | January 2018 | 1053-5888/18©2018IEEE calculating each image. After discarding As larger neural networks volutional layers only. Low-rank factoriza- + some redundant weights, the network still with more layers and tion and transferred/compact filters-based + works as usual but saved more than 75% of nodes are considered, approaches provide an end-to-end pipeline + parameters and 50% computational time. reducing their storage and can be easily implemented in a CPU/ + For devices like cell phones and FPGAs GPU environment, which is straightfor- + with only several megabyte resources, how and computational ward, while parameter pruning and sharing + to compact the models used on them is cost becomes critical, use different methods such as vector quan- + also important. especially for some real- tization, binary coding, and sparse con- + Achieving these goals calls for joint time applications such straints to perform the task. Usually, it will + solutions from many disciplines, including as online learning and take several steps to achieve the goal. + but not limited to machine learning, opti- incremental learning. Regarding training protocols, models + mization, computer architecture, data com- based on parameter pruning/sharing low- + pression, indexing, and hardware design. 
rank factorization can be extracted from + In this article, we review recent works on compressing and pretrained ones or trained from scratch, while the transferred/ + accelerating DNNs, which attracted much attention from the compact filter and KD models can only support training from + deep-learning community and has already achieved signifi- scratch. These methods are independently designed and com- + cant progress in past years. plement each other. For example, transferred layers and pa- + We classify these approaches into four categories: rameter pruning and sharing can be used together, and model + 1) Parameter pruning and sharing: The parameter pruning quantization and binarization can be used together with low- + and sharing-based methods explore the redundancy in the rank approximations to achieve further speedup. We will de- + model parameters and try to remove the redundant and scribe the details of each theme and their properties, strengths, + noncritical ones. and drawbacks in the following sections. + 2) Low-rank factorization: Low-rank factorization-based + techniques use matrix/tensor decomposition to estimate the Parameter pruning and sharing + informative parameters of the deep convolutional neural An early work that showed that network pruning is effective in + networks (CNNs). reducing the network complexity and addressed the overfitting + 3) Transferred/compact convolutional filters: The trans- problem is [6]. Since then, it has been widely studied to compress + ferred/compact convolutional filters-based approaches DNN models, trying to remove parameters that are not crucial to + design special structural convolutional filters to reduce the the model performance. These techniques can be further classi- + storage and computation complexity. fied into three categories: model quantization and binarization, + 4) Knowledge distillation (KD): The KD methods learn a dis- parameter sharing, and structural matrix. + tilled model and train a more compact neural network to + reproduce the output of a larger network. Quantization and binarization + In Table 1, we briefly summarize these four types of meth- Network quantization compresses the original network by + ods. Generally, the parameter pruning and sharing, low-rank reducing the number of bits required to represent each weight. + factorization, and KD approaches can be used in DNNs with Gong et al. [6] and Wu et al. [7] applied k-means scalar quanti- + fully connected layers and convolutional layers, achieving zation to the parameter values. Vanhoucke et al. [8] showed that + comparable performances. On the other hand, methods using 8-bit quantization of the parameters can result in significant + transferred/compact filters are designed for models with con- speedup with minimal loss of accuracy. The work in [9] used + + + + + Table 1. A summary of different approaches for network compression. 
+ Theme Name Description Applications More Details + Parameter pruning and sharing Reducing redundant parameters that Convolutional layer and Robust to various settings, can achieve + are not sensitive to the performance fully connected layer good performance, can support both train- + ing from scratch and pretrained model + Low-rank factorization Using matrix/tensor decomposition to Convolutional layer and Standardized pipeline, easily implement- + estimate the informative parameters fully connected layer ed, can support both training from scratch + and pretrained model + Transferred/compact Designing special structural convolutional Only for convolutional layer Algorithms are dependent on applications, + convolutional filters filters to save parameters usually achieve good performance, only + support training from scratch + KD Training a compact neural network with Convolutional layer and Model performances are sensitive to + distilled knowledge of a large model fully connected layer applications and network structure, only + support training from scratch + + + IEEE SIgnal ProcESSIng MagazInE | January 2018 | 127 16-bit fixed-point representation in stochastic rounding-based er drawback of these binary nets is that existing binarization + CNN training, which significantly reduced memory usage and schemes are based on simple matrix approximations and ignore + float- point operations with little loss in classification accuracy. the effect of binarization on the accuracy loss. To address + The method proposed in [10] first pruned the unimportant con- this issue, the work in [17] proposed a proximal Newton algo- + nections and retrained the sparsely connected networks. Then it rithm with diagonal Hessian approximation that directly mini- + quantized the link weights using weight-sharing, and then applied mizes the loss with respect to the binary weights. The work in + Huffman coding to the quantized weights as [18] significantly reduced the time on float- + well as the codebook to further reduce the point multiplication in the training stage by + rate. As shown in Figure 1 , it starts by learn- Network pruning and stochastically binarizing weights and con- + ing the connectivity via normal network train- sharing has been used verting multiplications in the hidden state + ing, followed by pruning the small-weight both to reduce network computation to sign changes. + connections. Finally, the network is retrained complexity and to address to learn the final weights for the remaining the overfitting issue. Pruning and sharing + sparse connections. This work achieves the Network pruning and sharing has been used + state-of-the-art performance among all param- both to reduce network complexity and to + eter quantization-based methods. It was shown in [11] that Hes- address the overfitting issue. An early approach to pruning was + sian weight could be used to measure the importance of network biased weight decay [19]. The optimal brain damage [20] and + parameters and proposed to minimize Hessian-weighted quantiza- the optimal brain surgeon [21] methods reduced the number + tion errors on average for clustering network parameters. A novel of connections based on the Hessian of the loss function, and + quantization framework was introduced in [12], which reduced the their works suggested that such pruning gave higher accuracy + precision of network weights to ternary values. 
than magnitude-based pruning such as the weight decay meth- + In the extreme case of 1-bit representation of each weight, i.e., od. Those methods supported training from scratch. + binary weight neural networks, there are also many works that A recent trend in this direction is to prune redundant, non- + directly train CNNs with binary weights; for instance, Binary- informative weights in a pretrained CNN model. For example, + Connect [13], BinaryNet [14], and XNORNetworks [15]. The Srinivas and Babu [22] explored the redundancy among neurons + main idea is to directly learn binary weights or activations dur- and proposed a data-free pruning method to remove redundant + ing the model training. The systematic study in [16] showed that neurons. Han et al. [23] proposed to reduce the total number of + networks trained with backpropagation could be robust against parameters and operations in the entire network. Chen et al. [24] + (robust against or resilient to) specific weight distortions, includ- proposed a HashedNets model that used a low-cost hash function + ing binary weights. to group weights into hash buckets for parameter sharing. The + deep compression method in [10] removed the redundant connec- + Drawbacks tions and quantized the weights and then used Huffman coding + However, the accuracy of such binary nets is significantly low- to encode the quantized weights. In [25], a simple regularization + ered when dealing with large CNNs such as GoogleNet. Anoth- method based on soft weight-sharing was proposed, which + + + + + + + + Cluster the Weights + + Train ConnectivityOriginal Compressed + Network NetworkGenerate Codebook Encode Weights + + Prune Connections + Quantize the Weights + with Codebook Encode Index + Train Weights + + Retrain Codebook + + + + + Figure 1. The three-stage compression method proposed in [10]: pruning, quantization, and encoding. The input is the original model, and the output is + the compression model. + + + 128 IEEE SIgnal ProcESSIng MagazInE | January 2018 | included both quantization and pruning in one simple (re)train- Thus the memory cost becomes O()d instead of O()d2 . + ing procedure. It is worth noting that the aforementioned prun- This circulant structure also enables the use of fast Fou- + ing schemes typically produce connection pruning in CNNs. rier transform (FFT) to speed up the computation. Given a + There is also growing interest in training compact CNNs d-dimensional vector r, the 1-layer circulant neural network + with sparsity constraints. Those sparsity constraints are in (1) has time complexity of O()ddlog . + typically introduced in the optimization In [31], a novel adaptive fastfood trans- + problem as l0 or l1 -norm regularizers. CNNs are parameter-efficient form was introduced to reparameterize the + The work in [26] imposed group sparsity due to exploring the matrix-vector multiplication of fully con- + constraints on the convolutional filters to nected layers. The adaptive fastfood trans- + achieve structured brain damage, i.e., prun- translation invariant property form matrix RR! nd# was defined as + ing entries of the convolution kernels in a of the representations to + group-wise fashion. In [27], a group-sparse input image, which is the key RS= HGPHB. (2) + regularizer on neurons was introduced to the success of training during the training stage to learn compact very deep models without Here, SG,, and B are random diago- + CNNs with reduced filters. Wen et al. [28] nal matrices. P!{,01}dd# + severe overfitting. 
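The pruning-plus-quantization pipeline summarized above (and in Figure 1) can be illustrated in a few lines. The sketch below is not the authors' implementation: it simply prunes the smallest-magnitude weights of a toy layer and then clusters the survivors into a small shared codebook with 1-D k-means, in the spirit of the three-stage method of [10]. The layer size, the 80% pruning ratio, and the 16-entry codebook are arbitrary illustrative choices, and Huffman coding of the indices is omitted.

    # Hedged sketch: magnitude pruning followed by k-means scalar quantization of one layer.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 128)).astype(np.float32)   # a toy layer's weights

    # 1) magnitude pruning: drop the 80% smallest-magnitude weights
    threshold = np.quantile(np.abs(W), 0.8)
    mask = np.abs(W) >= threshold
    survivors = W[mask]

    # 2) k-means scalar quantization of the surviving weights (1-D Lloyd iterations)
    k = 16                                                   # a 4-bit codebook
    codebook = np.quantile(survivors, np.linspace(0, 1, k))  # simple initialization
    for _ in range(20):
        idx = np.argmin(np.abs(survivors[:, None] - codebook[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                codebook[c] = survivors[idx == c].mean()

    W_compressed = np.zeros_like(W)
    W_compressed[mask] = codebook[idx]                       # shared values + zeros
    print("nonzero weights:", int(mask.sum()), "of", W.size, "  codebook size:", k)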
is a random + added a structured sparsity regularizer on permutation matrix and H denotes the + each layer to reduce trivial filters, chan- Walsh–Hadamard matrix. Reparameteriz- + nels, or even layers. In filter-level pruning, all of the afore- ing a fully connected layer with d inputs and n outputs using + mentioned works used l21, -norm regularizers. The work in [29] the adaptive fastfood transform reduces the storage and the + used l1 -norm to select and prune unimportant filters. computational costs from O()nd to O()n and from O()nd to + O()ndlog , respectively. + Drawbacks The work in [32] showed the effectiveness of the new notion + There are some potential issues of the pruning and sharing of parsimony in the theory of structured matrices. Their pro- + works. First, pruning with l1 or l2 regularization requires posed method can be extended to various other structured matrix + more iterations to converge. Furthermore, all pruning criteria classes, including block and multilevel Toeplitz-like [33] matrices + require manual setup of sensitivity for layers, which demands related to multidimensional convolution [34]. + fine-tuning of the parameters and could be cumbersome for + some applications. Drawbacks + One potential problem of this kind of approach is that the struc- + Designing the structural matrix tural constraint will cause loss in accuracy since the constraint + In architectures that contain only fully connected layers, the might bring bias to the model. On the other hand, how to find a + number of parameters can grow up to billions [4]. Thus, it is proper structural matrix is difficult. There is no theoretical way + critical to explore this redundancy of parameters in fully con- from which to derive it. + nected layers, which is often the bottleneck in terms of memory + consumption. These network layers use the nonlinear transforms Low-rank factorization and sparsity + f(,xM)(=v Mx), where v ()o is an element-wise nonlinear As convolution operations constitute the bulk of all computations + operator, x is the input vector, and M is the mn# matrix of in CNNs, simplifying the convolution layer would have a direct + parameters. When M is a large general dense matrix, the cost impact on the overall speedup. The convolution kernels in a typi- + of storing mn parameters and computing matrix-vector products cal CNN is a four-dimensional tensor. The key observation is that + in Om()n time. Thus, an intuitive way to prune parameters is to there might be a significant amount of redundancy in the tensor. + impose x as a parameterized structural matrix. An mn# matrix Ideas based on tensor decomposition seem to be a particularly + that can be described using much fewer parameters than mn is promising way to remove the redundancy. Regarding to the fully + called a structured matrix. Typically, the structure should not connected layer, it can be viewed as a two-dimensional (2-D) + only reduce the memory cost but also dramatically accelerate the matrix and the low-rankness can also help. + inference and training stage via fast matrix-vector multiplication Using low-rank filters to accelerate convolution has a long + and gradient computations. history. Typical examples include high-dimensional discrete + Following this direction, the work in [30] proposed a sim- cosine transform (DCT) and wavelet systems constructed + ple and efficient approach based on circulant projections, from one-dimensional (1-D) DCT transform and 1-D wave- + while maintaining competitive error rates. 
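The circulant construction mentioned just above (and defined in the text that follows) can be checked with a short script. The sketch below is not the authors' code; it assumes NumPy and SciPy are available (SciPy is used only to build the dense reference matrix) and verifies that multiplication by a circulant matrix built from a single d-vector r can be carried out with FFTs in O(d log d) time and O(d) memory.

    # Hedged sketch: circulant matrix-vector product via FFT vs. the dense O(d^2) product.
    import numpy as np
    from scipy.linalg import circulant   # only used to build the dense reference

    rng = np.random.default_rng(0)
    d = 512
    r = rng.standard_normal(d)
    x = rng.standard_normal(d)

    def circulant_matvec(r, x):
        # multiplication by circ(r) is a circular convolution, i.e. a product in Fourier space
        return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

    dense = circulant(r) @ x                 # O(d^2) reference
    fast = circulant_matvec(r, x)            # O(d log d)
    print("max difference:", float(np.max(np.abs(dense - fast))))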
Given a vector lets, respectively, using tensor products. In the context of + r=(,rr 01 ,,frd-1 ), a circulant matrix RR! dd# is defined as dictionary learning, Rigamonti et al. [35] suggested learning + separable 1-D filters. In [36], a few low-rank approximation Rr0 rd 1 g r VS - 2 r1 W and clustering schemes for the convolutional kernels were + Sr1 r0 rd 1 r W proposed. They achieved 2# speedup for a single convolu- + Rr (circ ): S - 2 + ==r WS h 1 r0 j h W. (1) tional layer with 1% drop in classification accuracy. The + Srd-2 j jrd-1 W work in [37] suggested using different tensor decomposition Sr WTd-1 rd-2 g r1 r0 X schemes, reporting a 45.# speedup with 1% drop in accuracy + + + IEEE SIgnal ProcESSIng MagazInE | January 2018 | 129 case. For the scheme in [39], the decom- + position always exists and can achieve + better performance than general CP. + Table 2 lists a performance comparison + of both methods. The actual speedup + and compression rates are used to mea- + sure the performances. We can see that + the BN version can achieve slightly bet- + ter performance while the CP version + gives higher compression rates. Original Framework Low-Rank Note that the fully connected layers Factorization Framework can be viewed as a 2-D matrix and thus + (a) (b) the aforementioned methods can also + be applied there. There are several clas- + sical works on exploiting low-rankness Figure 2. A typical framework of the low-rank regularization method. (a) is theoriginal convolutional + layer, and (b) is the low-rank constraint convolutional layer with rank-K. in fully connected layers. For instance, + Misha et al. [40] reduced the number + of dynamic parameters in deep models + in text recognition. In both works, the approximation was using the low-rank method. Reference [41] explored a low-rank + done layer by layer. After one layer was approximated by matrix factorization of the final weight layer in a DNN for + the low-rank filters, the parameters of that layer were fixed, acoustic modeling. + and the layers above were fine-tuned based on a reconstruc- + tion error criterion. These are typical low-rank methods for Drawbacks + compressing 2-D convolutional layers, which is described in Low-rank approaches are straightforward for model compres- + Figure 2. In [38], canonical polyadic (CP) decomposition of sion and acceleration. The idea complements recent advances + the kernel tensors was proposed. Their work used nonlinear in deep learning such as dropout, rectified units, and maxout. + least squares to compute the CP decomposition, which was However, the implementation is not that easy since it involves + also based on the tensor decomposition idea. In [39], a new a decomposition operation, which is computationally expen- + algorithm for computing the low-rank tensor decomposition sive. Another issue is that current methods perform low-rank + and a new method for training low-rank constrained CNNs approximation layer by layer, and thus cannot perform global + from scratch were proposed. It used batch normalization (BN) parameter compression, which is important as different lay- + to transform the activations of the internal hidden units, and it ers hold different information. Finally, factorization requires + was shown to be an effective way to deal with the exploding extensive model retraining to achieve convergence when com- + or vanishing gradients. pared to the original model. 
+ In principle, both the CP decomposition scheme and the + decomposition scheme in [39] (BN low-rank) can be used to Transferred/compact convolutional filters + train CNNs from scratch. For the CP decomposition, finding CNNs are parameter-efficient due to exploring the transla- + the best low-rank approximation is an ill-posed problem, and tion invariant property of the representations to input image, + the best rank-K approximation may not exist in the general which is the key to the success of training very deep models + without severe overfitting. Although a strong theory is cur- + rently missing, a large amount of empirical evidence sup- + ports the notion that both the translation invariant property Table 2. Comparisons between the low-rank models and their baselines + on ILSVRC-2012. and convolutional weight-sharing are important for good + predictive performance. The idea of using transferred con-Model TOP-5 Accuracy Speedup Compression Rate volutional filters to compress CNN models is motivated by + AlexNet 80.03% 1 1 recent works in [42], which introduced the equivariant group + BN low-rank 80.56% 1.09 4.94 theory. Let x be an input, U()$ be a network or layer, and + T()$ be the transform matrix. The concept of equivariance CP low-rank 79.66% 1.82 5 is defined as VGG-16 90.60% 1 1 + BN low-rank 90.47% 1.53 2.72 TTlUU ^^ xx hh = , (3) + CP low-rank 90.31% 2.05 2.75 + GoogleNet 92.21% 1 1 which says that transforming the input x by the transform + T()$ and then passing it through the network or layer U(·) BN low-rank 91.88% 1.08 2.79 should give the same result as first mapping x through the CP low-rank 91.79% 1.20 2.84 network and then transforming the representation. Note that, + + + 130 IEEE SIgnal ProcESSIng MagazInE | January 2018 | in [42], the transforms T()$ and Tl()$ are not necessarily where Tx(·,,y) denoted the translation of the first oper- + the same as they operate on different objects. According to and by (,xy) along its spatial dimensions, with proper zero + this theory, it is reasonable to apply the transform to layers padding at borders to maintain the shape. The proposed + or filters U()$ to compress the whole network models. From framework can be used to 1) improve the classification accu- + empirical observation, deep CNNs also benefit from using a racy as a regularized version of maxout networks and 2) + large set of convolutional filters by applying a certain trans- to achieve parameter efficiency by flexibly varying their + form T()$ to a small set of base filters since it acts as a regu- architectures to compress networks. + larizer for the model. Table 3 briefly compares the performance of different + Following this trend, there are many recent works proposed methods with transferred convolutional filters, using VGG- + to build a convolutional layer from a set of base filters [42]– Net (16 layers) as the baseline model. The results are report- + [45]. What they have in common is that the transform T()$ ed on the CIFAR-10 and CIFAR-100 data sets with top-five + lies in the family of functions that only operate in the spatial error rates. It is observed that they can achieve reduction in + domain of the convolutional filters. For parameters with little or no drop in clas- + example, the work in [44] found that the sification accuracy. 
+ lower convolution layers of CNNs learned The basic idea of KD is to + redundant filters to extract both positive and distill knowledge from a Drawbacks + negative phase information of an input sig- large teacher model into There are several issues that need to be + nal, and defined T()$ to be the simple nega- a small one by learning addressed for approaches that apply transfer + tion function the class distributions information to convolutional filters. First, + output by the teacher these methods can achieve competitive per- + T^h WW x = -x . (4) formance for wide/flat architectures (like via softened softmax. VGGNet) but not narrow/special ones (like + Here, Wx is the basis convolutional filter GoogleNet and ResNet). Second, the trans- + and W-x is the filter consisting of the shifts whose activation is fer assumptions sometimes are too strong to guide the algo- + opposite to that of Wx and selected after max-pooling opera- rithm, making the results unstable on some data sets. + tion. By doing this, the work in [44] can easily achieve 2# com- Using a compact filter for convolution can directly reduce + pression rate on all the convolutional layers. It is also shown that the computation cost. The key idea is to replace the loose and + the negation transform acts as a strong regularizer to improve overparametric filters with compact blocks to improve the + the classification accuracy. The intuition is that the learning speed, which significantly accelerate CNNs on several bench- + algorithm with pair-wise positive-negative constraint can lead marks. Decomposing 33# convolution into two 11# con- + to useful convolutional filters instead of redundant ones. volutions was used in [47], which achieved state-of-the-art + In [45], it was observed that magnitudes of the responses acceleration performance on object recognition. SqueezeNet + from convolutional kernels had a wide diversity of pattern rep- [48] was proposed to replace 33# convolution with 11# + resentations in the network, and it was not proper to discard convolution, which created a compact neural network with + weaker signals with a single threshold. Thus, a multibias non- approximately 50 fewer parameters and comparable accuracy + linearity activation function was proposed to generate more when compared to AlexNet. + patterns in the feature space at low computational cost. The + transform T()$ was define as KD + To the best of our knowledge, exploiting knowledge transfer to + TlU^h xW=+ x d , (5) compress model was first proposed by Caruana et al. [49]. They + trained a compressed model with pseudo-data labeled by an + where d were the multibias factors. The work in [46] consid- ensemble of strong classifiers and reproduced the output of the + ered a combination of rotation by a multiple of 90° and hori- original larger network. However, their work is limited to shal- + zontal/vertical flipping with low models. The idea has been recently adopted in [50] as KD + to compress deep and wide networks into shallower ones, where + TlU^h xW= Ti , (6) + Table 3. Comparisons of different approaches based on transferred where WTi was the transformation matrix that rotated the orig- convolutional filters on CIFAR-10 and CIFAR-100. + inal filters with angle i !{90,,}180270. In [42], the transform Model CIFAR-100 CIFAR-10 Compression Rate was generalized to any angle learned from data, and i was + directly obtained from data. Both [46] and [42] can achieve VGG-16 34.26% 9.85% 1 + good classification performance. 
MBA [45] 33.66% 9.76% 2 + Reference [43] defined T()$ as the set of translation func- CRELU [44] 34.57% 9.92% 2 + tions applied to 2-D filters CIRC [42] 35.15% 10.23% 4 + T lU^^ xhh =Tx·,,y , (7) DCNN [43] 33.57% 9.65% 1.62 xy,,!" -kkf,, ,^ xy,( h !00,) + + + IEEE SIgnal ProcESSIng MagazInE | January 2018 | 131 the compressed model mimicked the function learned by the Other types of approaches + complex model. The basic idea of KD is to distill knowledge We first summarize the works utilizing attention-based + from a large teacher model into a small one by learning the methods. Note that attention-based systems [57] can reduce + class distributions output by the teacher via softened softmax. computations significantly by learning to selectively focus or + The work in [51] introduced a KD compression framework, “attend to” a few, task-relevant input regions. The work in [57] + which eased the training of deep networks by following a student- introduced the dynamic capacity network that combined two + teacher paradigm, in which the student was penalized according types of modules: the small subnetworks with low capacity, and + to a softened version of the teacher’s output. The framework the large ones with high capacity. The low-capacity subnetworks + compressed an ensemble of deep networks (teacher) into a stu- were active on the whole input to first find the task-relevant areas + dent network of similar depth. To do so, the student was trained in the input, and then the attention mechanism was used to di- + to predict the output of the teacher, as well as the true classifica- rect the high-capacity subnetworks to focus on the task-relevant + tion labels. Despite its simplicity, KD demonstrates promising regions in the input. By doing this, the size of the CNN model + results in various image classification tasks. The work in [52] could be significantly reduced. + aimed to address the network compression Following this direction, the work in + problem by taking advantage of depth neural The standard criteria [58] introduced the conditional computation + networks. It proposed an approach to train to measure the quality idea, which only computes the gradient for + thin and deep networks, called FitNets, to of model compression some important neurons. It proposed a new + compress wide and shallower (but still deep) and acceleration are the type of general-purpose neural network com- + networks. The method was rooted in KD and ponent: a sparsely gated mixture-of-experts + extended the idea to allow for thinner and compression and the (MoE) layer. The MoE consisted of a number + deeper student models. To learn from the speedup rates. of experts, each a simple feed-forward neural + intermediate representations of the teacher network, and a trainable gating network that + network, FitNet made the student mimic the full feature maps of selected a sparse combination of the experts to process each input. + the teacher. However, such assumptions are too strict since the In [59], dynamic DNNs (D2NNs) were introduced, which were a + capacities of teacher and student may differ greatly. In certain type of feed-forward DNN that selected and executed a subset of + circumstances, FitNet may adversely affect the performance and D2NN neurons based on the input. + convergence. 
All the aforementioned methods are validated on There have been other attempts to reduce the number of + the MNIST, CIFAR-10, CIFAR-100, SVHN, and AFLW bench- parameters of neural networks by replacing the fully con- + mark data sets, and simulation results show that these methods nected layer with global average pooling [43], [60]. Network + match or outperform the teacher’s performance, while requiring architectures, such as GoogleNet or network in network, + notably fewer parameters and multiplications. can achieve state-of-the-art results on several benchmarks + There are several extensions along this direction of distilla- by adopting this idea. However, transfer learning, i.e., reus- + tion knowledge. The work in [53] trained a parametric student ing features learned on the ImageNet data set and applying + model to approximate a Monte Carlo teacher. The proposed them to new tasks, is more difficult with this approach. This + framework used online training and used DNNs for the student problem was noted by Szegedy et al. [60] and motivated + model. Different from previous works, which represented the them to add a linear layer on top of their networks to enable + knowledge using the softened label probabilities, [54] repre- transfer learning. + sented the knowledge by using the neurons in the higher hidden The work in [61] targeted the ResNet-based model with a + layer, which preserved as much information as the label prob- spatially varying computation time, called stochastic depth, + abilities, but are more compact. The work in [55] accelerated which enabled the seemingly contradictory setup to train short + the experimentation process by instantaneously transferring networks and used deep networks at test time. It started with + the knowledge from a previous network to each new deeper very deep networks and, while during training, for each mini- + or wider network. The techniques are based on the concept batch, randomly dropped a subset of layers and bypassed them + of function-preserving transformations between neural net- with the identity function. This model is end-to-end trainable, + work specifications. Zagoruyko et al. [56] proposed attention deterministic, and can be viewed as a black-box feature extrac- + transfer to relax the assumption of FitNet. They transferred the tor. Following this direction, the work in [62] proposed a pyra- + attention maps that are summaries of the full activations. midal residual network with stochastic depth. + Other approaches to reduce the convolutional overheads + Drawbacks include using FFT-based convolutions [63] and fast convolution + KD-based approaches can make deeper models thinner and using the Winograd algorithm [64]. Those works only aim to + help significantly reduce the computational cost. However, speedup the computation but not reduce the memory storage. + there are a few disadvantages. One of them is that KD can only + be applied to classification tasks with softmax loss function, Benchmarks, evaluation, and databases + which hinders its usage. Another drawback is that the model In the past five years, the deep-learning community has made + assumptions sometimes are too strict to make the performance great efforts in benchmark models. One of the most well- + competitive with other types of approaches. 
known models used in compression and acceleration for CNNs + + + 132 IEEE SIgnal ProcESSIng MagazInE | January 2018 | is Alexnet [1], which occasionally has been Proposing some general/ about how to choose different compression + used for assessing the performance of com- unified approaches is approaches and possible challenges/solu- + pression. Other popular standard models one direction that can tions in this area. + include LeNets [65], All-CNN-nets [66], be taken regarding and many others. LeNet-300-100 is a fully General suggestions + connected network with two hidden layers, the use of CNNs in There is no golden rule to measure which one + with 300 and 100 neurons each. LeNet-5 is small platforms. of the four kinds of approaches is the best. How + a convolutional network that has two convo- to choose the proper approaches is really de- + lutional layers and two fully connected layers. Recently, more pendent on the applications and requirements. Here, we provide + state-of-the-art architectures are used as baseline models in some general suggestions. + many works, including network in networks [67], VGGNets ■ If the applications needs compacted models from pretrained + [68], and ResNets [69]. Table 4 summarizes the baseline mod- models, one can choose either pruning and sharing or low- + els commonly used in several typical compression methods. rank factorization-based methods. If end-to-end solutions + The standard criteria to measure the quality of model com- are needed for the problem, the low-rank and transferred + pression and acceleration are the compression and the speedup convolutional filters approaches are preferred. + rates. Assume that a is the number of the parameters in the ■ For applications in some specific domains, methods with + original model M and a* is that of the compressed model M* , human prior (like the transferred convolutional filters and + then the compression rate a (,MM * ) of M* over M is structural matrix) sometimes have benefits. For example, + when conducting medical images classification, transferred + MM,.aa ^h * = (8)a convolutional filters should work well as medical images * (like organs) do have the rotation transformation property. + Another widely used measurement is the index space saving ■ Usually, the approaches of pruning and sharing could give + defined in several papers [70], [71] as a reasonable compression rate while not hurting the accu- + racy. Thus, for applications that require stable model accu- + MM,,aa b * = -^h * (9)a racy, it is better to utilize pruning and sharing. * + ■ If a problem involves small- or medium-size data sets, one + where a and a are the number of the dimension of the index can try the KD approaches. The compressed student model + space in the original model and that of the compressed can take the benefit of transferring knowledge from the + model, respectively. teacher model, making it a robust data set that is not large. + Similarly, given the running time s of M and s* of M*, the ■ As we mentioned in the “Introduction,” techniques of the + speedup rate d (,MM * ) is defined as four themes are orthogonal. It makes sense to combine two + or three of them to maximize the compression/speedup + MM,.sd ^h * =s (10) rates. 
Discussion and challenges
In this article, we summarized recent works on compressing and accelerating DNNs. Here, we discuss more details about how to choose different compression approaches and possible challenges/solutions in this area.

General suggestions
There is no golden rule to measure which one of the four kinds of approaches is the best. How to choose the proper approach really depends on the applications and requirements. Here, we provide some general suggestions.
■ If the application needs compact models derived from pretrained models, one can choose either pruning and sharing or low-rank factorization-based methods. If end-to-end solutions are needed for the problem, the low-rank and transferred convolutional filters approaches are preferred.
■ For applications in some specific domains, methods with human prior (like the transferred convolutional filters and the structural matrix) sometimes have benefits. For example, when conducting medical image classification, transferred convolutional filters should work well, as medical images (like organs) do have the rotation transformation property.
■ Usually, the approaches of pruning and sharing can give a reasonable compression rate while not hurting the accuracy. Thus, for applications that require stable model accuracy, it is better to utilize pruning and sharing.
■ If a problem involves small- or medium-size data sets, one can try the KD approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust on data sets that are not large.
■ As we mentioned in the "Introduction," techniques of the four themes are orthogonal. It makes sense to combine two or three of them to maximize the compression/speedup rates. For some specific applications, like object detection, which require both convolutional and fully connected layers, one can compress the convolutional layers with low-rank factorization and the fully connected layers with a pruning method.

Technique challenges
Techniques for deep model compression and acceleration are still in the early stages, and the following challenges still need to be addressed.
■ Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyperparameters). To handle more complicated tasks, more plausible ways to configure the compressed models are needed.
■ Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging, because removing channels might dramatically change the input of the following layer. It is important to focus on how to address this issue.
■ As we mentioned previously, methods of the structural matrix and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of the imposed prior knowledge.
■ The methods of KD provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worth developing KD-based approaches and exploring how to improve their performance.
■ Hardware constraints in various small platforms (e.g., mobile, robotic, self-driving cars) are still a major problem that hinders the extension of deep CNNs. How to make full use of the limited computational resources available and how to design special compression methods for such platforms are still challenges that need to be addressed.
Possible solutions
To solve the hyperparameters configuration problem, we can rely on the recent learning-to-learn strategy [72], [73]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. There are two different ways to combine the learning-to-learn module with model compression. The first designs compression and learning-to-learn simultaneously, while the second first configures the model with learning-to-learn and then prunes the parameters.
Channel pruning provides the efficiency benefit on both CPUs and GPUs because no special implementation is required. But it is also challenging to handle the input configuration. One possible solution is to use the training-based channel pruning methods [74], which focus on imposing sparse constraints on weights during training and can adaptively determine hyperparameters. However, training from scratch for such a method is costly for very deep CNNs.
Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the KD approaches. Instead of directly reducing and transferring parameters from the teacher models, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task. The intuition is that, if a neuron is activated in certain regions or samples, this implies these regions or samples share some common properties that may relate to the task. Performing such steps is time-consuming, so efficient implementation is important.
For methods with convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2-D filters or the matrix, and 2) learn the transformation jointly with all of the model parameters.
Proposing some general/unified approaches is one direction that can be taken regarding the use of CNNs in small platforms. Wang et al. [75] presented a feature map dimensionality reduction method by excavating and removing redundancy in feature maps generated by different filters, which could also preserve the intrinsic information of the original network. The idea can be extended to make CNNs more applicable to different platforms. The work in [76] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work on mobile devices. From the systematic side, Facebook released the platform Caffe2 [77], which employs a particularly lightweight and modular framework and includes mobile-specific optimizations based on the hardware design. Caffe2 can help developers and researchers train large machine-learning models and deliver AI on mobile devices.

Acknowledgments
We would like to thank the reviewers and the broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying this article. This research is supported by the National Science Foundation of China, grant number 61401169. The corresponding author of this article is Pan Zhou.

Authors
Yu Cheng (chengyu@us.ibm.com) received his bachelor's degree in automation from Tsinghua University, Beijing, China, in 2010 and his Ph.D. degree in computer science from Northwestern University, Evanston, Illinois, in 2015. Currently, he is a research staff member at the AI Foundations Lab, IBM T.J. Watson Research Center, Yorktown Heights, New York. His research is focused on deep learning in general, with specific interests in deep generative models and deep model compression. He has also published many works regarding the applications of deep learning in computer vision and natural language processing.
Duo Wang (d-wang15@mails.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree in the Department of Automation, Tsinghua University. His research interests are deep/machine learning and their applications in computer vision and robotics vision.
Pan Zhou (panzhou@hust.edu.cn) received his B.S. degree in the Advanced Class of Huazhong University of Science and Technology (HUST), Wuhan, China, and his M.S. degree in electronics and information engineering from the same university in 2006 and 2008, respectively. He received his Ph.D. degree from the School of Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta, in 2011. Currently, he is an associate professor with the School of Electronic Information and Communications, HUST. His research interests include big data analytics and machine learning, security and privacy, and information networks.
Tao Zhang (taozhang@mail.tsinghua.edu.cn) received his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and his Ph.D. degree from Saga University, Japan, in 2002, all in control engineering. He is a professor with the Department of Automation, Tsinghua University. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.
References
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2014, pp. 1701–1708.
[3] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015, pp. 2892–2900.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1223–1231.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Computing Res. Repository, vol. abs/1512.03385, 2015. [Online]. Available: https://arxiv.org/pdf/1512.03385.pdf
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," Computing Res. Repository, vol. abs/1412.6115, 2014. [Online]. Available: https://arxiv.org/pdf/1412.6115.pdf
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4820–4828.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Conf. Neural Information Processing Systems Deep Learning and Unsupervised Feature Learning Workshop, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. 32nd Int. Conf. Machine Learning, 2015, vol. 37, pp. 1737–1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. Int. Conf. Learning Representations, 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," Computing Res. Repository, vol. abs/1612.01543, 2016. [Online]. Available: https://arxiv.org/abs/1612.01543
[12] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv Preprint, arXiv:1612.01064, 2016.
[13] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Proc. Advances Neural Information Processing Systems Annu. Conf., 2015, pp. 3123–3131.
[14] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," Computing Res. Repository, vol. abs/1602.02830, 2016. [Online]. Available: https://arxiv.org/abs/1602.02830
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in Proc. European Conf. Computer Vision, 2016, pp. 525–542.
[16] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," Computing Res. Repository, vol. abs/1606.01981, 2016. [Online]. Available: https://arxiv.org/abs/1606.01981
[17] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," Computing Res. Repository, vol. abs/1611.01600, 2016. [Online]. Available: https://arxiv.org/abs/1611.01600
[18] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," Computing Res. Repository, vol. abs/1510.03009, 2015. [Online]. Available: https://arxiv.org/abs/1510.03009
[19] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in Advances in Neural Information Processing Systems 1, 1989, pp. 177–185.
[20] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 598–605.
[21] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, vol. 5. San Mateo, CA: Morgan Kaufmann, 1993, pp. 164–171.
[22] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," in Proc. British Machine Vision Conf., 2015, pp. 31.1–31.12.
[23] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proc. 28th Int. Conf. Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Machine Learning Research Workshop Conf., 2015, pp. 2285–2294.
[25] K. Ullrich, E. Meeds, and M. Welling, "Soft weight-sharing for neural network compression," Computing Res. Repository, vol. abs/1702.04008, 2017. [Online]. Available: https://arxiv.org/abs/1702.04008
[26] V. Lebedev and V. S. Lempitsky, "Fast convnets using group-wise brain damage," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2554–2564.
[27] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in Proc. European Conf. Computer Vision, 2016, pp. 662–677.
[28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," Adv. Neural Inform. Process. Syst., vol. 29, pp. 2074–2082, 2016.
[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," Computing Res. Repository, vol. abs/1608.08710, 2016. [Online]. Available: https://arxiv.org/abs/1608.08710
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. Int. Conf. Computer Vision, 2015, pp. 2857–2865.
[31] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, "Deep fried convnets," in Proc. Int. Conf. Computer Vision, 2015, pp. 1476–1483.
[32] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Advances in Neural Information Processing Systems, vol. 28, 2015, pp. 3088–3096. [Online]. Available: http://papers.nips.cc/paper/5869-structured-transforms-for-small-footprint-deep-learning.pdf
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-Block, and Toeplitz-Derived Matrices. Berlin, Germany: Springer, 1991, pp. 215–236.
[34] M. V. Rakhuba and I. V. Oseledets, "Fast multidimensional convolution in low-rank tensor formats via cross approximation," SIAM J. Sci. Comput., vol. 37, no. 2, 2015. [Online]. Available: http://dx.doi.org/10.1137/140958529
[35] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2013, pp. 2754–2761.
[36] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," Adv. Neural Inform. Process. Syst., vol. 27, pp. 1269–1277, 2014.
[37] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proc. British Machine Vision Conf., 2014, pp. 1–13.
[38] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," Computing Res. Repository, vol. abs/1412.6553, 2014. [Online]. Available: https://arxiv.org/abs/1412.6553
[39] C. Tai, T. Xiao, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," Computing Res. Repository, vol. abs/1511.06067, 2015.
[40] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems, vol. 26, 2013, pp. 2148–2156. [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1053.pdf
[41] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, 2013, pp. 6655–6659.
[42] T. S. Cohen and M. Welling, "Group equivariant convolutional networks," arXiv Preprint, arXiv:1602.07576, 2016.
[43] S. Zhai, Y. Cheng, and Z. M. Zhang, "Doubly convolutional neural networks," in Proc. Advances Neural Information Processing Systems, 2016, pp. 1082–1090.
[44] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv Preprint, arXiv:1603.05201, 2016.
[45] H. Li, W. Ouyang, and X. Wang, "Multi-bias non-linear activation in deep neural networks," arXiv Preprint, arXiv:1604.00676, 2016.
[46] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, "Exploiting cyclic symmetry in convolutional neural networks," in Proc. 33rd Int. Conf. Machine Learning, 2016, vol. 48, pp. 1889–1898.
[47] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," Computing Res. Repository, vol. abs/1602.07261, 2016. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1602.html#SzegedyIV16
[48] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," Computing Res. Repository, vol. abs/1612.01051, 2016. [Online]. Available: https://arxiv.org/abs/1612.01051
[49] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, 2006, pp. 535–541. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150464
[50] J. Ba and R. Caruana, "Do deep nets really need to be deep?" Adv. Neural Inform. Process. Syst., vol. 27, pp. 2654–2662, 2014.
[51] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Computing Res. Repository, vol. abs/1503.02531, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," Computing Res. Repository, vol. abs/1412.6550, 2014. [Online]. Available: https://arxiv.org/abs/1412.6550
[53] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," in Advances in Neural Information Processing Systems, vol. 28, 2015, pp. 3420–3428. [Online]. Available: http://papers.nips.cc/paper/5965-bayesian-dark-knowledge.pdf
[54] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, "Face model compression by distilling knowledge from neurons," in Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 3560–3566.
[55] T. Chen, I. J. Goodfellow, and J. Shlens, "Net2net: Accelerating learning via knowledge transfer," Computing Res. Repository, vol. abs/1511.05641, 2015. [Online]. Available: https://arxiv.org/abs/1511.05641
[56] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," Computing Res. Repository, vol. abs/1612.03928, 2016. [Online]. Available: http://arxiv.org/abs/1612.03928
[57] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, "Dynamic capacity networks," in Proc. 33rd Int. Conf. Machine Learning, 2016, pp. 2549–2558.
[58] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," 2017. [Online]. Available: https://openreview.net/pdf?id=B1ckMDqlg
[59] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, 2016.
[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
[61] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," Computing Res. Repository, vol. abs/1603.09382, 2016.
[62] Y. Yamada, M. Iwamura, and K. Kise, "Deep pyramidal residual networks with separated stochastic depth," Computing Res. Repository, vol. abs/1612.01230, 2016. [Online]. Available: http://arxiv.org/abs/1612.01230
[63] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," Computing Res. Repository, vol. abs/1312.5851, 2014.
[64] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4013–4021.
[65] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, pp. 2278–2324, 1998.
[66] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, "Striving for simplicity: The all convolutional net," Computing Res. Repository, vol. abs/1412.6806, 2014. [Online]. Available: https://arxiv.org/abs/1412.6806
[67] M. Lin, Q. Chen, and S. Yan, "Network in network," in Proc. Int. Conf. Learning Representations, 2014. [Online]. Available: https://arxiv.org/abs/1312.4400
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Computing Res. Repository, vol. abs/1409.1556, 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[69] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv Preprint, arXiv:1512.03385, 2015.
[70] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2857–2865.
[71] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "ACDC: A structured efficient linear layer," in Proc. Int. Conf. Learning Representations, 2016.
[72] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Proc. Neural Information Processing Systems Conf., 2016, pp. 3981–3989.
[73] D. Ha, A. Dai, and Q. Le, "Hypernetworks," in Proc. Int. Conf. Learning Representations, 2016.
[74] J. M. Alvarez and M. Salzmann, "Learning the number of neurons in deep networks," in Proc. Neural Information Processing Systems Conf., 2016, pp. 2270–2278.
[75] Y. Wang, C. Xu, C. Xu, and D. Tao, "Beyond filters: Compact feature map for portable deep model," in Proc. 34th Int. Conf. Machine Learning, 2017, pp. 3703–3711.
[76] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," Computing Res. Repository, vol. abs/1511.06530, 2015. [Online]. Available: https://arxiv.org/abs/1511.06530
[77] Facebook, Inc., "Caffe2: A new lightweight, modular, and scalable deep learning framework," 2016. [Online]. Available: https://caffe2.ai/
\ No newline at end of file
diff --git a/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt b/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt
new file mode 100644
index 0000000..47f9152
--- /dev/null
+++ b/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt
@@ -0,0 +1,662 @@
Movement Pruning:
Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu

arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields.
In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training the models has high environmental costs [Strubell et al., 2019].
Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost of accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective method for compressing models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016], in language processing [Gale et al., 2019], and more recently has been leveraged as a core component in the lottery ticket hypothesis [Frankle et al., 2019].
While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."
In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and high values can be pruned if they shrink during training. This strategy moves the selection criteria from the 0th to the 1st order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing greater ability to adapt to the end-task.

2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods for using parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied for convolutional networks.
Differing from our method, these methods keep the weights of the model fixed (either from a randomly initialized network or a pretrained network) and the scores are updated to find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the square value of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.
Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates at the cost of no or little performance. As shown in previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.

3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let W \in \mathbb{R}^{n \times n} refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores S \in \mathbb{R}^{n \times n}. Given importance scores, each pruning strategy computes a mask M \in \{0, 1\}^{n \times n}. Inference for an input x becomes a = (W \odot M) x, where \odot is the Hadamard product. A common strategy is to keep the top-v percent of weights by importance. We define Top_v as a function which selects the v% highest values in S:

    Top_v(S)_{i,j} = 1 if S_{i,j} is among the top v% of values in S, and 0 otherwise.    (1)

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores S = (|W_{i,j}|)_{1 \le i,j \le n} and masks M = Top_v(S) (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.
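To make the notation above concrete, here is a small Python/PyTorch sketch of the Top_v mask of Eq. (1) combined with magnitude scores S = |W|; this is our own illustration of the setup, not the authors' code, and ties at the threshold may keep slightly more than v% of the weights.

```python
import torch

def top_v_mask(scores: torch.Tensor, v: float) -> torch.Tensor:
    """Binary mask keeping the top v% of entries of `scores` (Eq. 1)."""
    k = max(1, int(round(v / 100.0 * scores.numel())))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).to(scores.dtype)

def magnitude_prune(weight: torch.Tensor, v: float) -> torch.Tensor:
    """Magnitude pruning: importance scores are |W|; the weight matrix is masked element-wise."""
    mask = top_v_mask(weight.abs(), v)
    return weight * mask

W = torch.randn(4, 4)
print(magnitude_prune(W, v=25.0))  # keeps roughly the 4 largest-magnitude entries
```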
Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of f for L0 regularization is detailed in Eq (3). (Columns: Magnitude pruning | L0 regularization | Movement pruning | Soft movement pruning.)
Pruning Decision:    0th order | 1st order | 1st order | 1st order
Masking Function:    Top_v | Continuous Hard-Concrete | Top_v | Thresholding
Pruning Structure:   Local or Global | Global | Local or Global | Global
Learning Objective:  L | L + \lambda_{l0} E(L_0) | L | L + \lambda_{mvp} R(S)
Gradient Form:       n/a | Gumbel-Softmax | Straight-Through | Straight-Through
Scores S:            |W_{i,j}| | -\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)} f(\bar{S}_{i,j}^{(t)}) | -\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)} | -\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)}

In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level v during training using a cubic sparsity scheduler:

    v^{(t)} = v_f + (v_i - v_f) \left(1 - \frac{t - t_i}{n \Delta t}\right)^3.

The sparsity level at time step t, v^{(t)}, is increased from an initial value v_i (usually 0) to a final value v_f in n pruning steps after t_i steps of warm-up. The model is thus pruned and trained jointly.

4 Movement Pruning

Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running model. In this work, we focus on movement pruning methods where importance is derived from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process. We consider two versions of movement pruning: hard and soft.
For (hard) movement pruning, masks are computed using the Top_v function: M = Top_v(S). Unlike magnitude pruning, during training we learn both the weights W and their importance scores S. During the forward pass, we compute for all i, a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k.
Since the gradient of Top_v is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, Top_v is ignored and the gradient goes "straight through" to S. The approximation of the gradient of the loss L with respect to S_{i,j} is given by

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j    (2)

This implies that the scores of weights are updated, even if these weights are masked in the forward pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter v with a fixed global threshold value \tau that controls the binary mask. The mask is calculated as M = (S > \tau). In order to control the sparsity level, we add a regularization term R(S) = \lambda_{mvp} \sum_{i,j} \sigma(S_{i,j}), which encourages the importance scores to decrease over time.^1 The coefficient \lambda_{mvp} controls the penalty intensity and thus the sparsity level.
Finally, we note that these approaches yield a similar update to L0 regularization based pruning, another movement-based pruning approach [Louizos et al., 2017].
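Before turning to the details of the L0 relaxation, the following sketch shows one way the hard movement-pruning forward/backward pass described above could be written with a straight-through estimator in PyTorch; the class and parameter names are our own, and details such as local vs. global Top_v, the sparsity scheduler, and score initialization are omitted.

```python
import torch
import torch.nn.functional as F

class TopVStraightThrough(torch.autograd.Function):
    """Forward: binarize scores with Top_v. Backward: pass the gradient straight through to S."""
    @staticmethod
    def forward(ctx, scores: torch.Tensor, v: float) -> torch.Tensor:
        k = max(1, int(round(v / 100.0 * scores.numel())))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: Top_v is ignored in the backward pass (Eq. 2).
        return grad_output, None

class MovementPrunedLinear(torch.nn.Module):
    """Linear layer with learned weights W and importance scores S, masked by Top_v(S)."""
    def __init__(self, in_features: int, out_features: int, v: float = 10.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.scores = torch.nn.Parameter(torch.zeros(out_features, in_features))
        self.v = v  # percentage of weights kept

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = TopVStraightThrough.apply(self.scores, self.v)
        # a = (W * M) x; with the STE, dL/dS_{i,j} = (dL/da_i) * W_{i,j} * x_j as in Eq. (2).
        return F.linear(x, self.weight * mask)
```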
Instead of straight-through, L0 uses the hard-concrete distribution, where the mask M is sampled for all i, j with hyperparameters b > 0, l < 0, and r > 1:

    u \sim U(0, 1)
    \bar{S}_{i,j} = \sigma\big( (\log(u) - \log(1 - u) + S_{i,j}) / b \big)
    Z_{i,j} = (r - l) \bar{S}_{i,j} + l
    M_{i,j} = \min(1, \mathrm{ReLU}(Z_{i,j}))

The expected L0 norm has a closed form involving the parameters of the hard-concrete: E(L_0) = \sum_{i,j} \sigma\big(S_{i,j} - b \log(-l/r)\big). Thus, the weights and scores of the model can be optimized in an end-to-end fashion to minimize the sum of the training loss L and the expected L0 penalty. A coefficient \lambda_{l0} controls the L0 penalty and indirectly the sparsity level. Gradients take a similar form:

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j f(\bar{S}_{i,j}), \quad \text{where } f(\bar{S}_{i,j}) = \frac{r - l}{b} \bar{S}_{i,j} (1 - \bar{S}_{i,j}) \mathbf{1}_{\{0 \le Z_{i,j} \le 1\}}    (3)

At test time, a non-stochastic estimation of the mask is used: \hat{M} = \min\big(1, \mathrm{ReLU}\big((r - l)\sigma(S) + l\big)\big), and weights multiplied by 0 can simply be discarded.
Table 1 highlights the characteristics of each pruning method. The main differences are in the masking functions, pruning structure, and the final gradient form.

^1 We also experimented with \sum_{i,j} |S_{i,j}|, but it turned out to be harder to tune while giving similar results.

[Figure 1: (a) Magnitude pruning. (b) Movement pruning. During fine-tuning (on MNLI), the weights stay close to their pre-trained values, which limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning selects weights that are moving away from 0.]

Method Interpretation. In movement pruning, the gradient of L with respect to W_{i,j} is given by the standard gradient derivation: \partial L / \partial W_{i,j} = (\partial L / \partial a_i) M_{i,j} x_j. Combining it with Eq (2), we have \partial L / \partial S_{i,j} = (\partial L / \partial W_{i,j}) W_{i,j} (we omit the binary mask term M_{i,j} for simplicity). From the gradient update in Eq (2), S_{i,j} is increasing when \partial L / \partial S_{i,j} < 0, which happens in two cases:
(a) \partial L / \partial W_{i,j} < 0 and W_{i,j} > 0
(b) \partial L / \partial W_{i,j} > 0 and W_{i,j} < 0
It means that during training W_{i,j} is increasing while being positive, or is decreasing while being negative. It is equivalent to saying that S_{i,j} is increasing when W_{i,j} is moving away from 0. Inversely, S_{i,j} is decreasing when \partial L / \partial S_{i,j} > 0, which means that W_{i,j} is shrinking towards 0.
While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 (|W_{i,j}|), movement pruning selects the weights which are moving the most away from 0 (S_{i,j}). For this reason, magnitude pruning can be seen as a 0th-order method, whereas movement pruning is based on a 1st-order signal. In fact, S can be seen as an accumulator of movement: from equation (2), after T gradient updates, we have

    S_{i,j}^{(T)} = -\alpha_S \sum_{t < T} \left(\frac{\partial L}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}    (4)