More corpus documents

This commit is contained in:
Eduardo Cueto Mendoza 2020-08-06 14:53:44 -06:00
parent f30a0b2be3
commit 514f272a6d
47 changed files with 12133 additions and 0 deletions


@ -0,0 +1,555 @@
IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 1
A Survey of Model Compression and Acceleration
for Deep Neural Networks
Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
(Yu Cheng is a Researcher at Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA. Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China. Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.)

arXiv:1710.09282v7 [cs.LG] 7 Feb 2019

Abstract—Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey the recently developed techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, after which the other techniques are introduced. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then discuss a few recent additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions on this topic.

Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration

I. INTRODUCTION

In recent years, deep neural networks have received lots of attention, been applied to many different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. It usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. Another example is that the top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model to reach reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for real-time applications such as online learning and incremental learning. In addition, recent years have witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle the fundamental challenges in deploying deep learning systems to portable devices with limited resources (e.g., memory, CPU, energy, bandwidth). Efficient deep learning methods can have a significant impact on distributed systems, embedded devices, and FPGAs for artificial intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95 MB of memory for storage and over 3.8 billion floating-point multiplications to process a single image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computation time. For devices like cell phones and FPGAs with only a few megabytes of resources, how to compact the models used on them is also important.

Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent work on compressing and accelerating deep neural networks, which has attracted a lot of attention from the deep learning community and has already made substantial progress in the past years.

We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of deep CNNs. The approaches based on transferred/compact convolutional filters design special structural convolutional filters to reduce the parameter space and save storage/computation. The knowledge distillation methods learn a distilled model, training a more compact neural network to reproduce the output of a larger network.
TABLE I
SUMMARIZATION OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION.

Parameter pruning and sharing
  Description: Reducing redundant parameters which are not sensitive to the performance
  Applications: Convolutional layer and fully connected layer
  More details: Robust to various settings, can achieve good performance, can support both train from scratch and pre-trained model

Low-rank factorization
  Description: Using matrix/tensor decomposition to estimate the informative parameters
  Applications: Convolutional layer and fully connected layer
  More details: Standardized pipeline, easily implemented, can support both train from scratch and pre-trained model

Transferred/compact convolutional filters
  Description: Designing special structural convolutional filters to save parameters
  Applications: Convolutional layer only
  More details: Algorithms are dependent on applications, usually achieve good performance, only support train from scratch

Knowledge distillation
  Description: Training a compact neural network with distilled knowledge of a large model
  Applications: Convolutional layer and fully connected layer
  More details: Model performances are sensitive to applications and network structure, only support train from scratch
In Table I, we briefly summarize these four types of methods. Generally, the parameter pruning & sharing, low-rank factorization, and knowledge distillation approaches can be used in DNN models with both fully connected and convolutional layers, achieving comparable performance. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filter based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment. Parameter pruning & sharing, in contrast, uses different methods such as vector quantization, binary coding, and sparse constraints to perform the task, and it generally takes several steps to achieve the goal.

Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained models or trained from scratch, while the transferred/compact filter and knowledge distillation models can only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We describe the details of each theme, along with their properties, strengths, and drawbacks, in the following sections.

Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compressed model.

II. PARAMETER PRUNING AND SHARING

Early works showed that network pruning is effective in reducing network complexity and addressing the over-fitting problem [6]. Although pruning was originally introduced to reduce the structure of neural networks and hence improve generalization, it has since been widely studied as a way to compress DNN models by removing parameters that are not crucial to the model performance. These techniques can be further classified into three sub-categories: quantization and binarization, parameter sharing, and structural matrices.

A. Quantization and Binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.

The method proposed in [10] quantized the link weights using weight sharing and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved state-of-the-art performance among all parameter quantization based methods. It was shown in [11] that the Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize the average Hessian-weighted quantization error when clustering network parameters.
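To make the weight-sharing idea above concrete, the following is a minimal sketch (not the exact procedure of [6] or [10]) of k-means scalar quantization of a weight matrix: the weights are clustered into 2^b centroids, and each weight is then stored as a b-bit index into the shared codebook. The function names and the NumPy-only implementation are illustrative assumptions.

```python
import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    """Cluster weights into 2**bits shared values (a codebook) and
    return (codebook, indices); a toy stand-in for weight sharing."""
    flat = weights.ravel()
    k = 2 ** bits
    # Initialize centroids spread over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Recompute each centroid as the mean of its assigned weights.
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids, idx.reshape(weights.shape)

# Example: quantize a random 256x512 "fully connected" weight matrix.
W = np.random.randn(256, 512).astype(np.float32)
codebook, idx = kmeans_quantize(W, bits=4)
W_quantized = codebook[idx]                        # reconstructed shared weights
print("unique values:", np.unique(W_quantized).size)  # at most 16
```

Storage drops from 32 bits per weight to b bits per index plus a tiny codebook; Huffman-coding the index stream, as in [10], compresses it further.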
In the extreme case of a 1-bit representation of each weight, i.e., binary weight neural networks, there are many works that directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during model training. The systematic study in [15] showed that networks trained with back-propagation could be resilient to specific weight distortions, including binary weights.
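As a rough illustration of the binary-weight idea shared by these methods (a simplified sketch, not the published training procedures of [12]-[14]), one keeps full-precision weights for the gradient update but uses only their sign, scaled by the mean absolute value, in the forward pass. The helper names below are hypothetical.

```python
import numpy as np

def binarize(weights):
    """Map weights to {-alpha, +alpha}, where alpha is the mean absolute
    value (an XNOR-Net-style scaling); a simplified sketch."""
    alpha = np.abs(weights).mean()
    return alpha * np.sign(weights)

def forward(x, real_weights):
    """Forward pass with binarized weights; in training, the full-precision
    copy would still be the one updated by the gradient step."""
    w_b = binarize(real_weights)
    return x @ w_b

x = np.random.randn(8, 128)
W = np.random.randn(128, 64) * 0.05
y = forward(x, W)
print(y.shape)   # (8, 64), computed with 1-bit (plus scale) weights
```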
Drawbacks: the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.
To address this issue, the work in [16] proposed a proximal Newton algorithm with a diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [17] reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting the multiplications in the hidden state computation to significant changes.

B. Pruning and Sharing

Network pruning and sharing has been used both to reduce network complexity and to address the over-fitting issue. An early approach to pruning was Biased Weight Decay [18]. The Optimal Brain Damage [19] and Optimal Brain Surgeon [20] methods reduced the number of connections based on the Hessian of the loss function, and their work suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. The training procedure of those methods followed the train-from-scratch manner.

A recent trend in this direction is to prune redundant, non-informative weights in a pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among neurons and proposed a data-free pruning method to remove redundant neurons. Han et al. [22] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [23] proposed the HashedNets model, which used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [24], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re-)training procedure. The above pruning schemes typically produce connection pruning in CNNs.

There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are typically introduced into the optimization problem as l0- or l1-norm regularizers. The work in [25] imposed group sparsity constraints on the convolutional filters to achieve structured Brain Damage, i.e., pruning entries of the convolution kernels in a group-wise fashion. In [26], a group-sparse regularizer on neurons was introduced during the training stage to learn compact CNNs with reduced filters. Wen et al. [27] added a structured sparsity regularizer on each layer to reduce trivial filters, channels, or even layers. For filter-level pruning, all the above works used l2,1-norm regularizers. The work in [28] used the l1-norm to select and prune unimportant filters.

Drawbacks: there are some potential issues with pruning and sharing. First, pruning with l1 or l2 regularization requires more iterations to converge than general training. In addition, all pruning criteria require manual setup of sensitivity for each layer, which demands fine-tuning of the parameters and could be cumbersome for some applications.
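The connection-pruning schemes above share a simple core: zero out weights whose magnitude falls below a threshold chosen from a target sparsity, then fine-tune the surviving weights under the resulting mask. The sketch below illustrates that core idea only (it is not the specific criterion of [22] or of the regularized variants); the function names are ours.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Return a 0/1 mask that keeps the largest-magnitude weights so
    that roughly `sparsity` of the entries are removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) > threshold).astype(weights.dtype)

def apply_mask(weights, mask):
    """Pruned layer: masked weights stay exactly zero; during fine-tuning
    the same mask would be re-applied after each update."""
    return weights * mask

W = np.random.randn(512, 512).astype(np.float32)
mask = magnitude_prune(W, sparsity=0.9)
W_pruned = apply_mask(W, mask)
print("kept fraction:", mask.mean())   # ~0.1
```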
C. Designing Structural Matrices

In architectures that contain fully-connected layers, it is critical to explore the redundancy of parameters in those layers, which are often the bottleneck in terms of memory consumption. These network layers use the nonlinear transform f(x, M) = σ(Mx), where σ(·) is an element-wise nonlinear operator, x is the input vector, and M is the m×n matrix of parameters [29]. When M is a large general dense matrix, the cost is that of storing mn parameters and computing matrix-vector products in O(mn) time. Thus, an intuitive way to prune parameters is to impose M as a parameterized structural matrix. An m×n matrix that can be described using far fewer than mn parameters is called a structured matrix. Typically, the structure should not only reduce the memory cost, but also dramatically accelerate the inference and training stages via fast matrix-vector multiplication and gradient computations.

Following this direction, the work in [30], [31] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector r = (r_0, r_1, ..., r_{d-1}), a circulant matrix R in R^{d×d} is defined as:

R = \mathrm{circ}(r) := \begin{bmatrix}
r_0 & r_{d-1} & \cdots & r_2 & r_1 \\
r_1 & r_0 & r_{d-1} & & r_2 \\
\vdots & r_1 & r_0 & \ddots & \vdots \\
r_{d-2} & & \ddots & \ddots & r_{d-1} \\
r_{d-1} & r_{d-2} & \cdots & r_1 & r_0
\end{bmatrix},    (1)

so the memory cost becomes O(d) instead of O(d^2). This circulant structure also enables the use of the Fast Fourier Transform (FFT) to speed up the computation. Given a d-dimensional vector r, the 1-layer circulant neural network in Eq. (1) has time complexity O(d log d).

In [32], a novel Adaptive Fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform matrix R in R^{n×d} was defined as:

R = S H G Π H B,    (2)

where S, G and B are random diagonal matrices, Π ∈ {0,1}^{d×d} is a random permutation matrix, and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the Adaptive Fastfood transform reduces the storage and the computational costs from O(nd) to O(n) and from O(nd) to O(n log d), respectively.

The work in [29] showed the effectiveness of this new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multi-level Toeplitz-like matrices [33] related to multi-dimensional convolution [34]. Following this idea, [35] proposed a general structured efficient linear layer for CNNs.

Drawbacks: one problem of this kind of approach is that the structural constraint can hurt the performance, since the constraint may bring bias into the model. On the other hand, finding a proper structural matrix is difficult: there is no theoretical way to derive one.
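To see why the circulant structure in Eq. (1) gives O(d log d) products, note that multiplying circ(r) by a vector is a circular convolution, which the FFT diagonalizes. Below is a small NumPy check of this identity (our own illustration, not code from [30], [31]).

```python
import numpy as np

def circulant(r):
    """Dense circulant matrix with first column r, matching Eq. (1):
    R[i, j] = r[(i - j) mod d]. Built only to verify the fast version."""
    d = len(r)
    i = np.arange(d)
    return r[(i[:, None] - i[None, :]) % d]

def circulant_matvec_fft(r, x):
    """Compute circ(r) @ x in O(d log d): a circulant matvec is the
    circular convolution of r and x, done in the Fourier domain."""
    return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

d = 1024
r = np.random.randn(d)
x = np.random.randn(d)
dense = circulant(r) @ x            # O(d^2) reference
fast = circulant_matvec_fft(r, x)   # O(d log d)
print(np.allclose(dense, fast))     # True
```

In a circulant fully connected layer, r would be the trainable parameter vector, so storage drops from O(d^2) to O(d) and the matvec from O(d^2) to O(d log d).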
Fig. 2. A typical framework of the low-rank regularization method. The left is the original convolutional layer and the right is the low-rank constrained convolutional layer with rank K.

TABLE II
COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES ON ILSVRC-2012.

Model          TOP-5 Accuracy   Speed-up   Compression Rate
AlexNet        80.03%           1.         1.
  BN Low-rank  80.56%           1.09       4.94
  CP Low-rank  79.66%           1.82       5.
VGG-16         90.60%           1.         1.
  BN Low-rank  90.47%           1.53       2.72
  CP Low-rank  90.31%           2.05       2.75
GoogleNet      92.21%           1.         1.
  BN Low-rank  91.88%           1.08       2.79
  CP Low-rank  91.79%           1.20       2.84
III. LOW-RANK FACTORIZATION AND SPARSITY

Convolution operations contribute the bulk of most computations in deep CNNs, so reducing the convolutional layers would improve the compression rate as well as the overall speedup. A convolution kernel can be viewed as a 4D tensor. Ideas based on tensor decomposition derive from the intuition that there is a significant amount of redundancy in the 4D tensor, and removing it is a particularly promising direction. Regarding the fully-connected layers, they can be viewed as 2D matrices, and low-rankness can also help there.

Low-rank filters have long been used to accelerate convolution; for example, high-dimensional DCT (discrete cosine transform) and wavelet systems were constructed from 1D DCT transforms and 1D wavelets, respectively, using tensor products. Learning separable 1D filters was introduced by Rigamonti et al. [36], following the dictionary learning idea. For some simple DNN models, a few low-rank approximation and clustering schemes for the convolutional kernels were proposed in [37]; they achieved a 2x speedup for a single convolutional layer with a 1% drop in classification accuracy. The work in [38] proposed using different tensor decomposition schemes, reporting a 4.5x speedup with a 1% drop in accuracy in text recognition.

The low-rank approximation was done layer by layer: the parameters of one layer were fixed after it was decomposed, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2D convolutional layers, as described in Figure 2. Following this direction, Canonical Polyadic (CP) decomposition of the kernel tensors was proposed in [39]; their work used nonlinear least squares to compute the CP decomposition. In [40], a new algorithm for computing the low-rank tensor decomposition, used to train low-rank constrained CNNs from scratch, was proposed. It used Batch Normalization (BN) to transform the activations of the internal hidden units. In general, both the CP and the BN decomposition schemes in [40] (BN Low-rank) can be used to train CNNs from scratch. However, there are a few differences between them. For example, finding the best low-rank approximation in CP decomposition is an ill-posed problem, and the best rank-K approximation (K is the rank) may not exist, while for the BN scheme the decomposition always exists. We perform a simple comparison of both methods in Table II, using the actual speedup and compression rates to measure their performance.

As we mentioned before, the fully connected layers can be viewed as 2D matrices, so the above-mentioned methods can also be applied there. There are several classical works on exploiting low-rankness in fully connected layers. For instance, Denil et al. [41] reduced the number of dynamic parameters in deep models using the low-rank method. The work in [42] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling. In [3], Lu et al. adopted truncated SVD (singular value decomposition) to decompose the fully connected layer for designing compact multi-task deep learning architectures.

Drawbacks: low-rank approaches are straightforward for model compression and acceleration, and the idea complements recent advances in deep learning such as dropout, rectified units and maxout. However, the implementation is not that easy, since it involves decomposition operations that are computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when compared to the original model.
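As an illustration of the truncated-SVD idea used for fully connected layers (a generic sketch, not the exact procedure of [3] or [42]), a weight matrix W of size m x n is replaced by two factors of rank k, so storage and multiply cost drop from mn to k(m + n) when k is small. The names and numbers below are illustrative only.

```python
import numpy as np

def truncated_svd_factorize(W, k):
    """Factor W (m x n) into A (m x k) and B (k x n) keeping the top-k
    singular values; the layer y = W x becomes y = A (B x)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]      # absorb singular values into A
    B = Vt[:k, :]
    return A, B

m, n, k = 1024, 4096, 64
# Pre-trained FC weights are often approximately low-rank; simulate that here.
W = (np.random.randn(m, k) @ np.random.randn(k, n)
     + 0.01 * np.random.randn(m, n)).astype(np.float32)
A, B = truncated_svd_factorize(W, k)

x = np.random.randn(n).astype(np.float32)
y_full = W @ x
y_lowrank = A @ (B @ x)       # two thin matvecs instead of one big one

print("compression:", (m * n) / (k * (m + n)))               # 12.8x here
print("relative error:",
      np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
```

In practice the two factors are fine-tuned after the decomposition, layer by layer, as described above.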
IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS

CNNs are parameter-efficient because they exploit the translation-invariant property of the representations with respect to the input image, which is key to training very deep models without severe over-fitting. Although a strong theory is currently missing, a large amount of empirical evidence supports the notion that both the translation-invariant property and convolutional weight sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent work in [43], which introduced the equivariant group theory. Let x be an input, Φ(·) be a network or layer, and T(·) be the transform matrix. The concept of equivariance is defined as:

T'Φ(x) = Φ(Tx),    (3)

indicating that transforming the input x by the transform T(·) and then passing it through the network or layer Φ(·) should give the same result as first mapping x through the network and then transforming the representation. Note that in Eq. (3) the transforms T(·) and T'(·) are not necessarily the same, as they operate on different objects. According to this theory, it is reasonable to apply transforms to the layers or filters Φ(·) to compress the whole network model. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters obtained by applying a certain transform T(·) to a small set of base filters, since the transform acts as a regularizer for the model.

Following this direction, many recent works propose to build a convolutional layer from a set of base filters [43]-[46]. What they have in common is that the transform T(·) lies in the family of functions that only operate in the spatial domain of the convolutional filters. For example, the work in [45] found that the lower convolutional layers of CNNs learn redundant filters to extract both positive and negative phase information of an input signal, and defined T(·) to be the simple negation function:

T(W_x) = W_x^-,    (4)

where W_x is the basis convolutional filter and W_x^- is the filter consisting of the shifts whose activation is opposite to that of W_x, selected after the max-pooling operation. In this way, the work in [45] can easily achieve a 2x compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer that improves the classification accuracy. The intuition is that a learning algorithm with a pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones.

In [46], it was observed that the magnitudes of the responses from convolutional kernels have a wide diversity of pattern representations in the network, and that it is not proper to discard weaker signals with a single threshold. Thus a multi-bias non-linearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform T(·) was defined as:

T_δ(x) = W_x + δ,    (5)

where δ are the multi-bias factors. The work in [47] considered a combination of rotation by a multiple of 90° and horizontal/vertical flipping, with:

T_θ(x) = W^{T_θ},    (6)

where W^{T_θ} is the transformation matrix that rotates the original filters by an angle θ ∈ {90°, 180°, 270°}. In [43], the transform was generalized to an arbitrary angle learned from data, and θ was directly obtained from the data. Both works [47] and [43] achieve good classification performance.

The work in [44] defined T(·) as the set of translation functions applied to 2D filters:

T'(x) = {T(·, x, y)}, x, y ∈ {-k, ..., k}, (x, y) ≠ (0, 0),    (7)

where T(·, x, y) denotes translating the first operand by (x, y) along its spatial dimensions, with proper zero padding at the borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy, acting as a regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly varying the architecture to compress networks.

Table III briefly compares the performance of different methods with transferred convolutional filters, using VGGNet (16 layers) as the baseline model. Results are reported on the CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is observed that these methods can achieve a reduction in parameters with little or no drop in classification accuracy.

TABLE III
A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND CIFAR-100.

Model        CIFAR-100   CIFAR-10   Compression Rate
VGG-16       34.26%      9.85%      1.
MBA [46]     33.66%      9.76%      2.
CRELU [45]   34.57%      9.92%      2.
CIRC [43]    35.15%      10.23%     4.
DCNN [44]    33.57%      9.65%      1.62

Drawbacks: a few issues remain to be addressed for approaches that apply transform constraints to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not for thin/deep ones (like GoogleNet or ResNet). Secondly, the transfer assumptions are sometimes too strong to guide the learning, making the results unstable in some cases.

Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace loose and over-parametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing a 3x3 convolution into two 1x1 convolutions was used in [48], which achieved significant acceleration on object recognition. SqueezeNet [49] was proposed to replace 3x3 convolutions with 1x1 convolutions, creating a compact neural network with about 50x fewer parameters and comparable accuracy when compared to AlexNet.
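A minimal sketch of the transferred-filter idea (our own illustration, not the specific constructions of [43]-[47]): a small set of base filters is expanded into a larger bank by applying cheap spatial transforms such as negation, flips, and 90° rotations, so only the base filters need to be stored and learned.

```python
import numpy as np

def expand_filter_bank(base_filters):
    """base_filters: array of shape (num_base, k, k).
    Returns a larger bank built by applying parameter-free transforms
    (negation as in Eq. (4), 90-degree rotations, horizontal flip)."""
    transforms = [
        lambda w: w,               # identity
        lambda w: -w,              # negation transform
        lambda w: np.rot90(w, 1),  # rotate 90 degrees
        lambda w: np.rot90(w, 2),  # rotate 180 degrees
        lambda w: np.rot90(w, 3),  # rotate 270 degrees
        lambda w: w[:, ::-1],      # horizontal flip
    ]
    bank = [t(w) for w in base_filters for t in transforms]
    return np.stack(bank)

base = np.random.randn(8, 3, 3)        # 8 learned base filters
bank = expand_filter_bank(base)        # 48 filters, still only 8*9 parameters
print(base.shape, "->", bank.shape)    # (8, 3, 3) -> (48, 3, 3)
```

Only the eight base filters are trainable; the other forty are tied to them, which is where the compression comes from.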
V. KNOWLEDGE DISTILLATION

To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first proposed by Caruana et al. [50]. They trained a compressed/ensemble model of strong classifiers with pseudo-labeled data and reproduced the output of the original larger network; however, the work is limited to shallow models. The idea was recently adopted in [51] as knowledge distillation (KD), which compresses deep and wide networks into shallower ones, where the compressed model mimics the function learned by the complex model. The main idea of KD based approaches is to shift knowledge from a large teacher model into a small one by learning the class distributions output via softmax.

The work in [52] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm in which the student is penalized according to a softened version of the teacher's output. The framework compresses an ensemble of teacher networks into a student network of similar depth, and the student is trained to predict both the teacher's output and the true classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [53] aimed to address the network compression problem by taking advantage of deep neural networks. It proposed an approach to train thin but deep networks, called FitNets, to compress wide and shallower (but still deep) networks, extending the idea to allow for thinner and deeper student models. In order to learn from the intermediate representations of the teacher network, FitNet makes the student mimic the full feature maps of the teacher.
However, such assumptions are too strict, since the capacities of the teacher and the student may differ greatly.

All the above approaches are validated on the MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets, and experimental results show that these methods match or outperform the teacher's performance while requiring notably fewer parameters and multiplications.

There are several extensions along this direction of distilling knowledge. The work in [54] trained a parametric student model to approximate a Monte Carlo teacher; the proposed framework used online training and deep neural networks for the student model. Different from previous works, which represented the knowledge using softened label probabilities, [55] represented the knowledge by using the neurons in the higher hidden layers, which preserve as much information as the label probabilities but are more compact. The work in [56] accelerated the experimentation process by instantaneously transferring knowledge from a previous network to each new, deeper or wider network; the techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [57] proposed Attention Transfer (AT) to relax the assumptions of FitNet: they transferred attention maps, which are summaries of the full activations.

Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One is that KD can only be applied to classification tasks with a softmax loss function, which hinders its usage. Another drawback is that the model assumptions are sometimes too strict to make the performance competitive with other types of approaches.
VI. OTHER TYPES OF APPROACHES

We first summarize works utilizing attention-based methods. Attention-based mechanisms [58] can reduce computation significantly by learning to selectively focus or "attend" to a few task-relevant input regions. The work in [59] introduced the dynamic capacity network (DCN), which combined two types of modules: small sub-networks with low capacity and large ones with high capacity. The low-capacity sub-networks were active on the whole input to first find the task-relevant areas, and the attention mechanism was then used to direct the high-capacity sub-networks to focus on those task-relevant regions. In this way, the size of the CNN model is significantly reduced.

Following this direction, the work in [60] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a sparsely-gated mixture-of-experts layer (MoE). The MoE module consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of the experts to process each input. In [61], dynamic deep neural networks (D2NN) were introduced, a type of feed-forward deep neural network that selects and executes a subset of D2NN neurons based on the input.

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [44], [62]. Network architectures such as GoogleNet or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this idea. However, these architectures have not fully optimized the utilization of the computing resources inside the network. This problem was noted by Szegedy et al. [62] and motivated them to increase the depth and width of the network while keeping the computational budget constant.

The work in [63] targeted Residual Network based models with a spatially varying computation time, called stochastic depth, which enables the seemingly contradictory setup of training short networks while using deep networks at test time. It starts with very deep networks and, during training, randomly drops a subset of layers for each mini-batch and bypasses them with the identity function. Following this direction, the work in [64] proposed pyramidal residual networks with stochastic depth. In [65], Wu et al. proposed an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation. Veit et al. exploited convolutional networks with adaptive inference graphs to adaptively define their network topology conditioned on the input image [66].

Other approaches to reduce the convolutional overhead include using FFT based convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the pooling operations with a more general stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks, termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. These works only aim to speed up the computation, not to reduce the memory storage.
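A minimal sketch of the stochastic depth idea [63] described above (our own simplified illustration, not the published training recipe): during training, each residual block is kept with some survival probability and otherwise bypassed by the identity; at test time every block runs, with its residual branch scaled by that probability.

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual branch: one linear map followed by a ReLU."""
    return np.maximum(weight @ x, 0.0)

def stochastic_depth_forward(x, weights, survival_prob=0.8,
                             training=True, seed=0):
    """Stack of residual blocks with stochastic depth.
    Training: each block is skipped (identity only) with prob 1 - p.
    Inference: all blocks run, residuals scaled by p."""
    rng = np.random.default_rng(seed)
    for w in weights:
        if training:
            if rng.random() < survival_prob:
                x = x + residual_block(x, w)
            # else: block bypassed, x passes through unchanged
        else:
            x = x + survival_prob * residual_block(x, w)
    return x

dim, depth = 64, 10
weights = [np.random.randn(dim, dim) * 0.01 for _ in range(depth)]
x = np.random.randn(dim)
print(stochastic_depth_forward(x, weights, training=True).shape)   # (64,)
```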
VII. BENCHMARKS, EVALUATION AND DATABASES

In the past five years the deep learning community has made great efforts on benchmark models. One of the most well-known models used in compression and acceleration for CNNs is AlexNet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network with two hidden layers of 300 and 100 neurons, respectively. LeNet-5 is a convolutional network with two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures are used as baseline models, including network in networks (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly used in several typical compression methods.

TABLE IV
SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.

Baseline Model            Representative Works
AlexNet [1]               structural matrix [29], [30], [32]; low-rank factorization [40]
Network in network [73]   low-rank factorization [40]
VGG nets [74]             transferred filters [44]; low-rank factorization [40]
Residual networks [75]    compact filters [49]; stochastic depth [63]; parameter sharing [24]
All-CNN-nets [72]         transferred filters [45]
LeNets [71]               parameter sharing [24]; parameter pruning [20], [22]

The standard criteria to measure the quality of model compression and acceleration are the compression rate and the speedup rate. Assume that a is the number of parameters in the original model M and a* is that of the compressed model M*; then the compression rate α(M, M*) of M* over M is

α(M, M*) = a / a*.    (8)

Another widely used measurement is the index space saving, defined in several papers [30], [35] as

β(M, M*) = (a - a*) / a*,    (9)

where a and a* are the number of dimensions of the index space in the original model and that of the compressed model, respectively.

Similarly, given the running time s of M and s* of M*, the speedup rate δ(M, M*) is defined as:

δ(M, M*) = s / s*.    (10)

Most work used the average training time per epoch to measure the running time, while in [30], [35] the average testing time was used. Generally, the compression rate and speedup rate are highly correlated, as smaller models often result in faster computation at both the training and the testing stages.

Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computation time. However, for different applications with different CNN designs, the relation between parameter size and computation time may differ. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters lie in the fully connected layers, while for image classification tasks, floating-point operations are mainly in the first few convolutional layers, since each filter is convolved with the whole image, which is usually very large at the beginning. Thus compression and acceleration of the network should focus on different types of layers for different applications.
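The three criteria in Eqs. (8)-(10) are simple ratios; the helper below just restates them in code for concreteness (function names and example numbers are ours and purely illustrative).

```python
def compression_rate(num_params_original, num_params_compressed):
    """Eq. (8): alpha = a / a*."""
    return num_params_original / num_params_compressed

def index_space_saving(num_params_original, num_params_compressed):
    """Eq. (9): beta = (a - a*) / a*."""
    return (num_params_original - num_params_compressed) / num_params_compressed

def speedup_rate(runtime_original, runtime_compressed):
    """Eq. (10): delta = s / s*."""
    return runtime_original / runtime_compressed

# Illustrative numbers only: a model pruned from 61M to 6.7M parameters,
# with inference time reduced from 20 ms to 8 ms.
print(compression_rate(61_000_000, 6_700_000))    # ~9.1
print(index_space_saving(61_000_000, 6_700_000))  # ~8.1
print(speedup_rate(0.020, 0.008))                 # 2.5
```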
VIII. DISCUSSION AND CHALLENGES

In this paper, we summarized recent efforts on compressing and accelerating deep neural networks (DNNs). Here we discuss in more detail how to choose among the different compression approaches, and possible challenges and solutions in this area.

A. General Suggestions

There is no golden rule to decide which approach is best. How to choose the proper method really depends on the application and its requirements. Here is some general guidance we can provide:

- If the application needs compact models derived from pre-trained models, one can choose either pruning & sharing or low-rank factorization based methods. If end-to-end solutions are needed, the low-rank and transferred convolutional filter approaches could be considered.

- For applications in some specific domains, methods with human prior (like the transferred convolutional filters and structural matrices) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters could work well, as medical images (e.g., of organs) do have the rotation transformation property.

- Usually the pruning & sharing approaches can give a reasonable compression rate without hurting the accuracy. Thus, for applications which require stable model accuracy, it is better to utilize pruning & sharing.

- If the problem involves small or medium-sized datasets, the knowledge distillation approaches can be tried. The compressed student model benefits from the knowledge transferred from the teacher model, making it robust on datasets which are not large.

- As we mentioned before, the techniques of the four groups are orthogonal, so it is reasonable to combine two or three of them to maximize the performance. For some specific applications, like object detection, which require both convolutional and fully connected layers, one can compress the convolutional layers with a low-rank based method and the fully connected layers with a pruning technique.

B. Technique Challenges

Techniques for deep model compression and acceleration are still at an early stage, and the following challenges still need to be addressed.

- Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, more plausible ways to configure the compressed models are needed.

- Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. This is efficient but also challenging, because removing channels might dramatically change the input of the following layer.

- As we mentioned before, methods based on structural matrices and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of such prior knowledge.

- The methods of knowledge distillation provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worthwhile to develop KD-based approaches further and explore how to improve their performance.
- Hardware constraints on various small platforms (e.g., mobile devices, robots, self-driving cars) remain a major obstacle hindering the extension of deep CNNs. How to make full use of the limited computational resources and how to design special compression methods for such platforms are still challenges that need to be addressed.

- Despite the great achievements of these compression approaches, the black-box mechanism is still a key barrier to their adoption. Exploring the interpretability of the compressed knowledge is still an important problem.

C. Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on recent learning-to-learn strategies [76], [77]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement learning to efficiently sample the design space and improve model compression has also been tried [78].

Channel pruning provides efficiency benefits on both CPU and GPU because no special implementation is required, but handling the resulting input configuration is challenging. One possible solution is to use training-based channel pruning methods [79], which focus on imposing sparse constraints on weights during training. However, training from scratch for such methods is costly for very deep CNNs. In [80], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.

Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the knowledge distillation (KD) approaches. Instead of directly reducing and transferring parameters, passing selectivity knowledge of neurons could be helpful: one can derive a way to select the essential neurons related to the task [81], [82]. The intuition is that if a neuron is activated in certain regions or samples, those regions or samples share some common properties that may relate to the task.

For methods based on convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed-prior issue, one solution is to generalize the aforementioned approaches in two aspects: 1) instead of limiting the transformation to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2D filters or matrices, and 2) learn the transformation jointly with all the model parameters.

Regarding the use of CNNs on small platforms, proposing general/unified approaches is one direction. Wang et al. [83] presented a feature map dimensionality reduction method that excavates and removes redundancy in feature maps generated by different filters, while preserving the intrinsic information of the original network. The idea can be applied to make CNNs more applicable to different platforms. The work in [84] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning, to make deep CNNs work on mobile devices.

Beyond the classification task, people are also adapting the compacted models to other tasks [85]-[87]. We would like to see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).

IX. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help in revising the paper. This research is supported by the National Science Foundation of China under Grant number 61401169.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, "Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification," CoRR, vol. abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in NIPS, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," CoRR, vol. abs/1412.6115, 2014.
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on cpus," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 1737-1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," International Conference on Learning Representations (ICLR), 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," CoRR, vol. abs/1612.01543, 2016.
[12] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 3123-3131.
[13] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in ECCV, 2016.
[15] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," CoRR, vol. abs/1606.01981, 2016.
[16] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," CoRR, vol. abs/1611.01600, 2016.
[17] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," CoRR, vol. abs/1510.03009, 2015.
[18] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177-185.
[19] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed., 1990, pp. 598-605.
[20] B. Hassibi, D. G. Stork, and S. C. R. Com, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993, pp. 164-171.
[21]S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural [43]T. S. Cohen and M. Welling, “Group equivariant convolutional net-
networks,” inProceedings of the British Machine Vision Conference works,”arXiv preprint arXiv:1602.07576, 2016.
2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. [44]S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural
31.131.12. networks,” inAdvances In Neural Information Processing Systems, 2016,
[22]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and pp. 10821090.
connections for efficient neural networks,” inProceedings of the 28th [45]W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and
International Conference on Neural Information Processing Systems, ser. improving convolutional neural networks via concatenated rectified
NIPS15, 2015. linear units,”arXiv preprint arXiv:1603.05201, 2016.
[23]W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Com- [46]H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in
pressing neural networks with the hashing trick.” JMLR Workshop and deep neural networks,”arXiv preprint arXiv:1604.00676, 2016.
Conference Proceedings, 2015. [47]S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic
[24]K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural symmetry in convolutional neural networks,” inProceedings of the
network compression,”CoRR, vol. abs/1702.04008, 2017. 33rd International Conference on International Conference on Machine
[25]V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain Learning - Volume 48, ser. ICML16, 2016.
damage,” in2016 IEEE Conference on Computer Vision and Pattern [48]C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, resnet and the impact of residual connections on learning.”CoRR, vol.
pp. 25542564. abs/1602.07261, 2016.
[26]H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact [49]B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified,
cnns,” inEuropean Conference on Computer Vision, Amsterdam, the small, low power fully convolutional neural networks for real-time object
Netherlands, October 2016, pp. 662677. detection for autonomous driving,”CoRR, vol. abs/1612.01051, 2016.
[27]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured [50]C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,”ˇ
sparsity in deep neural networks,” inAdvances in Neural Information inProceedings of the 12th ACM SIGKDD International Conference on
Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, Knowledge Discovery and Data Mining, ser. KDD 06, 2006, pp. 535
I. Guyon, and R. Garnett, Eds., 2016, pp. 20742082. 541.
[28]H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning [51]J. Ba and R. Caruana, “Do deep nets really need to be deep?” in
filters for efficient convnets,”CoRR, vol. abs/1608.08710, 2016. Advances in Neural Information Processing Systems 27: Annual Confer-
[29]V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for ence on Neural Information Processing Systems 2014, December 8-13
small-footprint deep learning,” inAdvances in Neural Information Pro- 2014, Montreal, Quebec, Canada, 2014, pp. 26542662.
cessing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, [52]G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
and R. Garnett, Eds., 2015, pp. 30883096. neural network,”CoRR, vol. abs/1503.02531, 2015.
[30]Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. [53]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
Chang, “An exploration of parameter redundancy in deep networks with Y. Bengio, “Fitnets: Hints for thin deep nets,”CoRR, vol. abs/1412.6550,
circulant projections,” inInternational Conference on Computer Vision 2014.
(ICCV), 2015. [54]A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling,
[31]Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and “Bayesian dark knowledge,” inAdvances in Neural Information Process-
S. Chang, “Fast neural networks with circulant projections,”CoRR, vol. ing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
abs/1502.03436, 2015. and R. Garnett, Eds., 2015, pp. 34203428.
[32]Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, [55]P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression
and Z. Wang, “Deep fried convnets,” inInternational Conference on by distilling knowledge from neurons,” inProceedings of the Thirtieth
Computer Vision (ICCV), 2015. AAAI Conference on Artificial Intelligence, February 12-17, 2016,
[33]J. Chun and T. Kailath,Generalized Displacement Structure for Block- Phoenix, Arizona, USA., 2016, pp. 35603566.
Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he was a Research Staff Member at the IBM T.J. Watson Research Center. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in deep generative models, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. His research interests are about deep learning, particularly few-shot learning and deep generative models. He also works on applications in computer vision and robotic vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of Electronic Information and Communications, Wuhan, China. He received his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interests include big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University, where he serves as Associate Dean of the School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.

File diff suppressed because it is too large

View File

@@ -0,0 +1,391 @@
Channel Pruning for Accelerating Very Deep Neural Networks
Yihui He * Xiangyu Zhang Jian Sun
Xi'an Jiaotong University Megvii Inc. Megvii Inc.
Xi'an, 710049, China Beijing, 100190, China Beijing, 100190, China
heyihui@stu.xjtu.edu.cn zhangxiangyu@megvii.com sunjian@megvii.com
Abstract

In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5× speed-up along with only a 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet and Xception and suffers only 1.4% and 1.0% accuracy loss under 2× speed-up respectively, which is significant.

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).
1. Introduction

Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one.

Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, the feature map width (number of channels) cannot be reduced, which makes it difficult to decompose the 1×1 convolutional layers favored by modern networks (e.g., GoogleNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.

Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited.

In this paper, we propose a new inference-time approach for channel pruning, utilizing inter-channel redundancy. Inspired by the tensor factorization improvement based on feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing the reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternating steps: channel selection and feature map reconstruction. In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps. Further, we approximate the network layer-by-layer, with the accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]).

For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms previous state-of-the-arts. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4% and 1.0% accuracy loss respectively.

* This work was done when Yihui He was an intern at Megvii Inc.
Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation where two channels are pruned for feature map B. Thus the corresponding channels of filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C; kh × kw: kernel size.

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22].

Optimized implementation based methods [35, 47, 27, 4] accelerate convolution with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity.

Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers by up to 50×. However, in practice, the actual speed-up may be very related to implementation.

Tensor factorization [22, 28, 13, 24] decomposes weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into 3×3 and 1×1 combinations, driven by feature map redundancy.

Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely exploited.

Inference-time channel pruning is challenging, as reported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, which mainly operates on the fully connected layers. For data-free approaches [31, 3], results at large speed-up ratios (e.g., 5×) have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible to work on very deep models and large datasets. From our observation, [31] is sometimes even worse than the naive solution (Sec. 4.1.1).
3. Approach

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.

3.1. Formulation

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining the outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, the filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction: we need to reconstruct the following feature maps using the selected channels.

Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps.

Formally, to prune a feature map with c channels, we consider applying n × c × kh × kw convolutional filters W on N × c × kh × kw input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and kh, kw are the kernel size. For simple representation, the bias term is not included in our formulation. To prune the input channels from c to the desired c' (0 ≤ c' ≤ c), while minimizing the reconstruction error, we formulate our problem as follows:
$$\arg\min_{\beta,\,W}\;\frac{1}{2N}\Big\|Y-\sum_{i=1}^{c}\beta_i X_i W_i^\top\Big\|_F^2 \quad\text{subject to}\quad \|\beta\|_0\le c' \tag{1}$$

Here ||·||_F is the Frobenius norm. X_i is the N × (kh kw) matrix sliced from the ith channel of the input volumes X, i = 1, ..., c. W_i is the n × (kh kw) filter weights sliced from the ith channel of W. β is a coefficient vector of length c for channel selection, and β_i is the ith entry of β. Notice that, if β_i = 0, X_i will no longer be useful and could be safely pruned from the feature map; W_i could also be removed.

Optimization

Solving this l0 minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the l0 to l1 regularization:

$$\arg\min_{\beta,\,W}\;\frac{1}{2N}\Big\|Y-\sum_{i=1}^{c}\beta_i X_i W_i^\top\Big\|_F^2+\lambda\|\beta\|_1 \quad\text{subject to}\quad \|\beta\|_0\le c',\;\;\forall i\;\|W_i\|_F=1 \tag{2}$$

λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add a constraint ∀i ||W_i||_F = 1 to this formulation, which avoids the trivial solution.

Now we solve this problem in two folds. First, we fix W and solve β for channel selection. Second, we fix β and solve W to minimize the reconstruction error.

(i) The subproblem of β. In this case, W is fixed. We solve β for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection:

$$\hat{\beta}^{\mathrm{LASSO}}(\lambda)=\arg\min_{\beta}\;\frac{1}{2N}\Big\|Y-\sum_{i=1}^{c}\beta_i Z_i\Big\|_F^2+\lambda\|\beta\|_1 \quad\text{subject to}\quad \|\beta\|_0\le c' \tag{3}$$

Here Z_i = X_i W_i^T (of size N × n). We will ignore the ith channel if β_i = 0.

(ii) The subproblem of W. In this case, β is fixed. We utilize the selected channels to minimize the reconstruction error. We can find the optimized solution by least squares:

$$\arg\min_{W'}\;\big\|Y-X'(W')^\top\big\|_F^2 \tag{4}$$

Here X' = [β_1 X_1, β_2 X_2, ..., β_i X_i, ..., β_c X_c] (of size N × c kh kw). W' is the n × c kh kw reshaped W, W' = [W_1, W_2, ..., W_i, ..., W_c]. After obtaining the result W', it is reshaped back to W. Then we assign β_i ← β_i ||W_i||_F and W_i ← W_i / ||W_i||_F, so the constraint ∀i ||W_i||_F = 1 is satisfied.

We alternatively optimize (i) and (ii). In the beginning, W is initialized from the trained model, λ = 0, namely no penalty, and ||β||_0 = c. We gradually increase λ. For each change of λ, we iterate these two steps until ||β||_0 is stable. After ||β||_0 ≤ c' is satisfied, we obtain the final solution W from {β_i W_i}. In practice, we found that the two-step iteration is time consuming, so we apply (i) multiple times, until ||β||_0 ≤ c' is satisfied, then apply (ii) just once to obtain the final result. From our observation, this result is comparable with the two-step iterations. Therefore, in the following experiments, we adopt this approach for efficiency.

Discussion: Some recent works [48, 1, 17] (though training based) also introduce the l1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduced sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.

3.2. Whole Model Pruning

Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This can be formalized as:

$$\arg\min_{\beta,\,W}\;\frac{1}{2N}\Big\|Y'-\sum_{i=1}^{c}\beta_i X_i W_i^\top\Big\|_F^2 \quad\text{subject to}\quad \|\beta\|_0\le c' \tag{5}$$

Different from Eqn. 1, Y is replaced by Y', which is from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.
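To make the two subproblems concrete, here is a minimal sketch of the single-layer step in Python, assuming the sampled input volumes have already been unrolled into per-channel slices X_i (N × kh·kw) and filter slices W_i (n × kh·kw), with Y the N × n outputs sampled from the unpruned model. It uses scikit-learn's Lasso for step (i) and a plain least-squares fit for step (ii); the gradual λ schedule, the ||W_i||_F = 1 renormalization and the multi-branch handling are omitted, and the function names are illustrative, not the authors' released code.

import numpy as np
from sklearn.linear_model import Lasso

def select_channels(X_slices, W_slices, Y, c_prime, lam=1e-4):
    """Step (i): LASSO channel selection. Returns a boolean mask over the c input channels."""
    c = len(X_slices)
    # Z_i = X_i W_i^T has shape N x n; stack the c responses as features of a linear model in beta.
    Z = np.stack([X_slices[i] @ W_slices[i].T for i in range(c)], axis=-1)  # N x n x c
    Z = Z.reshape(-1, c)            # (N*n) x c design matrix
    y = Y.reshape(-1)               # flattened targets
    # Coarsely increase the penalty until at most c' coefficients remain non-zero.
    while True:
        beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z, y).coef_
        if np.count_nonzero(beta) <= c_prime:
            return beta != 0
        lam *= 2.0

def reconstruct_weights(X_slices, Y, keep):
    """Step (ii): least-squares refit of the filters on the kept channels only."""
    X_kept = np.concatenate([X_slices[i] for i in np.where(keep)[0]], axis=1)  # N x (c'*kh*kw)
    W_prime, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)                       # (c'*kh*kw) x n
    return W_prime.T                                                           # new filters, n x (c'*kh*kw)

For whole model pruning (Sec. 3.2), the same two calls would be applied layer by layer, with Y always taken from the un-pruned model's feature maps so that the accumulated error is accounted for.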
3.3. Pruning Multi-Branch Networks

The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1×1, 3×3, 1×1, Fig. 3, left). Layers other than the first and last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) can't be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows.

Figure 3. Illustration of multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement; c_x denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width can be reduced. As for the last layer, rather than approximating Y2, we try to approximate Y1 + Y2 directly (Sec. 3.3, Last layer of residual branch).

Last layer of residual branch: Shown in Fig. 3, the output layer of a residual block consists of two inputs: feature maps Y1 and Y2 from the shortcut and the residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1 and Y2 are the original feature maps before pruning. Y2 can be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 cannot be recovered directly. To compensate this error, the optimization goal of the last layer is changed from Y2 to Y1 - Y1' + Y2, which does not change our optimization. Here, Y1' is the current feature map after the previous layers are pruned. When pruning, volumes should be sampled correspondingly from these two branches.

First layer of residual branch: Illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. Differently, we sample the selected channels on the shared feature maps to construct a new input for the later convolution, shown in Fig. 3 (right). The computational cost of this operation can be ignored. More importantly, after introducing feature map sampling, the convolution is still "regular".

Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply our Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original one. From our experiments, it improves top-5 accuracy by 0.5% for 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it outputs "irregular" convolutional layers, which need special library implementation support. We do not adopt it in the following experiments.
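As a rough illustration of the two residual-branch variants above, a short sketch follows; the tensor layout and names are our assumptions, not the authors' code.

import numpy as np

def sample_first_layer_input(x, kept_channels):
    # First layer of residual branch: the shared block input cannot be pruned,
    # so we only gather the selected channels before the branch's first convolution.
    return x[:, kept_channels, :, :]          # x assumed to be N x c x H x W

def last_layer_target(Y1, Y2, Y1_current):
    # Last layer of residual branch: fit the pruned layer to Y1 + Y2 - Y1',
    # compensating the parameter-free shortcut for error accumulated in earlier layers.
    return Y1 + Y2 - Y1_current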
4. Experiment

We evaluate our approach on the popular VGG Nets [43], ResNet [18] and Xception [7], on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].

For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solver implementation. For channel pruning, we found that it is enough to extract 5000 images, with 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224×224 pixels. We can gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e-5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224×224 and mirror.
We aim to recoverY1 + Y 2 for this block. Here,Y1 ,Y2 4.1. Experiments with VGG­16 are the original feature maps before pruning.Y2 could be
approximated as in Eqn.1. However, shortcut branch is VGG-16 [43] is a 16 layers single path convolutional
parameter-free, thenY neural network, with 13 convolutional layers. It is widely 1 could not be recovered directly. To
compensate this error, the optimization goal of the last layer used in recognition, detection and segmentation,etc. Single
is changed fromY view top-5 accuracy for VGG-16 is 89.9% 1 .2 toY1 Y +Y, which does not change 1 2
our optimization. Here,Y is the current feature map after 1 previous layers pruned. When pruning, volumes should be 4.1.1 Single Layer Pruning
sampled correspondingly from these two branches. In this subsection, we evaluate single layer acceleration per-First layer of residual branch: Illustrated in formance using our algorithm in Sec.3.1. For better under-Fig.3(left), the input feature map of the residual block standing, we compare our algorithm with two naive chan-could not be pruned, since it is also shared with the short- nel selection strategies.first kselects the firstkchannels.cut branch. In this condition, we could performfeature max responseselects channels based on corresponding fil-map samplingbefore the first convolution to save compu- ters that have high absolute weights sum [31]. For fair com-tation. We still apply our algorithm as Eqn.1. Differently, parison, we obtain the feature map indexes selected by eachwe sample the selected channels on the shared feature maps of them, then perform reconstruction (Sec. 3.1(ii)). We to construct a new input for the later convolution, shown hope that this could demonstrate the importance of channelin Fig.3(right). Computational cost for this operation could selection. Performance is measured by increase of error af-be ignored. More importantly, after introducingfeature map ter a certain layer is pruned without fine-tuning, shown insampling, the convolution is still ”regular”. Fig.4.Filter-wise pruningis another option for the first con- As expected, error increases as speed-up ratio increases.volution on the residual branch. Since the input channels Our approach is consistently better than other approaches inof parameter-free shortcut branch could not be pruned, we different convolutional layers under different speed-up ra-apply our Eqn.1to each filter independently (each fil- tio. Unexpectedly, sometimesmax responseis even worseter chooses its own representative input channels). Under thanfirst k. We argue thatmax responseignores correla-single layer acceleration,filter-wise pruningis more accu- tions between different filters. Filters with large absoluterate than our original one. From our experiments, it im- weight may have strong correlation. Thus selection based proves 0.5% top-5 accuracy for2×ResNet-50 (applied on on filter weights is less meaningful. Correlation on featurethe first layer of each residual branch) without fine-tuning. maps is worth exploiting. We can find that channel selectionHowever, after fine-tuning, theres no noticeable improve-
ment. In addition, it outputs ”irregular” convolutional lay- 1 http://www.vlfeat.org/matconvnet/pretrained/
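For reference, the two naive baselines can be written as simple index selectors; this is a sketch assuming filters W of shape (n, c, kh, kw), with names that are ours rather than the paper's.

import numpy as np

def first_k(W, k):
    # Keep the first k input channels regardless of the weights.
    return np.arange(k)

def max_response(W, k):
    # Rank input channels by the absolute sum of the corresponding filter weights [31].
    importance = np.abs(W).sum(axis=(0, 2, 3))
    return np.argsort(importance)[::-1][:k]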
Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify the importance of channel selection referred to in Sec. 3.1, we considered two naive baselines: first k selects the first k feature maps; max response selects channels based on the absolute sum of the corresponding filter weights [31]. Our approach is consistently better (smaller is better). Panels: conv1_1, conv2_1, conv3_1, conv3_2, conv4_1, conv4_2; x-axis: speed-up ratio (1.0 to 4.0); y-axis: increase of error (%).
Also notice that channel pruning gradually becomes harder from shallower to deeper layers. This indicates that shallower layers have much more redundancy, which is consistent with [52]. We can prune more aggressively on shallower layers in whole model acceleration.

Increase of top-5 error (1-view, baseline 89.9%)
Solution                                      2×     4×     5×
Jaderberg et al. [22] ([52]'s impl.)          -      9.7    29.7
Asym. [52]                                    0.28   3.84   -
Filter pruning [31] (fine-tuned, our impl.)   0.8    8.6    14.6
Ours (without fine-tune)                      2.7    7.9    22.0
Ours (fine-tuned)                             0      1.0    1.7
Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better).

4.1.2 Whole Model Pruning

Whole model acceleration results under 2×, 4×, and 5× are shown in Table 1. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively for shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1 : 1.5. conv5_x are not pruned, since they only contribute 9% of the total computation and are not redundant.

After fine-tuning, we can reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22], without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and the absolute speed-up ratio on GPU is much higher (Table 3).
Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52]. Demonstrated in Table 2, our 3-cardinality acceleration (spatial factorization, channel factorization, and channel pruning, denoted by 3C) outperforms previous state-of-the-art results. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: 1×3, 3×1, 1×1. We apply spatial factorization, channel factorization, and our channel pruning together sequentially layer-by-layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality.

Increase of top-5 error (1-view, 89.9%)
Solution                        4×     5×
Asym. 3D [52]                   0.9    2.0
Asym. 3D (fine-tuned) [52]      0.3    1.0
Our 3C                          0.7    1.3
Our 3C (fine-tuned)             0.0    0.3
Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better).

4.1.3 Comparisons of Absolute Performance

We further evaluate the absolute performance of acceleration on GPU. Results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN 5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged over 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead; they could not gain much absolute speed-up. Though our approach also encounters some performance degradation, it generalizes better on GPU than other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because the current library and hardware prefer a single large convolution instead of several small ones.

Model          Solution                               Increased err.   GPU time/ms
VGG-16         -                                      0                8.144
VGG-16 (4×)    Jaderberg et al. [22] ([52]'s impl.)   9.7              8.051 (1.01×)
               Asym. [52]                             3.8              5.244 (1.55×)
               Asym. 3D [52]                          0.9              8.503 (0.96×)
               Asym. 3D (fine-tuned) [52]             0.3              8.503 (0.96×)
               Ours (fine-tuned)                      1.0              3.264 (2.50×)
Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better).

4.1.4 Comparisons with Training from Scratch

Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with from-scratch counterparts. To be fair, we evaluated both the from-scratch counterpart and a normal-setting network that has the same computational complexity and the same architecture.

Shown in Table 4, we observed that it is difficult for from-scratch counterparts to reach competitive accuracy; our model outperforms the from-scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] showing that a model could be easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers.

For from scratch (uniformed), the filters in each layer are reduced by half (e.g., conv1_1 is reduced from 64 to 32). We can observe that normal-setting networks of the same complexity couldn't reach the same accuracy either. This consolidates our idea that there is much redundancy in networks while training. However, redundancy can be opted out of at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches.

Notice that there is a 0.6% gap between the from-scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even for a thinner one. Further research could extend our approach to thin model exploration.

Original (acc. 89.9%)       Top-5 err.   Increased err.
From scratch                11.9         1.8
From scratch (uniformed)    12.5         2.4
Ours                        18.0         7.9
Ours (fine-tuned)           11.1         1.0
Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms scratch-trained counterparts (smaller is better).
4.1.5 Acceleration for Detection

VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16, for Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainval images and 5k test images. The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN.

The actual running time of Faster R-CNN is 220ms/image. The convolutional layers contribute about 64%. We get an actual time of 94ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop for our 2× model, which is not harmful for practical consideration.

Speedup     mAP    ∆mAP
Baseline    68.7   -
2×          68.3   0.4
4×          66.9   1.8
Table 5. 2×, 4× acceleration for Faster R-CNN detection.

4.2. Experiments with Residual Architecture Nets

For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, the 1×1 convolution is favored, which can hardly be factorized.

4.2.1 ResNet Pruning

ResNet complexity uniformly drops on each residual block. Guided by the single layer experiments (Sec. 4.1.1), we still prefer reducing shallower layers more heavily than deeper ones.

Following a similar setting as Filter pruning [31], we keep 70% of the channels for sensitive residual blocks (res5 and blocks close to the position where the spatial size changes, e.g. res3a, res3d). As for other blocks, we keep 30% of the channels. With multi-branch enhancement, we prune branch2a more aggressively within each residual block. The remaining channel ratios for branch2a, branch2b, branch2c are 2 : 4 : 3 (e.g., given 30%, we keep 40%, 80%, 60% respectively).

We evaluate the performance of the multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve by 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which can broadcast to every layer after it, and the large input feature map width at the entry of each residual block is well reduced by our feature map sampling.

Solution                      Increased err.
Ours                          8.0
Ours (enhanced)               4.0
Ours (enhanced, fine-tuned)   1.4
Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with multi-branch enhancement (Sec. 3.3, smaller is better).

4.2.2 Xception Pruning

Since computational complexity has become important in model design, separable convolution has received much attention [49, 7]. Xception [7] is already spatially optimized, and tensor factorization on its 1×1 convolutional layers is destructive. Thanks to our approach, it can still be accelerated with graceful degradation. For the ease of comparison, we adopt Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and a complexity of 4450 MFLOPs.

We apply the multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Maybe because the Xception block is unstable, Batch Normalization layers must be maintained during pruning; otherwise it becomes nontrivial to fine-tune the pruned model.

Shown in Table 7, after fine-tuning, we only suffer a 1.0% increase of error under 2×. Filter pruning [31] can also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%. After training 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as the results for VGG-16, since modern networks tend to have less redundancy by design.

Solution                                      Increased err.
Filter pruning [31] (our impl.)               92.8
Filter pruning [31] (fine-tuned, our impl.)   4.3
Ours                                          2.9
Ours (fine-tuned)                             1.0
Table 7. Comparisons for Xception-50, under 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches. Most structured simplification methods are not effective on the Xception architecture (smaller is better).
4.2.3 Experiments on CIFAR-10

Even though our approach is designed for large datasets, it generalizes well on small datasets. We perform experiments on the CIFAR-10 dataset [25], which is favored by many acceleration researches. It consists of 50k images for training and 10k for testing in 10 classes.

We reproduce ResNet-56, which has an accuracy of 92.8% (for reference, the official ResNet-56 [18] has an accuracy of 93.0%). For 2× acceleration, we follow a similar setting as Sec. 4.2.1 (keep the final stage unchanged, where the spatial size is 8×8). Shown in Table 8, our approach is competitive with the scratch-trained one, without fine-tuning, under 2× speed-up. After fine-tuning, our result is significantly better than Filter pruning [31] and the scratch-trained one.

Solution                                      Increased err.
Filter pruning [31] (fine-tuned, our impl.)   1.3
From scratch                                  1.9
Ours                                          2.0
Ours (fine-tuned)                             1.0
Table 8. 2× speed-up comparisons for ResNet-56 on CIFAR-10; the baseline accuracy is 92.8% (one view). We outperform previous approaches and the scratch-trained counterpart (smaller is better).

5. Conclusion

To conclude, current deep CNNs are accurate but have high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep networks. The reduced CNNs are inference-efficient networks that maintain accuracy and only require off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on ImageNet, CIFAR-10 and PASCAL VOC.

In the future, we plan to involve our approach at training time, instead of inference time only, which may also accelerate the training procedure.

References

[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262-2270, 2016.
[2] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.
[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. LCNN: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373-384, 1995.
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248-255. IEEE, 2009.
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269-1277, 2014.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[12] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379-1387, 2016.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243-254. IEEE Press, 2016.
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[29] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806-814, 2015.
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015.
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40-53, 2008.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163-2175, 2015.
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074-2082, 2016.
[49] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365-2369, 2013.
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943-1955, 2016.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,261 @@
Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell Ananya Ganesh Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, aganesh, mccallum}@cs.umass.edu
Abstract

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

Consumption                           CO2e (lbs)
Air travel, 1 passenger, NY↔SF        1984
Human life, avg, 1 year               11,023
American life, avg, 1 year            36,156
Car, avg incl. fuel, 1 lifetime       126,000

Training one model (GPU)
NLP pipeline (parsing, SRL)           39
  w/ tuning & experimentation         78,468
Transformer (big)                     192
  w/ neural architecture search       626,155

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.[1]

[1] Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
1 Introduction

Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.

Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited to the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.

To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) Time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.
Consumer          Renew.   Gas    Coal   Nuc.
China             22%      3%     65%    4%
Germany           40%      7%     38%    13%
United States     17%      35%    27%    19%
Amazon-AWS        17%      24%    30%    26%
Google            56%      14%    15%    10%
Microsoft         32%      23%    31%    10%

Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,[4] China[5] and Germany (Burger, 2019).

2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.

We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface[2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.[3]

We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total power p_t required at a given instance during training is given by:

$$p_t=\frac{1.58\,t\,(p_c+p_r+g\,p_g)}{1000} \tag{1}$$

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

$$\mathrm{CO_2e}=0.954\,p_t \tag{2}$$

This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt hour of compute energy used.

[2] nvidia-smi: https://bit.ly/30sGEbi
[3] RAPL power meter: https://bit.ly/2LObQhV
[4] U.S. Dept. of Energy: https://bit.ly/2JTbGnI
[5] China Electricity Council; trans. China Energy Portal: https://bit.ly/2QHE5O3
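To illustrate the bookkeeping behind Eqn. 1 and Eqn. 2, the following is a small Python sketch; the nvidia-smi query flags are standard, but the sampling setup, names and the example numbers are our assumptions rather than the authors' measurement scripts.

import subprocess

PUE = 1.58                 # 2018 global average data-center PUE (Ascierto, 2018)
LBS_CO2_PER_KWH = 0.954    # EPA average lbs CO2 per kWh for U.S. power (EPA, 2018)

def sample_gpu_power_watts():
    # One nvidia-smi sample of the power draw summed over all visible GPUs.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"])
    return sum(float(v) for v in out.decode().split())

def total_kwh(hours, p_c, p_r, p_g, g):
    # Eqn. 1: combined CPU, DRAM and GPU draw (watts), scaled by PUE, in kWh.
    return PUE * hours * (p_c + p_r + g * p_g) / 1000.0

def co2e_lbs(kwh):
    # Eqn. 2: estimated CO2 emissions in pounds.
    return LBS_CO2_PER_KWH * kwh

# Illustrative placeholder draws (watts), not measured values:
print(co2e_lbs(total_kwh(hours=84, p_c=90.0, p_r=30.0, p_g=170.0, g=8)))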
2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.

Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in §4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.

ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pre-trained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).

BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).

GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.[6]

3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.

Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.

[6] Via the authors on Reddit.
[7] GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at $0.43-$0.74/hr; upper bound uses on-demand U.S. resources priced at $1.46-$2.48/hr. We similarly use pre-emptible ($1.46/hr-$2.40/hr) and on-demand ($4.50/hr-$8/hr) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.
Transformer base | P100x8 | 1415.78 | 12 | 27 | 26 | $41-$140
Transformer big | P100x8 | 1515.43 | 84 | 201 | 192 | $289-$981
ELMo | P100x3 | 517.66 | 336 | 275 | 262 | $433-$1472
BERT base | V100x64 | 12,041.51 | 79 | 1507 | 1438 | $3751-$12,571
BERT base | TPUv2x16 | n/a | 96 | n/a | n/a | $2074-$6912
NAS | P100x8 | 1515.43 | 274,120 | 656,347 | 626,155 | $942,973-$3,201,722
NAS | TPUv2x1 | n/a | 32,623 | n/a | n/a | $44,055-$146,848
GPT-2 | TPUv3x32 | n/a | 168 | n/a | n/a | $12,902-$43,008

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD). [7] Power and carbon footprint are omitted for TPUs (marked n/a) due to lack of public information on power draw for this hardware.
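The "Cloud compute cost" column can be approximately reproduced from the reported hours and the hourly rates quoted in footnote 7. The sketch below is illustrative: treating the P100 rates as exactly $0.43/hr (pre-emptible) and $1.46/hr (on-demand) is an assumption consistent with the ranges in the footnote.

```python
# Sketch of the cloud-compute cost bounds reported in Table 3.
def cloud_cost_bounds(wall_clock_hours, num_devices, preemptible_rate, on_demand_rate):
    """Lower bound uses pre-emptible pricing, upper bound uses on-demand pricing."""
    device_hours = wall_clock_hours * num_devices
    return device_hours * preemptible_rate, device_hours * on_demand_rate

# Transformer big: 84 hours on 8 P100 GPUs.
low, high = cloud_cost_bounds(84, 8, preemptible_rate=0.43, on_demand_rate=1.46)
print(f"${low:.0f} - ${high:.0f}")  # roughly $289 - $981, matching the Table 3 row
```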
4 Experimental results

4.1 Cost of training

Table 3 lists CO2 emissions and estimated cost of training the models described in §2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.

Models | Hours | Cloud compute (USD) | Electricity (USD)
1 | 120 | $52-$175 | $5
24 | 2880 | $1238-$4205 | $118
4789 | 239,942 | $103k-$350k | $9870

Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model, (2) a single tune and (3) all models trained during R&D.

4.2 Cost of development: Case study

To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP.

Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs. [8]

The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model. [9] We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.

[8] We approximate cloud compute cost using P100 pricing.
[9] Based on average U.S. cost of electricity of $0.12/kWh.
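The project totals above can be derived from raw job logs with a few lines of aggregation. The log format, the stand-in durations, and the hourly rates below are hypothetical; plugging in the paper's total of 239,942 GPU-hours with the same rates gives roughly $103k-$350k, matching the last row of Table 4.

```python
# Hypothetical sketch: aggregate R&D cost from a list of training-job durations.
job_hours = [0.05, 52.0, 216.0]  # stand-in for the 4789 real job durations (hours)

total_gpu_hours = sum(job_hours)
gpu_years = total_gpu_hours / (24 * 365)

# Pre-emptible vs. on-demand P100 rates, per footnote 7 (assumed exact values).
PRICE_LOW, PRICE_HIGH = 0.43, 1.46  # USD per GPU-hour
cloud_low = total_gpu_hours * PRICE_LOW
cloud_high = total_gpu_hours * PRICE_HIGH

print(f"{total_gpu_hours:.0f} GPU-hours ({gpu_years:.2f} GPU-years), "
      f"cloud cost ${cloud_low:,.0f}-${cloud_high:,.0f}")
```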
5 Conclusions

Authors should report training time and sensitivity to hyperparameters.

Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as re-training on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.

Academic researchers need equitable access to computation resources.

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute. Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.

While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for non-profit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.

Researchers should prioritize computationally efficient hardware and algorithms.

We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist, [10] they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP.

[10] For example, the Hyperopt Python library.
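A minimal random-search loop in the spirit of Bergstra and Bengio (2012) is sketched below. `train_and_evaluate`, the search space, and the budget are hypothetical stand-ins for a real tuning setup; libraries such as Hyperopt provide Bayesian (TPE) alternatives for the same loop.

```python
# Random hyperparameter search as an alternative to exhaustive grid search.
import math
import random

def train_and_evaluate(lr, dropout):
    # Placeholder objective; replace with an actual training/validation run.
    return (math.log10(lr) + 3) ** 2 + (dropout - 0.2) ** 2

def random_search(budget=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        params = {
            "lr": 10 ** rng.uniform(-5, -1),   # log-uniform learning rate
            "dropout": rng.uniform(0.0, 0.5),
        }
        score = train_and_evaluate(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

print(random_search())
```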
Acknowledgements

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References

Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554.

Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.

Gary Cook, Jude Lee, Tamina Tsai, Ada Kong, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.

Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.

Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom), pages 477-484.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959.

David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).

View File

@ -0,0 +1,793 @@
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 1381
Finite-Element Neural Networks for Solving
Differential Equations
Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE
Abstract—The solution of partial differential equations (PDE)
arises in a wide variety of engineering problems. Solutions to most
practical problems use numerical analysis techniques such as fi-
nite-element or finite-difference methods. The drawbacks of these
approaches include computational costs associated with the mod-
eling of complex geometries. This paper proposes a finite-element
neural network (FENN) obtained by embedding a finite-element
model in a neural network architecture that enables fast and ac-
curate solution of the forward problem. Results of applying the
FENN to severalsimpleelectromagnetic forward and inverseprob-
lems are presented. Initial results indicate that the FENN perfor-
mance as a forward model is comparable to that of the conven-
tional finite-element method (FEM). The FENN can also be used
in an iterative approach to solve inverse problems associated with the PDE. Results showing the ability of the FENN to solve the inverse problem given the measured signal are also presented. The parallel nature of the FENN also makes it an attractive solution for parallel implementation in hardware and software.

Index Terms—Finite-element method (FEM), finite-element neural network (FENN), inverse problems.

Manuscript received January 17, 2004; revised April 2, 2005. The authors are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 USA (e-mail: rpradeep@egr.msu.edu; udpal@egr.msu.edu; udpa@egr.msu.edu). Digital Object Identifier 10.1109/TNN.2005.857945

Fig. 1. Iterative inversion method for solving inverse problems.

I. INTRODUCTION

SOLUTIONS of differential equations arise in a wide variety of engineering applications in electromagnetics, signal processing, computational fluid dynamics, etc. These equations are typically solved using either analytical or numerical methods. Analytical solution methods are however feasible only for simple geometries, which limits their applicability. In most practical problems with complex boundary conditions, numerical analysis methods are required in order to obtain a reasonable solution. An example is the solution of Maxwell's equations in electromagnetics. Solutions to Maxwell's equations are used in a variety of applications for calculating the interaction of electromagnetic (EM) fields with different types of media.

Very often, the solution to differential equations is necessary for solving the corresponding inverse problems. Inverse problems in general are ill-posed, lacking continuous dependence of the measurements on the input. This has resulted in the development of a variety of solution techniques ranging from simple calibration procedures to other direct (analytical) and iterative approaches [1]. Iterative methods typically employ a forward model that simulates the underlying physical process (Fig. 1) [2]. An initial estimate of the solution of the inverse problem (represented in Fig. 1) is applied to the forward model, resulting in the corresponding solution to the forward problem. The model output is compared to the measurement using a cost function. If the cost is less than a tolerance, the estimate is used as the desired solution. If not, the estimate is updated to minimize the cost function.

Although finite-element methods (FEMs) [3], [4] are extremely popular for solving differential equations, their major drawback is computational complexity. This problem becomes more acute when three-dimensional (3-D) finite-element models are used in an iterative algorithm for solving the inverse problem. Recently, several authors have suggested the use of neural networks (MLP or RBF networks [5]) for solving differential equations [6]-[9]. In these techniques, a neural network is trained using a large database containing the input data and the solution of the differential equation. The neural network during generalization learns the mapping corresponding to the PDE. Alternatively, in [10], the solution to a differential equation is written as a constant term and an adjustable term with parameters that need to be determined. A neural network is used to determine the optimal values of the parameters. This approach is applicable only to problems with regular boundaries. An extension of the approach to problems with irregular boundaries is given in [11]. Other neural network based differential equation solvers use multilayer perceptron networks or variations on the MLP to approximate the unknown function in a PDE [12]-[14]. A combination of the PDE and boundary conditions is used to construct an objective function that is minimized during the training process.

A major limitation of these approaches is that the network architecture is selected somewhat arbitrarily. A second drawback is that the performance of the neural networks depends on the data used in training and testing. As long as the test data is similar to the training data, the network can interpolate between the training data points to obtain a reasonable prediction. However, when the test signal is no longer similar to the training data, the
network is forced to extrapolate and the performance degrades. Section V draws conclusions from the results and presents
One way around this difficulty is to ensure that the training data- ideas for future work.
base has a diverse set of signals. However, this is difficult to
ensure in practice. Alternatively, we have to design neural net- II. T HE FENN
works that are capable of extrapolation. Extrapolation methods This section briefly describes the FEM and proposes its refor-are discussed extensively in literature [15][18], but the design mulation into a parallel neural network structure. Details aboutof an extrapolation neural network involves several issues par- the FEM can be found in [3] and [4].ticularly for ensuring that the error in the network prediction
stays within reasonable bounds during the extrapolation proce- A. The FEMdure. Consider a typical boundary value problem with the gov-An ideal solution to this problem would be to combine the erning differential equationpower of numerical models with the computational speed of
neural networks, i.e., to embed a numerical model in a neural (1)network structure. One suchfinite-element neural network
(FENN) formulation has been reported by Takeuchi and Kosugi where is a differential operator, is the applied source or
[19]. This approach, based on error minimization, derives the forcing function, and is the unknown quantity. This differen-
neural network using the energy functional resulting from the tial equation can be solved in conjunction with boundary condi-
finite-element formulation. Other reports of FENN combina- tionson theboundary enclosingthedomain .Thevariational
tions are either similar to the Takeuchi method [20], [21] or use formulation used infinite-element analysis determines the un-
Hopfield neural networks to solve the forward problem [22], known by minimizing the functional [3], [4]
[23]. Kalkkuhlet al.[24] provide a description of a FEM-based
approach to NARX modeling that may be interpreted both as (2)
a local model network, as well as a single layer feedforward
network. A slightly different approach to merging numerical with respect to the trial function . The minimization procedure
methods and neural networks is given in [25], where thefi- starts by dividing into small subdomains called elements
nite-difference time domain (FDTD) method is cast in a neural (Fig. 2) and representing in each element by means of basis
network framework for the purpose of solving electromagnetic functions defined over the element
forward problems. The related problem of mesh generation
infinite-element models has also been tackled using neural (3)networks (for instance, [26]). Generally, these networks are
designed to solve the forward problem, and must be modified
to solve inverse problems. where is the unknown solution in element , is the basis
This paper proposes a new approach that embeds afinite-ele- function associated with node in element , is the value
ment model commonly used in the solution of differential equa- of the unknown quantity at node and is the total number of
tions in a neural network. The network, called the FENN, can nodes associated with element . In general, the basis functions
solve the forward problem and can also be used in an itera- (also referred to as interpolation functions or shape functions)
tive algorithm to solve inverse problems. The primary advan- can be linear, quadratic, or of higher order. Typically,finite-el-
tage of this approach is that the FEM is represented in a parallel ement models use either linear or polynomial spline basis func-
form. Thus, it has the potential to alleviate the computational tions.
cost associated with using the FEM in an iterative algorithm The functional within an element is expressed as
for solving inverse problems. More importantly, the FENN does
not need any training, and the computation of the weights is (4)
a one-time process. The proposed approach is also different in
that the neural network architecture developed can be used to
solve the forward and inverse problems. The structure of the By substituting (3) in (4), we obtain the discrete version of the
neural network is also simpler than those reported in the litera- functional within each element
ture, making it easier to implement in parallel in both hardware (5)and software.
The rest of this paper is organized as follows. Section II where is the transpose of a matrix, is the ele-briefly describes the FEM, and derives the proposed FENN. In mental matrix with elements this paper, we focus on the problem of solving typical equa-
tions encountered in electromagnetic nondestructive evaluation (6)(NDE). However, the same concepts can be easily applied
to solve differential equations encountered in otherfields.
Sections III, IV and V present the application of the FENN and is an vector with elements
to solving forward and inverse problems, along with initial
results. A discussion of the advantages and disadvantages of (7)
the proposed FENN architecture is given in Section IV. Finally,
Combining the values in (5) for each of the elements
(8)
where is the global matrix derived from the terms
of the elemental matrices for different elements, and is the
total number of nodes. , also called the stiffness matrix, is a
sparse, banded matrix. Equation (8) is the discrete version of
the functional and can be minimized with respect to the nodal
parameters by taking the derivative of with respect to and
setting it equal to zero, which results in the matrix equation

    (9)

Fig. 2. (a) Schematic representation of domain and boundary. (b) Sample FEM mesh for the domain.
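The assembly and solution step summarized in (8)-(9) can be illustrated with a small self-contained sketch. The explicit symbols were lost in extraction, so the names below (K for the global stiffness matrix, phi for the nodal values, b for the source vector), the 1-D model problem -d/dx(alpha du/dx) = f with linear elements, and the constant source are assumptions chosen for brevity rather than the specific problem used in the paper.

```python
# Minimal 1-D illustration of element-matrix assembly and the global system K*phi = b,
# with Dirichlet conditions imposed by eliminating the constrained rows and columns.
import numpy as np

def assemble_1d(alpha, f, n_elems):
    n_nodes = n_elems + 1
    h = 1.0 / n_elems
    K = np.zeros((n_nodes, n_nodes))
    b = np.zeros(n_nodes)
    for e in range(n_elems):
        # Element stiffness for linear basis functions: (alpha_e / h) * [[1, -1], [-1, 1]]
        ke = (alpha[e] / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
        fe = f * h / 2.0 * np.ones(2)  # consistent load vector for a constant source
        nodes = [e, e + 1]
        K[np.ix_(nodes, nodes)] += ke
        b[nodes] += fe
    return K, b

def solve_dirichlet(K, b, fixed):
    """Eliminate rows/columns of constrained nodes, as the text describes for (9)."""
    free = [i for i in range(len(b)) if i not in fixed]
    rhs = b[free] - K[np.ix_(free, list(fixed))] @ np.array(list(fixed.values()))
    phi = np.zeros(len(b))
    phi[free] = np.linalg.solve(K[np.ix_(free, free)], rhs)
    for i, v in fixed.items():
        phi[i] = v
    return phi

K, b = assemble_1d(alpha=np.ones(10), f=1.0, n_elems=10)
print(solve_dirichlet(K, b, fixed={0: 0.0, 10: 0.0}))
```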
Boundary conditions for these problems are usually of two
types: natural boundary conditions and essential boundary
conditions. Essential boundary conditions (also referred to as
Dirichlet boundary conditions) impose constraints on the value
of the unknown at several nodes. Natural boundary condi-
tions (of which Neumann boundary conditions are a special
case) impose constraints on the change in across a boundary.
Dirichlet boundary conditions are imposed on the functional
minimization (9), by deleting the rows and columns of the
matrix corresponding to the nodes on the Dirichlet boundary
and modifying in (9). Fig. 3. FEM domain discretization using two elements and four nodes.
Natural boundary conditions are applied in the FEM by
adding an additional term to the functional. These boundary This process ensures that natural boundary conditions are im-conditions are then incorporated into the functional and are plicitlyandautomatically satisfiedduring theFEMsolutionpro-satisfied automatically during the solution procedure. As an cedure.example, consider the natural boundary condition represented
by the following equation [3] B. The FENN
on (10) This section describes how thefinite-element model can be
converted intoa parallel network form. Wefocus on solving typ-
where represents the Neumann boundary, is its outward ical inverse problems arising in electromagnetic NDE, but the
normal unit vector, is some constant, and , , and are basicideaisapplicabletootherareas aswell.NDEinverseprob-
known parameters associated with the boundary. Assuming that lems can be formulated as the problem offinding the material
the boundary is made up of segments, we can define properties (such as the conductivity or the permeability) within
boundary matrices and with elements the domain of the problem. Since the domain is discretized in
the FEM method by a large number of elements, the problem
can be posed as one offinding the material properties in each
of these elements. These properties are usually embedded in the
differential operator , or equivalently, in the global matrix .
Thus, in order to be able to iteratively estimate these properties
from the measurements, the material properties need to be sep-
arated out from . This separation is easier to achieve at the
element matrix level. For nodes and in element
(11)
where are basis functions defined over segment and is
the length of the segment. The elements of are added to the
elementsof that correspond tothe nodeson the boundary .
Similarly, the elements of are added to the corresponding
elements of . The global matrix (9) is thus modified as follows
before solving for (13)
where is the parameter representing the material property (12) in element and represents the differential operator at the
Fig. 4. FENN.
element level without embedded in it. Substituting (13) into neurons, corresponding to the members of the global ma-
the functional, we get trix . The output of each group of hidden layer neurons is the
corresponding row vector of . The weights from the input to
the hidden layer are set to the appropriate values of . Each(14) neuron in the hidden layer acts as a summation unit, (equivalent
toasummationfollowedbyalinearactivationfunction[5]).The
If we define outputs of the hidden layer neurons are the elements of the
global matrix as given in (15).
(15) Each group of hidden neurons is connected to one output
neuron (giving a total of output neurons) by a set of weights
, with each element of representing the nodal values .where Note that the set of weights between thefirst group of hidden
neurons and thefirst output neuron are the same as the set of(16)else weights between the second group of hidden neurons and the
second output neuron (as well as between successive groups
of hidden neurons and the corresponding output neuron). Each
output neuron is also a summation unit followed by a linear ac-
tivation function, and the output of each neuron is equal to :
(18)
(17)
where the second part of (18) is obtained by using (15). As an
Equation (17) expresses the functional explicitly in terms of . example, the FENN architecture for a two-element, four-node
The assumption that is constant within each element is im- FEM mesh (Fig. 3) is shown in Fig. 4. In this
plicit in this expression. This assumption is usually satisfied in case, the FENN has two input neurons, 16 hidden layer neurons
problems in NDE where each element in the FEM mesh is de- and four output neurons. Thefigure illustrates the grouping of
fined within the confines of a domain, and at no time does a the hidden layer neurons, as well as the similarity inherent in
single element cross domain boundaries. Furthermore, each el- the weights that connect each group of hidden layer neurons
ement is small enough that minor variations in within an el- to the corresponding output neuron. To simplify thefigure, the
ement may be ignored. Equation (17) can be easily converted weights between the network input and hidden layer neurons
into a parallel network form. The neural network comprises an are depicted by means of vectors (for
input, output and hidden layer. In the general case with el- , 2, 3, 4 and , 2), where the individual weight values
ements and nodes in the FEM mesh, the input layer with are defined as in (16).
network inputs takes the values in each element as input. 1) Boundary Conditions in the FENN: Note that the ele-
The hidden layer has neurons 1 arranged in groups of ments of and in (11) do not depend on the material prop-
1 erties . and need to be added appropriately to the global In this paper, we use the term“neurons”in the FENN (in the hidden and
output layers) to avoid confusion with the nodes in afinite-element mesh. matrix and the source vector as shown in (12). Equation RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1385
Fig. 5. Geometry of mesh for 1-D FEM.
Fig. 6. Flowchart (with example) for designing the FENN for a general PDE.
(12) thus implies that natural boundary conditions can be ap- layer neurons. These weights will be referred to as the clamped
plied in the FENN as bias inputs to the hidden layer neurons weights, while the remaining weights will be referred to as the
that are a part of the boundary, and the corresponding output free weights. An example of these weights is presented later.
neurons. Dirichlet boundary conditions are applied by clamping The FENN architecture was derived without consideration of
the corresponding weights between the hidden layer and output the dimensionality of the problem at hand, and thus can be used
for 1-, 2-, 3-, or higher dimensional problems. The number of
nodes and elements in the FEM mesh dictates the number of
neurons in the different layers. The weights between the input
and hidden layer change depending on node-element connec-
tivity information.
The major drawback of the FENN is the number of neurons
and weights necessary. However, the memory requirements can
be reduced considerably, since most of the weights between the
input and hidden layer are zero. These weights, and the corre-
sponding connections, can be discarded. Similarly, most of the Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b)
elements of the matrix are also zero ( is a banded ma- Problem description using symmetry considerations.
trix). The corresponding neurons in the hidden layer can also
be discarded, reducing memory and computation requirements The network implementation of (23) can be derived as fol-
considerably. Furthermore, the weights between each group of lows. If and values at each element are the inputs to the
hidden layer neurons and the output layer are the same . network, , , , and form the weights
Weight-sharing approaches can be used here to further reduce between the input and hidden layers. The network thus uses
the storage requirements. inputneuronsand hiddenneurons.Thevaluesof ateachof
thenodesareassigned asweightsbetweenthehidden andoutput
C. A 1-D Example layers, and the source is the desired output of this network
Consider the 1-D equation (corresponding to the output neurons). Dirichlet boundary
conditions on are applied as explained earlier.
(19) D. General Case
Fig. 6 shows aflowchart of the general scheme for convertingboundary conditions on the boundary defined by . a differential equation into the FENN structure. An exampleand are constants depending on the material and is the in two dimensions is also provided next to theflowchart. Weapplied source. Laplaces equation and Poissons equation are start with the differential equation and the boundary conditionsspecial cases of this equation. The FENN formulation for this and formulate the FEM using the variational method. This in-problem starts by discretizing the domain of interest with el- volves discretizing the domain of interest with elements andements and nodes. In one dimension, each element is defined nodes, selecting basis functions, writing the functional forby two nodes (Fig. 5). Define basis functions and over each element and obtaining the element matrices and the sourceeach element and let is the value of on node in element vector. The example presented uses the FEM mesh shown in. An example of the basis functions is shown in Fig. 5. Fig. 3, with elements, and nodes, and linearFor these basis functions, i.e., basis functions. The unknown solution to the differential equa-
tion is represented by its values at each of the nodes in the(20) finite-element mesh . The element matrices are then
separated into two parts, with one part dependent on the mate-the element matrices are given by [3] rial properties and while the other is independent of them.
The FENN is then designed to have input neurons,
hidden neurons, and output neurons, where is the number
of material property parameters. In the example under consid-
eration, , since we have two material property parameters(21) ( and ). Thefirst group of input neurons takes in the
values while the second group takes in the values in each ele-
ment. The weights from the input to the hidden layer are set to
the appropriate values of . In the example, since nodes 1, 2,
(22) and 3 are part of element 1 (see Fig. 3), the weights from thefirst
input node to thefirst group of four neurons in the hidden
Here, is the length of element . The global matrix is then layer are given by
constructed by selectively adding the element matrices based
on the nodes that form an element. Specifically, is a sparse
tridiagonal matrix, and its nonzero elements are given by (24)
The last weight is zero since node 4 is not a part of element 1.
Each group of hidden neurons is connected to one output
neuron (giving a total of output neurons) by a set of weights
, with each element of representing the nodal values . The
(23) output of each neuron in the output layer is equal to .
Fig. 8. Forward problem solutions for shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error
between (a) and (b). Thex- andy-axes show the nodes in the FEM discretization of the domain, and thez-axis in (c) shows the error at each of these nodes in volts.
III. FORWARD AND INVERSE PROBLEM FORMULATION USING FENN

The FENN architecture and algorithm lends itself to solving both the forward and inverse problems. The forward problem involves determining the hidden-to-output weights (the nodal values) given the material parameters and the applied source, while the inverse problem involves determining the material parameters given the nodal values and the source. Any optimization approach can be used to solve both these problems. Suppose we define the error between the output of the FENN and the desired output as

    (26)

Then, for a gradient-based approach, the gradient of the error with respect to the free hidden-layer weights is given by

    (27)

Equation (27) can be used to solve the forward problem. Similarly, to solve the inverse problem, the gradients of the error with respect to the material parameters (the inputs of the FENN) are necessary, and are given by

    (28)

    (29)
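The explicit forms of (26)-(29) did not survive extraction. The sketch below therefore assumes the natural squared-error cost for a network whose output is K(alpha)·phi with K(alpha) = sum_e alpha_e W_e, which matches the construction described in Section II; the names W_elems, alpha, phi and b, as well as the learning rates, are assumptions.

```python
# Hedged sketch of gradient-based forward and inverse iterations for a FENN-like model.
import numpy as np

def fenn_output(W_elems, alpha, phi):
    K = sum(a * W for a, W in zip(alpha, W_elems))  # hidden layer: entries of the global matrix
    return K, K @ phi                               # output layer: K * phi

def forward_solve(W_elems, alpha, b, phi0, lr=0.1, steps=2000):
    """Forward problem: fit the nodal values phi for known material parameters alpha."""
    phi = phi0.copy()
    for _ in range(steps):
        K, y = fenn_output(W_elems, alpha, phi)
        residual = b - y
        phi += lr * K.T @ residual                  # negative gradient of 0.5*||b - K phi||^2
    return phi

def inverse_step(W_elems, alpha, phi, b, lr=0.1):
    """Inverse problem: one gradient step on the material parameters alpha."""
    K, y = fenn_output(W_elems, alpha, phi)
    residual = b - y
    grads = np.array([residual @ (W @ phi) for W in W_elems])  # -dE/d(alpha_e)
    return alpha + lr * grads
```

In a fuller implementation the clamped (Dirichlet) weights described in Section II-B would be held fixed during these updates, and the plain gradient steps could be replaced by conjugate-gradient updates, as the paper itself suggests for future work.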
TABLE I
SUMMARY OF PERFORMANCE OF THE FENN A LGORITHM FOR VARIOUS PDE S
For the forward problem, such an approach is equivalent to the Dirichlet boundary, with on the microstrip and on
iterative approaches used to solve for the unknown nodal values the outer boundary [Fig. 7(b)]. Finally, there is no source term
in the FEM [4]. in this example (the source term would correspond to a charge
distribution in the domain of interest), i.e., . In this ex-
IV. R ESULTS ample, we assume that volts and . Further, we
assume that the domain of interest is .A. Forward Model Results The solution to the forward problem is presented in Fig. 8,
The FENN was tested using both 1- and 2-D versions of with the FEM solution using 11 nodes in each direction shown
Poissons equation in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b).
(30) Thesefigures show contours of constant potential. The error be-
tween the FEM and FENN solutions is presented in Fig. 8(c). As
where represents the material property, and is the applied seen from thefigure, the FENN is seen to match the FEM solu-
source. For instance, in electromagnetics may represent the tion accurately, with the peak error at any node on the order of
permittivity while represents the charge density. .
As thefirst example, consider the following 2-D equation Several other examples were also used to test the FENN and
the results are summarized in Table I. Column 1 shows the
(31) PDE used to evaluate the FENN performance, while column 2
shows the boundary conditions used. The analytic solution to
with boundary conditions the problem is indicated in Column 3. The FENN structure and
on (32) the number of iterations for convergence using a gradient de-
scent approach are indicated in Columns 4 and 5, respectively.
and The FENN structure, as explained earlier, has inputs,
hidden neurons and output neurons, where and are the
on (33) number of elements and nodes in the FEM mesh, respectively,
and is the number of hidden neurons, and corresponds to the
This is the governing equation for the shielded microstrip trans- number of nonzero elements in the FEM global matrix . Fi-
mission line problem shown in Fig. 7. The forward problem nally, Columns 6 and 7 present the sum-squared error (SSE) and
computes the electric potential due to the shielded microstrip the maximum error in the solution, respectively, where the er-
shown in Fig. 7(a). The potentials are zero on the shielding con- rors are computed with respect to the analytical solution. These
ductor.Sincethegeometryissymmetric,wecansolvetheequiv- results indicate that the FENN is capable of accurately deter-
alent problem shown in Fig. 7(b), by applying the homogeneous mining the potential . One advantage of the FENN approach
Neumann condition on the plane of symmetry. The inner con- is that the computation of the input-hidden layer weights is a
ductor (microstrip) is held at a constant potential of volts. one-time process, as long as the differential equation does not
Finally, we also assume that the material inside the shielding change. The only changes necessary to solve the different prob-
conductor has a permittivity , where K is a constant. The lems are changes in the input and the desired output .
permittivity in this case corresponds to the material property .
Specifically, and . The homogeneous Neu- B. Inverse Model Results
mann boundary condition is equivalent to setting . TheFENNwasalsousedtosolveseveralsimpleinverseprob-
The microstrip and the shielding conductor correspond to the lems based on (30). In all cases, the objective was to determine
Fig. 9. FENN inversion results for Poissons equation with initial solutions (a) = x . (b) =1+ x .
the value of and for given values of and . Thefirst ex- In order to obtain a unique solution, we need to constrain the
ample is a 1-D problem that involves determining given value of at the boundary as well. Consider the same differen-
and , for the differential equation tial equation as (34), but with and specified as follows:
(34) and
with boundary conditions and . The analyt- (36)
ical solution to this inverse problem is The analytical solution for this equation is .To
and (35) solve this problem, we set and clamp the value of at
As seen from (35), the problem has an infinite number of solu- and as follows: , .
tions and we expect the solution procedure to converge to one The results of the constrained inversion obtained using 11
of these solutions depending on the initial value. nodes and 10 elements in the correspondingfinite-element mesh
Fig. 9(a) and (b) shows two solutions to this inverse problem are shown in Fig. 10. Fig. 10(a) shows the comparison between
for two different initializations (shown using triangles). In both the analytical solution (solid line with squares) and the FENN
cases, the FENN solution (in stars) is seen to match the analyt- result (solid line with stars). The initial value of is shown in
ical solution (squares). The SSE in both cases was on the order thefigure as a dashed line. Fig. 10(b) shows the comparison
of . between the actual and desired forcing function at the FENN 1390 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005
Fig. 10. Constrained inversion result with eleven nodes. (a) Comparison of analytic and simulation results for . (b) Comparison of actual and desired NN outputs.
output. This result indicates that the SSE in the forcing function, weight structure that allows both the forward and inverse prob-
as well as the SSE in the inversion result, is fairly large (0.0148 lemstobesolvedusingsimplegradient-basedalgorithms.Initial
and 0.0197, respectively). The reason for this was traced back results indicate that the proposed FENN algorithm is capable of
to the mesh discretization. Fig. 11 shows the SSE in the output accurately solving both the forward and inverse problems. In
of the FENN and the SSE in the inverse problem solution as a addition, the forward problem solution from the FENN is seen
function of FEM discretization. It is seen that increasing the dis- to exactly match the FEM solution, indicating that the FENN
cretization significantly improves the solution. Similar results represents thefinite-element model exactly in a parallel config-
were observed for other problems. uration.
The major advantage of the FENN is that it represents the
finite-element model in a parallel form, enabling parallel imple-
V. D ISCUSSION AND CONCLUSION mentation in either hardware or software. Further, computing
gradients in the FENN is very simple. This is an advantage in
The FENN is closely related to thefinite-element model used solving bothforward and inverse problems using gradient-based
to solve differential equations. The FENN architecture has a methods. The gradients can also be computed in parallel and
Fig. 11. SSE in FENN output and inversion results as a function of discretization.
the lack of nonlinearities in the neuron activation functions [6] C. A. Jensenet al.,“Inversion of feedforward neural networks: algo-
makes the computation of gradients simpler. A major advantage rithms and applications,”Proc. IEEE, vol. 87, no. 9, pp. 15361549,
of this approach for solving inverse problems is that it avoids 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa,“Neural networkalgorithm for elec-
inverting the global matrix in each iteration. The FENN also tromagnetic NDE signal inversion,”inENDE 2000, Budapest, Hungary,
does not require any training, since most of its weights can be Jun. 2000.
computed in advance and stored. The weights depend on the [8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr.,
and A. P. Ewing,“Automation of SQUID nondestructive evaluation of
governing differential equation and its associated boundary steel plates by neural networks,”IEEE Trans. Appl. Supercond., vol. 9,
conditions, and as long as these two factors do not change, no. 2, pp. 34753478, 1999.
the weights do not change. This is especially an advantage [9] W.Qing, S. Xueqin,Y.Qingxin,and Y.Weili,“Usingwaveletneural net-
works for the optimal design of electromagnetic devices,”IEEE Trans.
in solving inverse problems in electromagnetic NDE. This Magn., vol. 33, no. 2, pp. 19281930, 1997.
approach also reduces the computational effort associated with [10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis,“Artificial neural networks
the network. for solving ordinary and partial differential equations,”IEEE Trans.
Neural Netw., vol. 9, no. 5, pp. 9871000, 1998.
Future work will concentrate on applying the FENN to 3-D [11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou,“Neural-network
electromagnetic NDE problems. The robustness of the approach methods for boundary value problems with irregular boundaries,”IEEE
will also be tested, since the ability of these approaches to in- Trans. Neural Netw., vol. 11, no. 5, pp. 10411049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez,“Neural network
vert practical noisy measurements is important. Furthermore, differential equation and plasma equilibrium solver,”Phys. Rev. Lett.,
the use of better optimization algorithms, like conjugate gra- vol. 75, no. 20, pp. 35943597, 1995.
dient methods, is expected to improve the solution speed. In ad- [13] M. W. M. G. Dissanayake and N. Phan-Thien,“Neural-network-based
approximations for solving partial differential equations,”Commun.
dition, parallel implementation of the FENN in both hardware Numer. Meth. Eng., vol. 10, pp. 195201, 1994.
and software is under investigation. The approach described in [14] R. Masuoka,“Neural networks learning differential data,”IEICE Trans.
this paper is very general in that it can be applied to a variety Inform. Syst., vol. E83-D, no. 6, pp. 12911300, 2000.
[15] D.C.Youla,“Generalizedimagerestorationbythemethodofalternating
of inverse problems infields other than electromagnetic NDE. orthogonal projections,”IEEE Trans. Circuits Syst., vol. CAS-25, no. 9,
Some of these other applications will also be investigated to pp. 694702, 1978.
show the general nature of the proposed method. [16] D. C. Youla and H. Webb,“Image restoration by the method of convex
projections: part I—theory,”IEEE Trans. Med. Imag., vol. MI-1, no. 2,
pp. 8194, 1982.
REFERENCES [17] A. Lent and H. Tuy,“An iterative method for the extrapolation of band-
limitedfunctions,”J.Math.AnalysisandApplicat.,vol.83, pp.554565,
[1] L. Udpa and S. S. Udpa,“Application of signal processing and pattern 1981.
recognition techniques to inverse problems in NDE,”Int. J. Appl. Elec- [18] W. Chen,“A new extrapolation algorithm for band-limited signals using
tromagn. Mechan., vol. 8, pp. 99117, 1997. the regularization method,”IEEE Trans. Signal Process., vol. 41, no. 3,
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. pp. 10481060, 1993.
Sacks,“Iterative algorithms for electromagnetic NDE signal inversion,” [19] J. Takeuchi and Y. Kosugi,“Neural network representation of thefinite
inENDE 97, Reggio Calabria, Italy, Sep. 1416, 1997. element method,”Neural Netw., vol. 7, no. 2, pp. 389395, 1994.
[3] J. Jin,The Finite Element Method in Electromagnetics. New York: [20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady,“Artificial neural net-
Wiley, 1993. work application for material evaluation by electromagnetic methods,”
[4] P. Zhou,Numerical Analysis of Electromagnetic Fields. Berlin, Ger- inProc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 40274032.
many: Springer-Verlag, 1993. [21] G. Xu, G. Littlefair, R. Penson, and R. Callan,“Application of FE-based
[5] S. Haykin,Neural Networks: A Comprehensive Foundation. Upper neural networks to dynamic problems,”inProc. Int. Conf. Neural Infor-
Saddle River, NJ: Prentice-Hall, 1994. mation Processing, vol. 3, 1999, pp. 1039-1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu,“Finite element anal- Lalita Udpa (S84M86SM96) received the
ysis-based Hopfield neural network model for solving nonlinear elec- Ph.D. degree in electrical engineering from Col-
tromagneticfield problems,”inProc. Int. Joint Conf. Neural Networks, orado State University, Fort Collins, in 1986.
vol. 6, 1999, pp. 43994403. She is currently a Professor with the Department
[23] H. Lee and I. S. Kang,“Neural algorithm for solving differential equa- of Electrical and Computer Engineering, Michigan
tions,”J. Computat. Phys., vol. 91, pp. 110131, 1990. State University, East Lansing. She works primarily
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz,“FEM-based neural-network in the broad areas of nondestructive evaluation,
approach to nonlinear modeling with application to longitudinal vehicle signal processing, and biomedical applications. Her
dynamics control,”IEEE Trans. Neural Netw., vol. 10, no. 4, pp. research interests include various aspects of NDE,
885897, 1999. such as development of computational models for
[25] R. K. Mishra and P. S. Hall,“NFDTD concept,”IEEE Trans. Neural the forward problem in NDE, signal and image pro-
Netw., vol. 16, no. 2, pp. 484490, 2005. cessing, pattern recognition and neural networks, and development of solution
[26] D. G. Triantafyllidis and D. P. Labridis,“Afinite-element mesh gener- techniques for inverse problems. Her current projects includefinite-element
ator based on growing neural networks,”IEEE Trans. Neural Netw., vol. modeling of electromagnetic NDE phenomena, application of neural network
13, no. 6, pp. 14821496, 2002. and signal processing algorithms to NDE data, and development of image
processing techniques for the analysis of NDE and biomedical images.
Dr. Udpa is a Member of Eta Kappa Nu and Sigma Xi.
Satish S. Udpa(S82M82SM91F03) received
the B.Tech. degree in 1975 and the Post Graduate
Diplomainelectricalengineeringin1977fromJ.N.T.
University, Hyderabad, India. He received the M.S.
degree in 1980 and the Ph.D. degree in electrical en-
gineering in 1983, both from Colorado State Univer-
sity, Fort Collins.
He has been with Michigan State University, East
Lansing, since 2001 and is currently Acting Dean for
the College of Engineering and a Professor with the
Electrical and Computer Engineering Department.
Prior to joining Michigan State, he was a Professor with Iowa State University,
Ames, from 1990 to 2001 and was associated with the Materials Assessment
Research Group. Prior to joining Iowa State, he was an Associate Professor
with the Department of Electrical Engineering at Colorado State University.
His research interests span the broad area of materials characterization and
nondestructive evaluation (NDE). Work done by him to date in the area includes
an extensive repertoire of forward models for simulating physical processes
underlying several inspection techniques. Coupled with careful experimental
Pradeep Ramuhalli (S92M02) received the work, such forward models can be used for designing new sensors, optimizing
B.Tech. degree from J.N.T. University, Hyderabad, test conditions, estimating the probability of detection, assessing designs for
India, in electronics and communications engi- inspectability and training inverse models for characterizing defects. He has
neering in 1995, and the M.S. and Ph.D. degrees in also been involved in the development of system-, as well as model-based,
electrical engineering from Iowa State University, inverse solutions for defect and material property characterization. His interests
Ames, in 1998 and 2002, respectively. have expanded in recent years to include the development of noninvasive
He is currently an Assistant Professor with the tools for clinical applications. Work done to date in thisfield includes the
Department of Electrical and Computer Engi- development of new electromagnetic-acoustic (EMAT) methods for detecting
neering, Michigan State University, East Lansing. single leg separation failures in artificial heart valves and microwave imaging
His research is in the general area of nondestruc- and ablation therapy systems. He and his research group have been engaged
tive evaluation and materials characterization. His in the design and development of high-performance instrumentation including
research interests include the application of signal and image processing acoustic microscopes and single and multifrequency eddy current NDE instru-
methods, pattern recognition and neural networks for nondestructive evaluation ments. These systems, as well as software packages embodying algorithms
applications, development of model-based solutions for inverse problems in developed by Udpa for defect classification and characterization, have been
NDE, and the development of information fusion algorithms for multimodal licensed to industry.
data fusion. He is a Fellow of the American Society for Nondestructive Testing (ASNT)
Dr. Ramuhalli is a Member of Phi Kappa Phi. and a Fellow of the Indian Society of Nondestructive Testing.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,399 @@
Learning Efficient Convolutional Networks through Network Slimming
Zhuang Liu 1 Jianguo Li 2 Zhiqiang Shen 3 Gao Huang 4 Shoumeng Yan 2 Changshui Zhang 1
1 CSAI, TNList, Tsinghua University 2 Intel Labs China 3 Fudan University 4 Cornell University
{liuzhuangthu, zhiqiangshen0214}@gmail.com,{jianguo.li, shoumeng.yan}@intel.com,
gh349@cornell.edu, zcs@mail.tsinghua.edu.cn
Abstract

The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.
1. Introduction Many works have been proposed to compress large
CNNs or directly learn more efficient CNN models for fast
In recent years, convolutional neural networks (CNNs) inference. These include low-rank approximation [7], net-
have become the dominant approach for a variety of com- work quantization [3, 12] and binarization [28, 6], weight
puter vision tasks, e.g., image classification [22], object pruning [12], dynamic inference [16], etc. However, most
detection [8], semantic segmentation [26]. Large-scale of these methods can only address one or two challenges
datasets, high-end modern GPUs and new network architec- mentioned above. Moreover, some of the techniques require
tures allow the development of unprecedented large CNN specially designed software/hardware accelerators for exe-
models. For instance, from AlexNet [22], VGGNet [31] and cution speedup [28, 6, 12].
GoogleNet [34] to ResNets [14], the ImageNet Classifica- Another direction to reduce the resource consumption of
tion Challenge winner models have evolved from 8 layers large CNNs is to sparsify the network. Sparsity can be im-
to more than 100 layers. posed on different level of structures [2, 37, 35, 29, 25],
This work was done when Zhuang Liu and Zhiqiang Shen were interns which yields considerable model-size compression and in-
at Intel Labs China. Jianguo Li is the corresponding author. ference speedup. However, these approaches generally re-
[Figure 1 diagram: the channels C_i1 ... C_in of the i-th conv-layer with their channel scaling factors (e.g., 1.170, 0.001, 0.290, 0.003, ..., 0.820) in the initial network, and the compact network obtained after pruning the channels with small factors.]
Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity
regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small
scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then
fine-tuned to achieve comparable (or even higher) accuracy as normally trained full network.
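In a modern framework this reuse of the batch normalization parameters needs no extra machinery. The snippet below is a minimal PyTorch illustration (PyTorch is our choice here; the authors' released implementation is in Torch), showing that each BatchNorm2d layer already holds one trainable scaling factor per channel, which is exactly the per-channel factor depicted in Figure 1:

import torch
import torch.nn as nn

# A minimal conv + BN block: the BN weight (gamma) is a vector with one entry
# per output channel and multiplies the normalized activation of that channel,
# so it can serve directly as the channel scaling factor of Figure 1.
block = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)

gamma = block[1].weight      # shape: [32], one scaling factor per channel
print(gamma.shape)           # torch.Size([32])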
In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates channel-level pruning at the following step. The additional regularization term rarely hurts the performance; in fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated by the subsequent fine-tuning of the pruned network. After pruning, the resulting narrower network is much more compact in terms of model size, run-time memory, and computing operations compared to the initial wide network. The above process can be repeated several times, yielding a multi-pass network slimming scheme which leads to an even more compact network.
Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20x model-size compression and 5x reduction in computing operations relative to the original ones, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storing format or computing operations.
2. Related Work
In this section, we discuss related work from five aspects.
Low-rank Decomposition approximates the weight matrix in neural networks with a low-rank matrix using techniques like Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yielding roughly 3x model-size compression, however without notable speed acceleration, since computing operations in a CNN mainly come from the convolutional layers.
Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups and within each group the weight value is shared. In this way only the shared weights and hash indices need to be stored, thus a large amount of storage space can be saved. [12] uses an improved quantization technique in a deep compression pipeline and achieves 35x to 49x compression rates on AlexNet and VGGNet. However, these techniques can neither save run-time memory nor inference time, since during inference the shared weights need to be restored to their original positions.
[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {-1, 1} or {-1, 0, 1}). This yields a large amount of model-size saving, and significant speedup can also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation method usually comes with a moderate accuracy loss.
Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neural networks. The resulting network's weights are mostly zeros, thus the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) instead of the weights.
In [12], there is no guidance for sparsity during training. [32] overcomes this limitation by explicitly imposing a sparse constraint over each weight with additional gate variables, and achieves high compression rates by pruning connections with zero gate values. This method achieves a better compression rate than [12], but suffers from the same drawback.
Structured Pruning / Sparsifying. Recently, [23] proposes to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by randomly deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to a smoother channel pruning process and little accuracy loss.
[37] imposes neuron-level sparsity during training, thus some neurons can be pruned to obtain compact networks. [35] proposes a Structured Sparsity Learning (SSL) method to sparsify different levels of structure (e.g., filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, thus the optimization objective is much simpler.
Since these methods prune or sparsify parts of the network structure (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries (e.g., for sparse computing operations) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits.
Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces sub-modular/super-modular optimization for network architecture search with a given resource budget. Some recent works [38, 1] propose to learn neural architectures automatically with reinforcement learning. The search space of these methods is extremely large, thus one needs to train hundreds of models to distinguish good from bad ones. Network slimming can also be treated as an approach for architecture learning, although the choices are limited to the width of each layer. However, in contrast to the aforementioned methods, network slimming learns the network architecture through only a single training process, which is in line with our goal of efficiency.
3. Network slimming
We aim to provide a simple scheme to achieve channel-level sparsity in deep CNNs. In this section, we first discuss the advantages and challenges of channel-level sparsity, and introduce how we leverage the scaling layers in batch normalization to effectively identify and prune unimportant channels in the network.
Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at different levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained (e.g., weight-level) sparsity gives the highest flexibility and generality and leads to a higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. On the contrary, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, while it is less flexible as some whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNN or fully-connected network (treating each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, which can be efficiently inferenced on conventional CNN platforms.
Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near-zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of about 10% in the number of parameters without suffering from accuracy loss. [35] addresses this problem by enforcing sparsity regularization in the training objective. Specifically, they adopt group LASSO to push all the filter weights corresponding to the same channel towards zero simultaneously during training. However, this approach requires computing the gradients of the additional regularization term with respect to all the filter weights, which is nontrivial. We introduce a simple idea to address the above challenges, and the details are presented below.
Scaling Factors and Sparsity-induced Penalty. Our idea is to introduce a scaling factor gamma for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by

L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)    (1)

where (x, y) denote the training input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(.) is a sparsity-induced penalty on the scaling factors, and \lambda balances the two terms. In our experiment, we choose g(s) = |s|, which is known as the L1-norm and is widely used to achieve sparsity.
Subgradient descent is adopted as the optimization method for the non-smooth L1 penalty term. An alternative option is to replace the L1 penalty with the smooth-L1 penalty [30] to avoid using sub-gradients at non-smooth points.
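As a concrete illustration of this subgradient update, the sketch below adds lambda * sign(gamma) to the gradient of every BN scaling factor after the ordinary backward pass. This is a minimal PyTorch sketch written for this survey copy, not the authors' released Torch code; the model, data and criterion are placeholders, and lambda = 1e-4 is the value the paper reports for VGGNet.

import torch
import torch.nn as nn

def sparsity_step(model, optimizer, criterion, x, y, lam=1e-4):
    """One training step implementing Eq. (1): task loss plus an L1
    subgradient on the BN scaling factors (the per-channel gammas)."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Subgradient of lam * sum(|gamma|), added on top of the task-loss gradient.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.add_(lam * torch.sign(m.weight.detach()))
    optimizer.step()
    return loss.item()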
As pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel, we can directly obtain a narrow network (see Figure 1) without resorting to any special sparse computation packages. The scaling factors act as the agents for channel selection. As they are jointly optimized with the network weights, the network can automatically identify insignificant channels, which can be safely removed without greatly affecting the generalization performance.
Leveraging the Scaling Factors in BN Layers. Batch normalization [19] has been adopted by most modern CNNs as a standard approach to achieve fast convergence and better generalization performance. The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. In particular, a BN layer normalizes the internal activations using mini-batch statistics. Let z_in and z_out be the input and output of a BN layer and B denote the current mini-batch; the BN layer performs the following transformation:

\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta    (2)

where \mu_B and \sigma_B are the mean and standard deviation of the input activations over B, and \gamma and \beta are trainable affine transformation parameters (scale and shift) which provide the possibility of linearly transforming normalized activations back to any scale.
It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the \gamma parameters in BN layers as the scaling factors we need for network slimming. This has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way to learn meaningful scaling factors for channel pruning. 1) If we add scaling layers to a CNN without BN layers, the values of the scaling factors are not meaningful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations; one can obtain the same results by decreasing the scaling factor values while amplifying the weights in the convolution layers. 2) If we insert a scaling layer before a BN layer, the scaling effect of the scaling layer will be completely canceled by the normalization process in BN. 3) If we insert a scaling layer after a BN layer, there are two consecutive scaling factors for each channel.
Figure 2: Flow-chart of the network slimming procedure (train with channel sparsity regularization, prune channels with small scaling factors, fine-tune the pruned network, obtain the compact network). The dotted line is for the multi-pass/iterative scheme.
Channel Pruning and Fine-tuning. After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero (see Figure 1). Then we can prune channels with near-zero scaling factors by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune 70% of channels with lower scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with fewer parameters, less run-time memory, and fewer computing operations.
Pruning may temporarily lead to some accuracy loss when the pruning ratio is high, but this can be largely compensated by the subsequent fine-tuning of the pruned network. In our experiments, the fine-tuned narrow network can even achieve higher accuracy than the original unpruned network in many cases.
Multi-pass Scheme. We can also extend the proposed method from a single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. Specifically, a network slimming procedure results in a narrow network, on which we can again apply the whole training procedure to learn an even more compact model. This is illustrated by the dotted line in Figure 2. Experimental results show that this multi-pass scheme can lead to even better results in terms of compression rate.
Handling Cross Layer Connections and Pre-activation Structure. The network slimming process introduced above can be directly applied to most plain CNN architectures such as AlexNet [22] and VGGNet [31], while some adaptations are required when it is applied to modern networks with cross layer connections and the pre-activation design, such as ResNet [15] and DenseNet [17]. For these networks, the output of a layer may be treated as the input of multiple subsequent layers, in which a BN layer is placed before the convolutional layer. In this case, the sparsity is achieved at the incoming end of a layer, i.e., the layer selectively uses a subset of the channels it receives. To harvest the parameter and computation savings at test time, we need to place a channel selection layer to mask out the insignificant channels we have identified.
4. Experiments
We empirically demonstrate the effectiveness of network slimming on several benchmark datasets. We implement
(a) Test Errors on CIFAR-10
Model Test error (%) Parameters Pruned FLOPs Pruned
VGGNet (Baseline) 6.34 20.04M - 7.97x10^8 -
VGGNet (70% Pruned) 6.20 2.30M 88.5% 3.91x10^8 51.0%
DenseNet-40 (Baseline) 6.11 1.02M - 5.33x10^8 -
DenseNet-40 (40% Pruned) 5.19 0.66M 35.7% 3.81x10^8 28.4%
DenseNet-40 (70% Pruned) 5.65 0.35M 65.2% 2.40x10^8 55.0%
ResNet-164 (Baseline) 5.42 1.70M - 4.99x10^8 -
ResNet-164 (40% Pruned) 5.08 1.44M 14.9% 3.81x10^8 23.7%
ResNet-164 (60% Pruned) 5.27 1.10M 35.2% 2.75x10^8 44.9%
(b) Test Errors on CIFAR-100
Model Test error (%) Parameters Pruned FLOPs Pruned
VGGNet (Baseline) 26.74 20.08M - 7.97x10^8 -
VGGNet (50% Pruned) 26.52 5.00M 75.1% 5.01x10^8 37.1%
DenseNet-40 (Baseline) 25.36 1.06M - 5.33x10^8 -
DenseNet-40 (40% Pruned) 25.28 0.66M 37.5% 3.71x10^8 30.3%
DenseNet-40 (60% Pruned) 25.72 0.46M 54.6% 2.81x10^8 47.1%
ResNet-164 (Baseline) 23.37 1.73M - 5.00x10^8 -
ResNet-164 (40% Pruned) 22.87 1.46M 15.5% 3.33x10^8 33.3%
ResNet-164 (60% Pruned) 23.91 1.21M 29.7% 2.47x10^8 50.6%
(c) Test Errors on SVHN
Model Test Error (%) Parameters Pruned FLOPs Pruned
VGGNet (Baseline) 2.17 20.04M - 7.97x10^8 -
VGGNet (60% Pruned) 2.06 3.04M 84.8% 3.98x10^8 50.1%
DenseNet-40 (Baseline) 1.89 1.02M - 5.33x10^8 -
DenseNet-40 (40% Pruned) 1.79 0.65M 36.3% 3.69x10^8 30.8%
DenseNet-40 (60% Pruned) 1.81 0.44M 56.6% 2.67x10^8 49.8%
ResNet-164 (Baseline) 1.78 1.70M - 4.99x10^8 -
ResNet-164 (40% Pruned) 1.85 1.46M 14.5% 3.44x10^8 31.1%
ResNet-164 (60% Pruned) 1.81 1.12M 34.3% 2.25x10^8 54.9%
Table 1: Results on CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column 1, "60% Pruned" denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratios of parameters and FLOPs are also shown in columns 4 and 6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy can typically be maintained with >=60% channels pruned.
our method based on the publicly available Torch [5] implementation for ResNets by [10]. The code is available at https://github.com/liuzhuang13/slimming.
4.1. Datasets
CIFAR. The two CIFAR datasets [21] consist of natural images with resolution 32x32. CIFAR-10 is drawn from 10 and CIFAR-100 from 100 classes. The train and test sets contain 50,000 and 10,000 images respectively. On CIFAR-10, a validation set of 5,000 images is split from the training set for the search of \lambda (in Equation 1) on each model. We report the final test errors after training or fine-tuning on all training images. A standard data augmentation scheme (shifting/mirroring) [14, 18, 24] is adopted. The input data is normalized using channel means and standard deviations. We also compare our method with [23] on the CIFAR datasets.
SVHN. The Street View House Number (SVHN) dataset [27] consists of 32x32 colored digit images. Following common practice [9, 18, 24] we use all the 604,388 training images, from which we split a validation set of 6,000 images for model selection during training. The test set contains 26,032 images. During training, we select the model with the lowest validation error as the model to be pruned (or the baseline model). We also report the test errors of the models with the lowest validation errors during fine-tuning.
ImageNet. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. We adopt the data augmentation scheme as in [10]. We report the single-center-crop validation error of the final model.
MNIST. MNIST is a handwritten digit dataset containing 60,000 training images and 10,000 test images. To test the effectiveness of our method on a fully-connected network (treating each neuron as a channel with 1x1 spatial size), we compare our method with [35] on this dataset.
4.2. Network Models
On the CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. The VGGNet is originally designed for ImageNet classification; for our experiment a variation of the original VGGNet for the CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).
[Figure 3 bar chart: parameter and FLOP ratios (original = 100%) of the pruned VGGNet, DenseNet-40 and ResNet-164 models.]
Figure 3: Comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are parameter and FLOP ratios between pruned and original models.
On the ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) "VGG-A" network [31] model with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1x1 spatial size.
On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].
4.3. Training, Pruning and Fine-tuning
Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On the CIFAR and SVHN datasets we train using mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet and MNIST datasets, we train our models for 60 and 30 epochs respectively, with a batch size of 256 and an initial learning rate of 0.1 which is divided by 10 after 1/3 and 2/3 fractions of the training epochs. We use a weight decay of 10^-4 and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to be 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to be 1) from [10].
Training with Sparsity. For the CIFAR and SVHN datasets, when training with channel sparse regularization, the hyperparameter \lambda, which controls the tradeoff between empirical loss and sparsity, is determined by a grid search over 10^-3, 10^-4, 10^-5 on the CIFAR-10 validation set. For VGGNet we choose \lambda = 10^-4 and for ResNet and DenseNet \lambda = 10^-5. For VGG-A on ImageNet, we set \lambda = 10^-5. All other settings are kept the same as in normal training.
Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of channels are pruned. The pruning process is implemented by building a new narrower model and copying the corresponding weights from the model trained with sparsity.
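The global percentile threshold described above can be computed in a few lines. The sketch below, again in PyTorch rather than the authors' Torch code, collects the absolute scaling factors from every BN layer, derives a single threshold, and returns per-layer boolean keep-masks; the construction of the narrower network and the weight copying are omitted.

import torch
import torch.nn as nn

def channel_masks(model: nn.Module, prune_ratio: float = 0.6):
    """Global-threshold channel selection: gather |gamma| from every BN layer,
    take the prune_ratio percentile as one threshold for the whole network,
    and return a per-layer boolean mask of the channels to keep."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    values, _ = torch.sort(gammas)
    k = min(int(prune_ratio * gammas.numel()), gammas.numel() - 1)
    threshold = values[k]                       # factors below this are pruned
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}

# A new, narrower model with sum(mask) channels per layer is then built and the
# surviving weights are copied over before fine-tuning, as described above.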
Fine-tuning. After the pruning we obtain a narrower and more compact model, which is then fine-tuned. On the CIFAR, SVHN and MNIST datasets, the fine-tuning uses the same optimization setting as in training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10^-3 for only 5 epochs.
4.4. Results
CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.
Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has >=60% channels pruned while still maintaining similar accuracy to the baseline. The parameter saving can be up to 10x, and the FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as a form of channel selection. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.
Regularization Effect. From Table 1, we can observe that, on ResNet and DenseNet, typically when 40% of channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.
VGG-A                 Baseline      50% Pruned
Params                132.9M        23.2M
Params Pruned         -             82.5%
FLOPs                 4.57x10^10    3.18x10^10
FLOPs Pruned          -             30.4%
Validation Error (%)  36.69         36.66
Table 2: Results on ImageNet.
Model          Test Error (%)  Params Pruned  #Neurons
Baseline       1.43            -              784-500-300-10
Pruned [35]    1.53            83.5%          434-174-78-10
Pruned (ours)  1.49            84.4%          784-100-60-10
Table 3: Results on MNIST.
(a) Multi-pass Scheme on CIFAR-10
Iter  Trained  Fine-tuned  Params Pruned  FLOPs Pruned
1     6.38     6.51        66.7%          38.6%
2     6.23     6.11        84.7%          52.7%
3     5.87     6.10        91.4%          63.1%
4     6.19     6.59        95.6%          77.2%
5     5.96     7.73        98.3%          88.7%
6     7.79     9.70        99.4%          95.7%
(b) Multi-pass Scheme on CIFAR-100
Iter  Trained  Fine-tuned  Params Pruned  FLOPs Pruned
1     27.72    26.52       59.1%          30.9%
2     26.03    26.52       79.2%          46.1%
3     26.49    29.08       89.8%          67.3%
4     28.17    30.59       95.3%          83.0%
5     30.04    36.35       98.3%          93.5%
6     35.91    46.73       99.4%          97.7%
Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. The "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity and of the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.
ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of channels are pruned, the parameter saving is more than 5x, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve
the savings with no accuracy loss on the 1000-class Im- more compact models. On CIFAR-10, the trained modelageNet dataset, where other methods for efficient CNNs achieves the lowest test error in iteration 5. This model[2, 23, 35, 28] mostly report accuracy loss. achieves 20×parameter reduction and 5×FLOP reduction,
MNIST.On MNIST dataset, we compare our method with while still achievinglowertest error. On CIFAR-100, after
the Structured Sparsity Learning (SSL) method [35] in Ta- iteration 3, the test error begins to increase. This is pos-
ble 3. Despite our method is mainly designed to prune sibly due to that it contains more classes than CIFAR-10,
channels in convolutional layers, it also works well in prun- so pruning channels too agressively will inevitably hurt the
ing neurons in fully-connected layers. In this experiment, performance. However, we can still prune near 90% param-
we observe that pruning with a global threshold sometimes eters and near 70% FLOPs without notable accuracy loss.
completely removes a layer, thus we prune 80% of the neu-
rons in each of the two intermediate layers. Our method 5. Analysis
slightly outperforms [35], in that a slightly lower test error There are two crucial hyper-parameters in network slim-is achieved while pruning more parameters. ming, the pruned percentagetand the coefficient of the
We provide some additional experimental results in the sparsity regularization termλ(see Equation 1). In this sec-
supplementary materials, including (1) detailed structure of tion, we analyze their effects in more detail.
a compact VGGNet on CIFAR-10; (2) wall-clock time and Effect of Pruned Percentage. Once we obtain a modelrun-time memory savings in practice. (3) comparison with trained with sparsity regularization, we need to decide whata previous channel pruning method [23]; percentage of channels to prune from the model. If we
4.5. Results for Multi­pass Scheme prune too few channels, the resource saving can be very
limited. However, it could be destructive to the model if
We employ the multi-pass scheme on CIFAR datasets we prune too many channels, and it may not be possible to
using VGGNet. Since there are no skip-connections, prun- recover the accuracy by fine-tuning. We train a DenseNet-
ing away a whole layer will completely destroy the mod- 40 model withλ=10 5 on CIFAR-10 to show the effect of
els. Thus, besides setting the percentile threshold as 50%, pruning a varying percentage of channels. The results are
we also put a constraint that at each layer, at most 50% of summarized in Figure 5.
channels can be pruned. From Figure 5, it can be concluded that the classification
The test errors of models in each iteration are shown in performance of the pruned or fine-tuned models degrade
Table 4. As the pruning process goes, we obtain more and only when the pruning ratio surpasses a threshold. The fine-
[Figure 4 panels: histograms of channel scaling factor values (count vs. value in [0, 0.8]) for lambda = 0, lambda = 10^-5 and lambda = 10^-4.]
Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter lambda). With the increase of lambda, the scaling factors become sparser.
Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with lambda = 10^-5 (test error (%) vs. pruned channels (%), for the baseline, the model trained with sparsity, the pruned model, and the fine-tuned model).
Figure 6: Visualization of how the channel scaling factors change along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10 (channel index vs. epoch). Brighter color corresponds to larger value. The bright lines indicate the "selected" channels, the dark lines indicate channels that can be pruned.
tuning process can typically compensate the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on the channel scaling factors.
Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter lambda in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network with different lambda values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.
It can be observed that with the increase of lambda, the scaling factors are more and more concentrated near zero. When lambda = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When lambda = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as feature selection happening in intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process by a heatmap. Figure 6 shows the magnitude of the scaling factors from one layer in VGGNet along the training process. Each channel starts with equal weight; as the training progresses, the scaling factors of some channels become larger (brighter) while others become smaller (darker).
6. Conclusion
We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20x) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory and computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for efficient inference.
Acknowledgements. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008 / DFG TRR-169.
References
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network
architecture optimization through submodularity and super- [1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neu- modularity.arXiv preprint arXiv:1609.00074, 2016. ral network architectures using reinforcement learning. In [21] A. Krizhevsky and G. Hinton. Learning multiple layers of ICLR, 2017. features from tiny images. InTech Report, 2009. [2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetof sparsity in convolutional neural networks.arXiv preprint classification with deep convolutional neural networks. In arXiv:1702.06257, 2017. NIPS, pages 10971105, 2012. [3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Y. Chen. Compressing neural networks with the hashing Graf. Pruning filters for efficient convnets. arXiv preprint trick. InICML, 2015. arXiv:1608.08710, 2016.
[4] S. Chintala. Training an object classifier in torch-7 on [24] M. Lin, Q. Chen, and S. Yan. Network in network. InICLR,multiple gpus over imagenet. https://github.com/ 2014.soumith/imagenet-multiGPU.torch. [25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Sparse convolutional neural networks. InProceedings of the
matlab-like environment for machine learning. InBigLearn, IEEE Conference on Computer Vision and Pattern Recogni-
NIPS Workshop, number EPFL-CONF-192376, 2011. tion, pages 806814, 2015.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
neural networks with weights and activations constrained to+ networks for semantic segmentation. InCVPR, pages 3431
1 or-1.arXiv preprint arXiv:1602.02830, 2016. 3440, 2015.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer- [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y.
gus. Exploiting linear structure within convolutional net- Ng. Reading digits in natural images with unsupervised fea-
works for efficient evaluation. InNIPS, 2014. ture learning, 2011. InNIPS Workshop on Deep Learning
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- and Unsupervised Feature Learning, 2011.
ture hierarchies for accurate object detection and semantic [28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-
segmentation. InCVPR, pages 580587, 2014. net: Imagenet classification using binary convolutional neu-
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and ral networks. InECCV, 2016.
Y. Bengio. Maxout networks. InICML, 2013. [29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini.
[10] S. Gross and M. Wilber. Training and investigating residual Group sparse regularization for deep neural networks.arXiv
nets. https://github.com/szagoruyko/cifar. preprint arXiv:1607.00485, 2016.
torch. [30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Com- methods for l1 regularization: A comparative study and two
pressing deep neural network with pruning, trained quanti- new approaches. InECML, pages 286297, 2007.
zation and huffman coding. InICLR, 2016. [31] K. Simonyan and A. Zisserman. Very deep convolutional
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights networks for large-scale image recognition. InICLR, 2015.
and connections for efficient neural network. InNIPS, pages [32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse
11351143, 2015. neural networks.CoRR, abs/1611.06694, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into [33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the
rectifiers: Surpassing human-level performance on imagenet importance of initialization and momentum in deep learning.
classification. InICCV, 2015. InICML, 2013.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
for image recognition. InCVPR, 2016. D. Anguelov, D. Erhan, et al. Going deeper with convolu-
tions. InCVPR, pages 19, 2015.[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in [35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learningdeep residual networks. InECCV, pages 630645. Springer, structured sparsity in deep neural networks. InNIPS, 2016.2016. [36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and github.com/szagoruyko/cifar.torch.K. Q. Weinberger. Multi-scale dense convolutional networks
for efficient prediction. arXiv preprint arXiv:1703.09844, [37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards
2017. compact cnns. InECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with rein-[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. forcement learning. InICLR, 2017.Densely connected convolutional networks. InCVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger.
Deep networks with stochastic depth. InECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.

View File

@ -0,0 +1,933 @@
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING
MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom
Learning to Generalize
Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.
................................................ ◗
Introduction
Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output 1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in
the hope that this helps to reduce the errors on new data. for the case of realizable rules they are also independent
How well will the trained network be able to classify an in- of the specific algorithm, as long as the training examples
put that it has not seen before? This performance on new are perfectly learned. Because it is able to cover even bad
data defines the generalization ability of the network. This situations which are unfavorable for improvement of the
ability will be affected by the problem of realizability: The learning process, it is not surprising that this theory may
network may not be sufficiently complex to learn the rule in some cases provide too pessimistic results which are also
completely or there may be ambiguities in classification. too crude to reveal interesting behavior in the intermediate
Here, I concentrate on a second problem arising from the region of the learning curve.
fact that learning will mostly not be exhaustive and the in- In this article, I concentrate mainly on a different ap-
formation about the rule contained in the examples is not proach, which has its origin in statistical physics rather than
complete. Hence, the performance of a network may vary in mathematical statistics, and compare its results with the
from one training set to another. In order to treat the gen- worst-case results. This method aims at studying the typical
eralization ability in a quantitative way, a common model rather than the worst-case behavior and often enables the
assumes that all input patterns, those from the training set exact calculations of the entire learning curve for models of
and the new one on which the network is tested, have a pre- simple networks which have many parameters. Since both
assigned probability distribution (which characterizes the biological and artificial neural networks are composed of
feature that must be classified), and they are produced in- many elements, it is hoped that such an approach may ac-
dependently at random with the same probability distribu- tually reveal some relevant and interesting structures.
tion from the networks environment. Sometimes the prob- At first, it may seem surprising that a problem should
ability distribution used to extract the examples and the simplifywhenthenumberofitsconstituentsbecomeslarge.
classification of these examples is called the rule.The net- However, this phenomenon is well-known for macroscopic
works performance on novel data can now be quantified by physical systems such as gases or liquids which consist of
the so-called generalization error,which is the probability a huge number of molecules. Clearly, it is not possible to
of misclassifying the test input and can be measured by re- study the complete microscopic state of such a system,
peating the same learning experiment many times with dif- which is described by the rapidly fluctuating positions and
ferent data. velocities of all particles. On the other hand, macroscopic
Within such a probabilistic framework, neural networks quantities such as density, temperature, and pressure are
areoftenviewedasstatisticaladaptivemodelswhichshould usually collective properties influenced by all elements. For
give a likely explanation of the observed data. In this frame- such quantities, fluctuations are averaged out in the ther-
work, the learning process becomes mathematically related modynamic limit of a large number of particles and the col-
to a statistical estimation problem for optimal network pa- lective properties become, to some extent, independent of
rameters.Hence,mathematicalstatisticsseemstobeamost themicrostate.Similarly,thegeneralizationabilityofaneu-
appropriate candidate for studying a neural networks be- ral network is a collective property of all the network pa-
havior. In fact, various statistical approaches have been ap- rameters, and the techniques of statistical physics allow, at
plied to quantify the generalization performance. For ex- least for some simple but nontrivial models, for exact com-
ample, expressions for the generalization error have been putations in the thermodynamic limit. Before explaining
obtainedinthelimit,wherethenumberofexamplesislarge these ideas in detail, I provide a short description of feed-
compared to the number of couplings (Seung et al.,1992; forward neural networks.
Amari and Murata, 1993). In such a case, one can expect ................................................that learning is almost exhaustive, such that the statistical ◗
fluctuations of the parameters around their optimal values Artificial Neural Networks
are small. However, in practice the number of parameters is
often large so that the network can be flexible, and it is not Based on highly idealized models of brain function, artifi-
clear how many examples are needed for the asymptotic cial neural networks are built from simple elementary com-
theorytobecomevalid.Theasymptotictheorymayactually puting units, which are sometimes termed neurons after
miss interesting behavior of the so-called learning curve, their biological counterparts. Although hardware imple-
which displays the progress of generalization ability with mentations have become an important research topic, neu-
an increasing amount of training data. ral nets are still simulated mostly on standard computers.
A second important approach, which was introduced Each computing unit of a neural net has a single output and
into mathematical statistics in the 1970s by Vapnik and several ingoing connections which receive the outputs of
Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact other units. To every ingoing connection (labeled by the
bounds for the generalization error which are valid for any index i) a real number is assigned, the synaptic weight w,i
number of training examples. Moreover, they are entirely which is the basic adjustable parameter of the network. To
independent of the underlying distribution of inputs, and compute a units output, all incoming values x are multi- i
[Figure 1a diagram: inputs 0.6, -0.9, 0.8 reach the unit through synapses with weights 1.6, -1.4, -0.1; the weighted sum 1.6 x 0.6 + (-1.4) x (-0.9) + (-0.1) x 0.8 = 2.14 is passed through the activation function plotted below.]
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numeri-
cal values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs
reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which
the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and
step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.
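For concreteness, the computation in Fig. 1a can be written out directly. This is only an illustrative sketch: the signs of the example numbers are inferred from the printed result 2.14 (the minus signs were lost in extraction), and tanh stands in here for the sigmoidal curve.

import math

def unit_output(inputs, weights, activation="step"):
    """Weighted sum of the inputs followed by an activation function (Fig. 1a)."""
    h = sum(w * x for w, x in zip(weights, inputs))
    if activation == "step":          # hard +/-1 classification (green curve)
        return 1 if h >= 0 else -1
    if activation == "sigmoid":       # soft classification in (-1, 1) (red curve)
        return math.tanh(h)
    return h                          # linear output unit (yellow curve)

inputs = [0.6, -0.9, 0.8]             # values from Fig. 1a, signs inferred
weights = [1.6, -1.4, -0.1]
print(round(sum(w * x for w, x in zip(weights, inputs)), 2))   # 2.14
print(unit_output(inputs, weights, "step"))                    # 1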
plied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, sum_i w_i x_i, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between -1 and 1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.
The Perceptron
FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.
The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings w_i, and the output is simply

a = \sum_{i=1}^{N} w_i x_i    [1]

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines. To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights w_i, for i = 1, ..., N). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the
output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign, but we decrease them for the opposite sign. This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of a learning process in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classify correctly all of the examples (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant.
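A minimal NumPy sketch of Eq. [1] and the learning rule just described follows. The toy data, the step size eta and the cycle limit are illustrative choices, not taken from the article; the update w += eta * t * x increases the weights whose input sign agrees with the target and decreases the others, as in Rosenblatt's rule.

import numpy as np

def perceptron_output(w, x):
    """Eq. [1] followed by the step activation: the sign of the weighted sum."""
    return 1 if np.dot(w, x) >= 0 else -1

def rosenblatt_train(X, y, eta=0.1, max_cycles=1000):
    """Present the patterns cyclically; on every misclassification move each
    weight by a fixed amount in the direction of (target * input)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_cycles):
        errors = 0
        for x, t in zip(X, y):
            if perceptron_output(w, x) != t:
                w += eta * t * x
                errors += 1
        if errors == 0:        # all training examples classified correctly
            break
    return w

# Toy linearly separable data (hypothetical, generated by a random teacher)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
teacher = rng.normal(size=5)
y = np.sign(X @ teacher).astype(int)
w = rosenblatt_train(X, y)
print(np.mean([perceptron_output(w, x) == t for x, t in zip(X, y)]))  # 1.0 once converged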
It is often useful to obtain an intuition of a perceptrons xa 1
classification performance by thinking in terms of a geo-
metric picture. We may view the numerical values of the in-
puts as the coordinates of a point in some (usually) high-
dimensional space. The case of two dimensions is shown
in Fig. 2b. A corresponding point is also constructed for the
couplings w.The arrow which points from the origin of the i
coordinate system to this latter point is called the weight
vector or coupling vector. An application of linear algebra
tothecomputationofthenetworkshowsthatthelinewhich
is perpendicular to the coupling vector is the boundary be-
tween inputs belonging to the two different classes. Input
points which are on the same side as the coupling vector are
classified as 1 (the green region in Fig. 2b) and those on
the other side as 1 (red region in Fig. 2b).
Rosenblatts algorithm aims to determine such a line
when it is possible. This picture generalizes to higher di- direction of coupling vectorb
mensions, for which a hyperplane plays the same role of the FIGURE 3 (a) Projection of 200 random points (with ran-
line of the previous two-dimensional example. We can still dom labels) from a 200-dimensional space onto the first two
obtainanintuitivepicturebyprojectingontwo-dimensional coordinate axes (x and x). (b) Projection of the same points 1 2
planes. In Fig. 3a, 200 input patterns with random coordi- onto a plane which contains the coupling vector of a perfectly
nates (randomly labeled red and blue) in a 200-dimensional trained perceptron.
input space are projected on the plane spanned by two arbi-
trary coordinate axes. If we instead use a plane for projec-
tion which contains the coupling vector (determined from tions for small changes of the couplings). Hence, in general,
a variant of Rosenblatts algorithm) we obtain the view in addition to the perfectly learnable perceptron case in
shown in Fig. 3b, in which red and green points are clearly which the final error is zero, minimizing the training error
separated and there is even a gap between the two clouds. is usually a difficult task which could take a large amount of
It is evident that there are cases in which the two sets of computer time. However, in practice, iterative approaches,
points are too mixed and there is no line in two dimensions which are based on the minimization of other smooth cost
(or no hyperplane in higher dimensions which separates functions,areusedtotrainaneuralnetwork(Bishop,1995).
them). In these cases, the rule is too complex to be per- ................................................fectly learned by a perceptron. If this happens, we must at- ◗
tempt to determine the choice of the coupling which mini- Capacity, VC Dimension,
mizesthenumberoferrorsonagivensetofexamples.Here, and Worst-Case Generalization
Rosenblatts algorithm does not work and the problem of
finding the minimum is much more difficult from the algo- As previously shown, perceptrons are only able to realize a
rithmic point. The training error, which is the number of very restricted type of classification rules, the so-called lin-
errorsmadeonthetrainingset,isusuallyanonsmoothfunc- early separable ones. Hence, independently from the issue
tion of the network couplings (i.e., it may have large varia- of finding the best algorithm to learn the rule, one may ask
the following question: In how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning.

In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated, for any set of m input patterns, the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished:

Region in which m/N ≤ 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters).

Region in which m/N > 1: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), if the network is large enough almost all mappings can be learned. If the output labels for each of the m inputs are chosen randomly +1 or -1 with equal probability, the probability of finding a nonrealizable coupling goes to zero exponentially when N goes to infinity at fixed ratio m/N.

Region in which m/N > 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases to zero rapidly and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to exp[-Nf(m/N)], where the function f(α) vanishes for α ≤ 2 and it is positive for α > 2). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size.

FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red).

Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained to learn m examples on the training set?

To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single pattern for test. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5, assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2.

FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible.

This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the type of classifier. It equals N for the perceptron.
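The counting argument by Cover (1965) behind Fig. 4 can be reproduced in a few lines. The sketch below (Python/NumPy-free; not from the article, only an illustration of the well-known formula) evaluates the fraction C(m, N)/2^m of linearly realizable labelings, with C(m, N) = 2 Σ_{k<N} binom(m-1, k) for points in general position, and shows how the drop around m/N = 2 sharpens as N grows.

```python
# Minimal sketch of Cover's (1965) counting result: the fraction of all 2^m
# labelings of m points in general position that a perceptron with N couplings
# (and no bias) can realize.  Illustrative only; function name is made up.
from math import comb

def realizable_fraction(m: int, n: int) -> float:
    """C(m, N) / 2^m with C(m, N) = 2 * sum_{k < N} binom(m - 1, k)."""
    if m <= n:
        return 1.0
    c = 2 * sum(comb(m - 1, k) for k in range(n))
    return c / 2 ** m

for n in (10, 20, 100):
    row = [round(realizable_fraction(int(a * n), n), 3)
           for a in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
    print(f"N = {n:3d}:", row)   # the transition around m/N = 2 sharpens with N
```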
Vapnik and Chervonenkis were able to show that for any training set of size m larger than the VC dimension D_VC, the growth of the number of realizable mappings is bounded by an expression which grows much slower than 2^m (in fact, only like a polynomial in m). They proved that a large difference between training error (i.e., the minimum percentage of errors that is made on the training set) and generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above D_VC. This theorem implies a small expected generalization error when perfect learning of the training set results. The expected generalization error is bounded by a quantity which increases proportionally to D_VC and decreases (neglecting logarithmic corrections in m) inversely proportional to m. Conversely, one can construct a worst-case distribution of input patterns for which a size of the training set larger than D_VC is also necessary for good generalization. The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data.

The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error, as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates by the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which are obtained from an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules.

FIGURE 6 As the complexity of the network varies (i.e., of the number of hidden units, as shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik-Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability.

Typical Scenario: The Approach of Statistical Physics

When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation? As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher-student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern lies between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together, and it decreases to zero when both coincide.

FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively.
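The proportionality between the teacher-student angle and the generalization error is easy to check numerically. The following sketch is not from the article; it assumes a spherically symmetric (Gaussian) input distribution and a hypothetical student obtained by perturbing the teacher, and compares the analytic value ε = θ/π with a Monte Carlo estimate.

```python
# Sketch: for sign(w.x) classifiers and spherically symmetric inputs, the
# disagreement probability between teacher TE and student ST equals
# (angle between TE and ST) / pi.  Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N = 200
TE = rng.standard_normal(N)                 # teacher couplings
ST = TE + 0.7 * rng.standard_normal(N)      # a hypothetical imperfect student

cos = TE @ ST / (np.linalg.norm(TE) * np.linalg.norm(ST))
eps_analytic = np.arccos(cos) / np.pi

X = rng.standard_normal((20_000, N))        # random test patterns
eps_mc = np.mean(np.sign(X @ TE) != np.sign(X @ ST))

print(f"angle/pi = {eps_analytic:.3f}, Monte Carlo estimate = {eps_mc:.3f}")
```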
In the limit when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from each other, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case.

The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then V(ε) would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets. From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of their behavior on the training set, and multiplying it by the probability B that a student with generalization error ε gives m times the correct answers on independent drawings of the input patterns. Since A increases exponentially with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN. α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N → ∞, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms. The first one (which is often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε ≈ 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig. 8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of V(ε) at some value of ε which by definition is the typical generalization error.

FIGURE 8 Logarithm of the average volume of students that have learned m examples and give ε generalization error (green curve). The blue and red curves represent the energetic and entropic contributions, respectively. (The vertical axis shows (1/N) log V(ε).)

The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter term dominates when the number of examples is large. It will later be shown that such a competition can lead to a rich and interesting behavior as the number of examples is varied.
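This competition can be made concrete in the simplest (so-called annealed) approximation for a spherical perceptron, where the entropic term behaves like ln sin(πε) and the energetic term like α ln(1 − ε). Neither expression appears explicitly in this article, so the sketch below should be read only as an illustration of the qualitative picture of Fig. 8, not as the full quenched calculation.

```python
# Annealed sketch of (1/N) log V(eps) ~ entropic + energetic term for a
# spherical perceptron student.  Assumptions (not from the article):
# entropy ~ log sin(pi * eps), energy ~ alpha * log(1 - eps).
import numpy as np

eps = np.linspace(1e-3, 0.5, 500)

def log_volume(alpha: float) -> np.ndarray:
    entropic = np.log(np.sin(np.pi * eps))   # many dissimilar students near eps = 0.5
    energetic = alpha * np.log(1.0 - eps)    # favors students similar to the teacher
    return entropic + energetic

for alpha in (0.5, 2.0, 8.0):
    eps_typical = eps[np.argmax(log_volume(alpha))]
    print(f"alpha = {alpha:4.1f}: typical generalization error ~ {eps_typical:.3f}")
```

The maximizer of the sum plays the role of the typical generalization error and moves toward zero as α grows, which is exactly the competition described above.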
The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al., 1990) of a perceptron obtained by the statistical physics approach (treating the random sampling the proper way) is shown by the red curve of Fig. 9. In contrast to the worst-case predictions of the VC theory, it is possible to have some generalization ability below VC dimension or capacity. As we might have expected, the generalization error decreases monotonically, showing that the more that is learned, the more that is understood. Asymptotically, the error is proportional to N and inversely proportional to m, in agreement with the VC predictions. This may not be true for more complicated networks.

FIGURE 9 Learning curves for typical student perceptrons. α = m/N is the ratio between the number of examples and the coupling number. (The two curves are labeled "continuous couplings" and "discrete couplings.")

Query Learning

Soon after Gardner's pioneering work, it was realized that the approach of statistical physics is closely related to ideas in information theory and Bayesian statistics (Levin et al., 1989; Györgyi and Tishby, 1990; Opper and Haussler, 1991), for which the reduction of an initial uncertainty about the true state of a system (teacher) by observing data is a central topic of interest. The logarithm of the volume of relevant microstates as defined in the previous section is a direct measure for such uncertainty. The moderate progress in generalization ability displayed by the red learning curve of Fig. 9 can be understood by the fact that as learning progresses less information about the teacher is gained from a new random example. Here, the information gain is defined as the reduction of the uncertainty when a new example is learned. The decrease in information gain is due to the increase in the generalization performance. This is plausible because inputs for which the majority of student networks give the correct answer are less informative than those for which a mistake is more likely. The situation changes if the student is free to ask the teacher questions, i.e., if the student can choose highly informative input patterns. For the simple perceptron a fruitful query strategy is to select a new input vector which is perpendicular to the current coupling vector of the student (Kinzel and Ruján, 1990). Such an input is a highly ambiguous pattern because small changes in the student couplings produce different classification answers. For more complicated networks it may be difficult to obtain similar ambiguous inputs by an explicit construction. A general algorithm has been proposed (Seung et al., 1992a) which uses the principle of maximal disagreement in a committee of several students as a selection process for training patterns. Using an appropriate randomized training strategy, different students are generated which all learn the same set of examples. Next, any new input vector is only accepted for training when the disagreement of its classification between the students is maximal. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast.
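A toy version of the perpendicular query strategy is easy to write down. The sketch below is illustrative and not from the article: it trains a perceptron student with the classical perceptron rule, but on queries that have been projected orthogonally to the current student couplings, so that every query is maximally ambiguous for the student.

```python
# Illustrative sketch of query learning for a perceptron (not from the article).
# Each query is made perpendicular to the current student coupling vector.
import numpy as np

rng = np.random.default_rng(1)
N = 100
teacher = rng.standard_normal(N)

def perpendicular_query(w):
    x = rng.standard_normal(N)
    return x - (x @ w) / (w @ w) * w      # project out the student direction

w = rng.standard_normal(N)                # random initial student
for _ in range(500):
    x = perpendicular_query(w)            # maximally ambiguous pattern
    y = np.sign(teacher @ x)              # teacher's label for the query
    if np.sign(w @ x) != y:               # classical perceptron update rule
        w = w + y * x

angle = np.arccos(w @ teacher / (np.linalg.norm(w) * np.linalg.norm(teacher)))
print("generalization error after 500 queries ~", round(angle / np.pi, 3))
```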
Bad Students and Good Students

Although the typical student perceptron has a smooth, monotonically decreasing learning curve, the possibility that some concrete learning algorithm may result in a set of student couplings which are untypical in the sense of our theory cannot be ruled out. For bad students, even nonmonotonic generalization behavior is possible. The problem of a concrete learning algorithm can be made to fit into the statistical physics framework if the algorithm minimizes a certain cost function. Treating the achieved values of the new cost function as a macroscopic constraint, the tools of statistical physics apply again.

As an example, it is convenient to consider a case in which the teacher and the student have a different architecture: In one of the simplest examples one tries to learn a classification problem by interpreting it as a regression problem, i.e., a problem of fitting a continuous function through data points. To be specific, we study the situation in which the teacher network is still given by a perceptron which computes binary valued outputs of the form y = sign(Σ_i w_i x_i) = ±1, but as the student we choose a network with a linear transfer function (the yellow curve in Fig. 1a),

Y = Σ_i w_i x_i,

and try to fit this linear expression to the binary labels of the teacher. If the number of couplings is sufficiently large (larger than the number of examples), the linear function (unlike the sign) is perfectly able to fit arbitrary continuous output values. This linear fit is an attempt to explain the data in a more complicated way than necessary, and the couplings have to be finely tuned in order to achieve this goal. We find that the student trained in such a way does not generalize well (Opper and Kinzel, 1995). In order to compare the classifications of teacher and student on a new random input after training, we have finally converted the student's output into a classification label by taking the sign of its output. As shown in the red curve of Fig. 10, after an initial improvement of performance the generalization error increases again to the random guessing value ε = 0.5 at α = 1. This phenomenon is called overfitting. For α > 1 (i.e., for more data than parameters), it is no longer possible to have a perfect linear fit through the data, but a fit with a minimal deviation from a linear function leads to the second part of the learning curve. ε decreases again and approaches 0 asymptotically for α → ∞. This shows that when enough data are available, the details of the training algorithm are less important.

FIGURE 10 Learning curves for a linear student and for a margin classifier. α = m/N.

The dependence of the generalization performance on the complexity of the assumed data model is well known. If a function class is used that is too complex, data values can be perfectly fitted but the predicted function will be very sensitive to the variations of the data sample, leading to very unreliable predictions on novel inputs. On the other hand, functions that are too simple make the best fit almost insensitive to the data, which prevents us from learning enough from them.

It is also possible to calculate the worst-case generalization ability of perceptron students learning from a perceptron teacher. The largest generalization error is obtained (Fig. 7) when the angle between the coupling vectors of teacher and student is maximized under the constraint that the student learns all examples perfectly. Although it may not be easy to construct a learning algorithm which performs such a maximization in practice, the resulting generalization error can be calculated using the statistical physics approach (Engel and Van den Broeck, 1993). The result is in agreement with the VC theory: There is no prediction better than random guessing below the capacity.

Although the previous algorithms led to a behavior which is worse than the typical one, we now examine the opposite case of an algorithm which does better. Since the generalization ability of a neural network is related to the fact that similar input vectors are mapped onto the same output, one can assume that such a property can be enhanced if the separating gap between the two classes is maximized, which defines a new cost function for an algorithm. This optimal margin perceptron can be practically realized and when applied to a set of data leads to the projection of Fig. 11. As a remarkable result, it can be seen that there is a relatively large fraction of patterns which are located at the gap. These points are called support vectors (SVs). In order to understand their importance for the generalization ability, we make the following gedankenexperiment and assume that all the points which lie outside the gap (the nonsupport vectors) are eliminated from the training set of examples. From the two-dimensional projection of Fig. 11, we may conjecture that by running the maximal margin algorithm on the remaining examples (the SVs) we cannot create a larger gap between the points. Hence, the algorithm will converge to the same separating hyperplane as before. This intuitive picture is actually correct. If the SVs of a training set were known beforehand (unfortunately, they are only identified after running the algorithm), the margin classifier would have to be trained only on the SVs. It would automatically classify the rest of the training inputs correctly.

FIGURE 11 Learning with a margin classifier and m = 300 examples in an N = 150-dimensional space.
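The gedankenexperiment above can be played through numerically. The sketch below is illustrative: it uses scikit-learn's linear SVC with a large penalty as a stand-in for the optimal margin perceptron (an assumption about tooling, not part of the article), trains it once on all examples, retrains it on the support vectors only, and compares the two hyperplanes.

```python
# Illustrative check of the support-vector gedankenexperiment: retraining a
# (nearly) hard-margin linear classifier on its support vectors alone should
# recover essentially the same separating hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N, m = 150, 300
teacher = rng.standard_normal(N)
X = rng.standard_normal((m, N))
y = np.sign(X @ teacher)                       # labels from a teacher perceptron

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin
sv = clf.support_                              # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])

w1, w2 = clf.coef_.ravel(), clf_sv.coef_.ravel()
cos = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
print(f"{len(sv)} support vectors out of {m}; cosine between hyperplanes: {cos:.4f}")
```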
Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability.

The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinearly separable rules can be learned, providing an interesting alternative to multilayer networks.

The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and -1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression about the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher, so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures T we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples.
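A minimal version of such a stochastic training procedure for the Ising perceptron is sketched below. It is illustrative only: single-coupling-flip Metropolis dynamics on the training error, with all sizes, the temperature, and the number of sweeps chosen for readability rather than taken from the article.

```python
# Illustrative Metropolis training of an Ising perceptron (couplings in {-1, +1}).
# A proposed single-coupling flip is accepted if it lowers the number of training
# errors, and otherwise with probability exp(-(increase in errors) / T).
import numpy as np

rng = np.random.default_rng(0)
N, alpha, T = 101, 3.0, 0.5
m = int(alpha * N)

teacher = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(m, N))
y = np.sign(X @ teacher)

def train_errors(w):
    return int(np.sum(np.sign(X @ w) != y))

w = rng.choice([-1, 1], size=N)           # random initial student
err = train_errors(w)
for _ in range(50_000):
    j = rng.integers(N)
    w[j] = -w[j]                          # propose flipping one binary coupling
    new_err = train_errors(w)
    if new_err <= err or rng.random() < np.exp(-(new_err - err) / T):
        err = new_err                     # accept the flip
    else:
        w[j] = -w[j]                      # reject: undo the flip

overlap = (w @ teacher) / N
print(f"training errors: {err}/{m}, teacher overlap R = {overlap:.2f}")
```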
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with ε ≈ 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will always converge to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α (α4 > α3 > α2 > α1).

More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

FIGURE 13 A two-layer network with tree architecture. The arrow indicates the direction of propagation of the information.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units; that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs; that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction. Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter.

FIGURE 14 Learning curves for the parity machine with tree architecture. Each curve represents the generalization error ε as a function of α and is distinguished by the number of hidden units of the network.
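For concreteness, the two prewired output functions can be written down directly. The sketch below is illustrative and not code from the article; it assumes a generic tree layout in which each of K hidden units sees its own block of N/K inputs.

```python
# Illustrative tree committee machine and tree parity machine with K hidden
# units, each wired to its own disjoint block of N/K inputs (no adaptive
# hidden-to-output couplings).
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 150                                  # assumes N % K == 0, K odd
w = rng.standard_normal((K, N // K))           # first-layer couplings
x = rng.standard_normal(N)

blocks = x.reshape(K, N // K)                  # disjoint receptive fields (tree)
hidden = np.sign(np.sum(w * blocks, axis=1))   # +-1 states of the hidden units

committee_output = np.sign(np.sum(hidden))     # majority vote of the hidden units
parity_output = np.prod(hidden)                # parity of the +-1 hidden states
print(hidden, committee_output, parity_output)
```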
In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to -1 acts as exactly the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.

Outlook

The worst-case approach of the VC theory and the typical-case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior and an interpolation to the other extreme, the worst-case scenario, are important subjects of research.

Acknowledgments

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.
References Cited

Amari, S., and Murata, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
Barkai, E., Hansel, D., and Kanter, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
Carnevali, P., and Patarnello, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
Engel, A., and Van den Broeck, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
Gardner, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
Gardner, E., and Derrida, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
Györgyi, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
Györgyi, G., and Tishby, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
Hansel, D., Mato, G., and Meunier, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
Kinzel, W., and Ruján, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
Levin, E., Tishby, N., and Solla, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
Mézard, M., Parisi, G., and Virasoro, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
Monasson, R., and Zecchina, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
Opper, M., and Haussler, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
Opper, M., and Kinzel, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
Saad, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
Schwarze, H., and Hertz, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
Schwarze, H., and Hertz, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
Seung, H. S., Sompolinsky, H., and Tishby, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
Seung, H. S., Opper, M., and Sompolinsky, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
Sompolinsky, H., Tishby, N., and Seung, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
Urbanczik, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
Vallet, F., Cailton, J., and Refregier, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vapnik, V. N., and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

Arbib, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Hertz, J. A., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Minsky, M., and Papert, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
Watkin, T. L. H., Rau, A., and Biehl, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.

Binary file not shown.

BIN
Corpus/MOGRIFIER LSTM.txt Normal file

Binary file not shown.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,662 @@
Movement Pruning:
Adaptive Sparsity by Fine-Tuning
Victor Sanh 1 , Thomas Wolf 1 , Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co;arush@cornell.edu
arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract
Magnitude pruning is a widely used strategy for reducing model size in pure
supervised learning; however, it is less effective in the transfer learning regime that
has become standard for state-of-the-art natural language processing applications.
We propose the use ofmovement pruning, a simple, deterministic first-order weight
pruning method that is more adaptive to pretrained model fine-tuning. We give
mathematical foundations to the method and compare it to existing zeroth- and
first-order pruning methods. Experiments show that when pruning large pretrained
language models, movement pruning shows significant improvements in high-
sparsity regimes. When combined with distillation, the approach achieves minimal
accuracy loss with down to only 3% of the model parameters.
1 Introduction
Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art
performance in applications in natural language processing and related fields. In this setup, a large
model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to
perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and
dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these
large models, and training the models have high environmental costs [Strubell et al., 2019].
Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at
only a small cost of accuracy. Pruning methods, which remove weights based on their importance,
are a particularly simple and effective method for compressing models to be sent to edge devices such
as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high
absolute values, is the most widely used method for weight pruning. It has been applied to a large
variety of architectures in computer vision [Guo et al., 2016], in language processing [Gale et al.,
2019], and more recently has been leveraged as a core component in thelottery ticket hypothesis
[Frankle et al., 2019].
While magnitude pruning is highly effective for standard supervised learning, it is inherently less
useful in the transfer learning regime. In supervised learning, weight values are primarily determined
by the end-task training data. In transfer learning, weight values are mostly predetermined by the
original model and are only fine-tuned on the end task. This prevents these methods from learning to
prune based on the fine-tuning step, or “fine-pruning.”
In this work, we argue that to effectively reduce the size of models for transfer learning, one should
instead usemovement pruning, i.e., pruning approaches that consider the changes in weights during
fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and
high values can be pruned if they shrink during training. This strategy moves the selection criteria
from the 0th to the 1st-order and facilitates greater pruning based on the fine-tuning objective. To
test this approach, we introduce a particularly simple, deterministic version of movement pruning
utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019,
Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of
remaining weights), we observe significant improvements over magnitude pruning and other 1st-order
methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original
BERT performance with only 5% of the encoder's weights on natural language inference (MNLI)
[Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of
the differences between magnitude pruning and movement pruning shows that the two methods lead
to radically different pruned models with movement pruning showing greater ability to adapt to the
end-task.
2 Related Work
In addition to magnitude pruning, there are many other approaches for generic model weight pruning.
Most similar to our approach are methods for using parallel score matrices to augment the weight
matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied for convo-
lutional networks. Differing from our methods, these methods keep the weights of the model fixed
(either from a randomly initialized network or a pre-trained network) and the scores are updated to
find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights.
LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for
deletion. Our method does not require the (possibly costly) computation of second-order derivatives
since the importance scores are obtained simply as the by-product of the standard fine-tuning. Theis
et al. [2018] and Ding et al. [2019] use the absolute value or the square value of the gradient. In
contrast, we found it useful to preserve the direction of movement in our algorithm.
Compressing pretrained language models for transfer learning is also a popular area of study. Other
approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning
[Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model
and targets individual weight. We also show that having a teacher can further improve our approach.
Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train
sparse language models from scratch. This differs from our approach which focuses on the fine-tuning
stage. Finally, another popular compression approach is quantization. Quantization has been applied
to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014]
providing high memory compression rates at the cost of no or little performance. As shown in
previous works [Li et al., 2020, Han et al., 2016] quantization and pruning are complimentary and
can be combined to further improve the performance/size ratio.
3 Background: Score-Based Pruning
We first establish shared notation for discussing different neural network pruning strategies. Let W ∈ R^{n×n} refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores S ∈ R^{n×n}. Given importance scores, each pruning strategy computes a mask M ∈ {0,1}^{n×n}. Inference for an input x becomes a = (W ⊙ M)x, where ⊙ is the Hadamard product. A common strategy is to keep the top-v percent of weights by importance. We define Top_v as a function which selects the v% highest values in S:

    \text{Top}_v(S)_{i,j} = \begin{cases} 1, & S_{i,j} \text{ in top } v\% \\ 0, & \text{o.w.} \end{cases}    (1)

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores S = (|W_{i,j}|)_{1 \le i,j \le n}, and masks M = Top_v(S) (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.
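A score-based Top_v mask as in Eq (1) is only a few lines of tensor code. The sketch below (PyTorch; an illustration, not the authors' released implementation, with magnitude scores S = |W| chosen as an example) keeps the v% highest-scoring entries of a weight matrix.

```python
# Minimal sketch of score-based Top_v masking (Eq 1), here with magnitude scores.
import torch

def topv_mask(scores: torch.Tensor, v: float) -> torch.Tensor:
    """Binary mask keeping the v% highest-scoring entries (v in (0, 100])."""
    k = max(1, int(round(scores.numel() * v / 100.0)))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).to(scores.dtype)

W = torch.randn(768, 768)
S = W.abs()                      # magnitude pruning: scores are |W_ij|
M = topv_mask(S, v=15.0)         # keep 15% of the weights
x = torch.randn(768)
a = (W * M) @ x                  # inference with the pruned layer
print(f"density: {M.mean().item():.3f}")
```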
                   | Magnitude pruning | L0 regularization | Movement pruning | Soft movement pruning
Pruning Decision   | 0th order | 1st order | 1st order | 1st order
Masking Function   | Top_v | Continuous Hard-Concrete | Top_v | Thresholding
Pruning Structure  | Local or Global | Global | Local or Global | Global
Learning Objective | L | L + λ_{l0} E(L0) | L | L + λ_{mvp} R(S)
Gradient Form      | (none) | Gumbel-Softmax | Straight-Through | Straight-Through
Scores S           | |W_{i,j}| | -Σ_t (∂L/∂W_{i,j})^(t) W_{i,j}^(t) f(\bar{S}_{i,j}^(t)) | -Σ_t (∂L/∂W_{i,j})^(t) W_{i,j}^(t) | -Σ_t (∂L/∂W_{i,j})^(t) W_{i,j}^(t)

Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of f for L0 regularization is detailed in Eq (3).
In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level v during training using a cubic sparsity scheduler:

    v^{(t)} = v_f + (v_i - v_f)\left(1 - \frac{t - t_i}{n\Delta t}\right)^3.

The sparsity level at time step t, v^{(t)}, is increased from an initial value v_i (usually 0) to a final value v_f in n pruning steps after t_i steps of warm-up. The model is thus pruned and trained jointly.
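A sketch of the cubic scheduler (a hypothetical helper matching the formula above, not the authors' code) is shown below.

```python
# Cubic sparsity schedule: the pruned fraction goes from v_i to v_f in n pruning
# steps of size dt, after t_i warm-up steps.
def cubic_sparsity(t: int, v_i: float, v_f: float, t_i: int, n: int, dt: int) -> float:
    if t < t_i:
        return v_i
    if t >= t_i + n * dt:
        return v_f
    progress = (t - t_i) / (n * dt)
    return v_f + (v_i - v_f) * (1.0 - progress) ** 3

# e.g. ramp sparsity from 0% to 97% over 10 pruning steps of 1k updates,
# after 2k warm-up steps
for step in (0, 2_000, 5_000, 9_000, 12_000, 20_000):
    print(step, round(cubic_sparsity(step, 0.0, 0.97, 2_000, 10, 1_000), 3))
```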
4 Movement Pruning
Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running
model. In this work, we focus on movement pruning methods where importance is derived from
first-order information. Intuitively, instead of selecting weights that are far from zero, we retain
connections that are moving away from zero during the training process. We consider two versions of
movement pruning: hard and soft.
For (hard) movement pruning, masks are computed using the Top_v function: M = Top_v(S). Unlike magnitude pruning, during training we learn both the weights W and their importance scores S. During the forward pass, we compute for all i: a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k.

Since the gradient of Top_v is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, Top_v is ignored and the gradient goes "straight-through" to S. The approximation of the gradient of the loss L with respect to S_{i,j} is given by

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j    (2)
This implies that the scores of weights are updated, even if these weights are masked in the forward
pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
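A compact PyTorch sketch of this masking scheme is given below. It is an illustrative re-implementation under the straight-through assumption of Eq (2), not the authors' released code; the module and parameter names are made up.

```python
# Sketch of hard movement pruning: Top_v masking in the forward pass, with a
# straight-through gradient to the scores S in the backward pass (Eq 2).
import torch
import torch.nn as nn

class TopVStraightThrough(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, v):
        k = max(1, int(round(scores.numel() * v / 100.0)))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None       # gradient passes straight through to S

class MovementPrunedLinear(nn.Module):
    def __init__(self, in_features, out_features, v=15.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.scores = nn.Parameter(torch.zeros(out_features, in_features))
        self.v = v

    def forward(self, x):
        mask = TopVStraightThrough.apply(self.scores, self.v)
        return x @ (self.weight * mask).t()

layer = MovementPrunedLinear(768, 768)
out = layer(torch.randn(4, 768))
out.sum().backward()     # both weight.grad and scores.grad are now populated
print(layer.scores.grad.abs().sum().item() > 0)
```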
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter v with a fixed global threshold value τ that controls the binary mask. The mask is calculated as M = (S > τ). In order to control the sparsity level, we add a regularization term R(S) = λ_{mvp} \sum_{i,j} σ(S_{i,j}) which encourages the importance scores to decrease over time.^1 The coefficient λ_{mvp} controls the penalty intensity and thus the sparsity level.

Finally we note that these approaches yield an update similar to L0 regularization based pruning, another movement based pruning approach [Louizos et al., 2017]. Instead of straight-through, L0 uses the hard-concrete distribution, where the mask M is sampled for all i,j with hyperparameters b > 0, l < 0, and r > 1:

    u ∼ U(0,1)
    \bar{S}_{i,j} = σ((\log(u) − \log(1−u) + S_{i,j}) / b)
    Z_{i,j} = (r − l)\bar{S}_{i,j} + l
    M_{i,j} = \min(1, \mathrm{ReLU}(Z_{i,j}))

The expected L0 norm has a closed form involving the parameters of the hard-concrete: E(L0) = \sum_{i,j} σ(S_{i,j} − b \log(−l/r)). Thus, the weights and scores of the model can be optimized in

^1 We also experimented with \sum_{i,j} |S_{i,j}|, but it turned out to be harder to tune while giving similar results.
(a) Magnitude pruning (b) Movement pruning
Figure 1: During fine-tuning (on MNLI), the weights stay close to their pre-trained values which
limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are
plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning
selects weights that are moving away from 0.
an end-to-end fashion to minimize the sum of the training loss L and the expected L0 penalty. A coefficient λ_{l0} controls the L0 penalty and indirectly the sparsity level. Gradients take a similar form:

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j f(\bar{S}_{i,j}) \quad \text{where} \quad f(\bar{S}_{i,j}) = \frac{r - l}{b} \bar{S}_{i,j} (1 - \bar{S}_{i,j}) \mathbf{1}\{0 \le Z_{i,j} \le 1\}    (3)

At test time, a non-stochastic estimation of the mask is used: \hat{M} = \min(1, \mathrm{ReLU}((r − l)σ(S) + l)), and weights multiplied by 0 can simply be discarded.
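Soft movement pruning only changes the masking function and the objective. The sketch below is again illustrative, with made-up helper names; note that in the full method the task loss also reaches S through the straight-through trick shown earlier, while this fragment only exercises the thresholded mask and the sigmoid regularizer.

```python
# Sketch of soft movement pruning: a global threshold tau on the scores defines
# the mask, and a regularizer lambda_mvp * sum(sigmoid(S)) controls sparsity.
import torch

def soft_movement_mask(scores: torch.Tensor, tau: float) -> torch.Tensor:
    return (scores > tau).to(scores.dtype)

def soft_movement_regularizer(scores: torch.Tensor, lambda_mvp: float) -> torch.Tensor:
    return lambda_mvp * torch.sigmoid(scores).sum()

S = torch.randn(768, 768, requires_grad=True)
W = torch.randn(768, 768)
x = torch.randn(768)

mask = soft_movement_mask(S, tau=0.0)
task_loss = ((W * mask) @ x).pow(2).mean()     # stand-in for the fine-tuning loss
loss = task_loss + soft_movement_regularizer(S, lambda_mvp=1e-3)
loss.backward()        # here only the regularizer pushes the scores S down;
                       # the straight-through estimator would add the task signal
print(f"density: {mask.mean().item():.3f}")
```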
Table 1 highlights the characteristics of each pruning method. The main differences are in the masking
functions, pruning structure, and the final gradient form.
Method Interpretation  In movement pruning, the gradient of L with respect to W_{i,j} is given by the standard gradient derivation: \frac{\partial L}{\partial W_{i,j}} = \frac{\partial L}{\partial a_i} M_{i,j} x_j. By combining it with Eq (2), we have \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial W_{i,j}} W_{i,j} (we omit the binary mask term M_{i,j} for simplicity). From the gradient update in Eq (2), S_{i,j} is increasing when \frac{\partial L}{\partial S_{i,j}} < 0, which happens in two cases:

(a) \frac{\partial L}{\partial W_{i,j}} < 0 and W_{i,j} > 0

(b) \frac{\partial L}{\partial W_{i,j}} > 0 and W_{i,j} < 0

It means that during training W_{i,j} is increasing while being positive or is decreasing while being negative. It is equivalent to saying that S_{i,j} is increasing when W_{i,j} is moving away from 0. Inversely, S_{i,j} is decreasing when \frac{\partial L}{\partial S_{i,j}} > 0, which means that W_{i,j} is shrinking towards 0.

While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 (|W_{i,j}|), movement pruning selects the weights which are moving the most away from 0 (S_{i,j}). For this reason, magnitude pruning can be seen as a 0th order method, whereas movement pruning is based on a 1st order signal. In fact, S can be seen as an accumulator of movement: from equation (2), after T gradient updates, we have

    S_{i,j}^{(T)} = -\alpha_S \sum_{t<T} \left(\frac{\partial L}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}    (4)
Figure 1 shows this difference empirically by comparing weight values during fine-tuning against
their pre-trained value. As observed by Gordon et al. [2020], fine-tuned weights stay close in absolute
value to their initial pre-trained values. For magnitude pruning, this stability around the pre-trained
values implies that we know with high confidence before even fine-tuning which weights will be
pruned as the weights with the smallest absolute value at pre-training will likely stay small and be
pruned. In contrast, in movement pruning, the pre-trained weights do not have such an awareness of
the pruning decision since the selection is made during fine-tuning (moving away from 0), and both
low and high values can be pruned. We posit that this is critical for the success of the approach as it
is able to prune based on the task-specific data, not only the pre-trained value.
5 Experimental Setup
Transfer learning for NLP uses large pre-trained language models that are fine-tuned on target tasks
[Ruder et al., 2019, Devlin et al., 2019, Radford et al., 2019, Liu et al., 2019]. We experiment with task-
specific pruning ofBERT-base-uncased, a pre-trained model that contains roughly 84M parameters.
We freeze the embedding modules and fine-tune the transformer layers and the task-specific head.
We perform experiments on three monolingual (English) tasks, which are common benchmarks for
the recent progress in transfer learning for NLP: question answering (SQuAD v1.1) [Rajpurkar et al.,
2016], natural language inference (MNLI) [Williams et al., 2018], and sentence similarity (QQP)
[Iyer et al., 2017]. The datasets respectively contain 88K, 393K, and 364K training examples. SQuAD
is formulated as a span extraction task, MNLI and QQP are paired sentence classification tasks.
For a given task, we fine-tune the pre-trained model for the same number of updates (between 6
and 10 epochs) across pruning methods 2 . We follow Zhu and Gupta [2018] and use a cubic sparsity
scheduling for Magnitude Pruning (MaP), Movement Pruning (MvP), and Soft Movement Pruning
(SMvP). Adding a few steps of cool-down at the end of pruning empirically improves the performance
especially in high sparsity regimes. The schedule for v is:

    v^{(t)} = v_i                                                     for 0 ≤ t < t_i
    v^{(t)} = v_f + (v_i − v_f) (1 − (t − t_i)/(T − t_i − t_f))^3      for t_i ≤ t < T − t_f        (5)
    v^{(t)} = v_f                                                     otherwise

where t_f is the number of cool-down steps.
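A small helper reproducing this schedule numerically, under the reading of Eq. (5) above (argument names are ours, and the middle branch is assumed to interpolate over T − t_i − t_f steps):

def cubic_sparsity_schedule(step, v_initial, v_final, t_start, t_total, t_cooldown):
    """Cubic schedule for the kept-weights ratio v, with a final cool-down phase."""
    if step < t_start:
        return v_initial
    if step >= t_total - t_cooldown:
        return v_final
    # Cubic interpolation between v_initial and v_final.
    progress = (step - t_start) / (t_total - t_start - t_cooldown)
    return v_final + (v_initial - v_final) * (1.0 - progress) ** 3


# Example: keep 100% of the weights for 2k steps, decay to 10%, hold for the last 1k steps.
values = [cubic_sparsity_schedule(t, 1.0, 0.10, t_start=2000, t_total=10000, t_cooldown=1000)
          for t in range(0, 10001, 2000)]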
We compare our results against several state-of-the-art pruning baselines: Reweighted Proximal
Pruning (RPP) [Guo et al., 2019] combines re-weightedL1 minimization and Proximal Projection
[Parikh and Boyd, 2014] to perform unstructured pruning. LayerDrop [Fan et al., 2020a] leverages
structured dropout to prune models at test time. For RPP and LayerDrop, we report results from
authors. We also compare our method against the mini-BERT models, a collection of smaller BERT
models with varying hyper-parameters [Turc et al., 2019].
6 Results
Figure 2 displays the results for the main pruning methods at different levels of pruning on each
dataset. First, we observe the consistency of the comparison between magnitude and movement
pruning: at low sparsity (more than 70% of remaining weights), magnitude pruning outperforms
all methods with little or no loss with respect to the dense model whereas the performance of
movement pruning methods quickly decreases even for low sparsity levels. However, magnitude
pruning performs poorly with high sparsity, and the performance drops extremely quickly. In contrast,
first-order methods show strong performances with less than 15% of remaining weights.
Table 2 shows the specific model scores for different methods at high sparsity levels. Magnitude
pruning on SQuAD achieves 54.5 F1 with 3% of the weights compared to 73.6 F1 withL0 regular-
ization, 76.3 F1 for movement pruning, and 79.9 F1 with soft movement pruning. These experiments
indicate that in high sparsity regimes, importance scores derived from the movement accumulated
during fine-tuning induce significantly better pruned models compared to absolute values.
Next, we compare the difference in performance between first-order methods. We see that straight-
through based hard movement pruning (MvP) is comparable withL0 regularization (with a significant
gap in favor of movement pruning on QQP). Soft movement pruning (SMvP) consistently outperforms
2 Preliminary experiments showed that increasing the number of pruning steps tended to improve the end
performance
Figure 2: Comparisons between different pruning methods in high sparsity regimes. Soft movement
pruning consistently outperforms other methods in high sparsity regimes. We plot the
performance of the standard fine-tuned BERT along with 95% of its performance.
Table 2: Performance at high sparsity levels. (Soft) movement pruning outperforms current
state-of-the-art pruning methods at different high sparsity levels.

Task (metric)             BERT base fine-tuned   Remaining Weights (%)   MaP         L0 Regu     MvP         Soft MvP
SQuAD - Dev (EM/F1)       80.4/88.1              10%                     67.7/78.5   69.9/80.1   71.9/81.7   71.3/81.5
                                                 3%                      40.1/54.5   61.6/73.6   65.2/76.3   69.6/79.9
MNLI - Dev (acc/MM acc)   84.5/84.9              10%                     77.8/79.0   77.9/78.5   79.3/79.5   80.7/81.2
                                                 3%                      68.9/69.8   75.2/75.6   76.1/76.7   79.0/79.7
QQP - Dev (acc/F1)        91.4/88.4              10%                     78.8/75.1   87.6/81.9   89.1/85.5   90.2/86.8
                                                 3%                      72.1/58.4   86.5/81.1   85.6/81.0   89.2/85.5
hard movement pruning and L0 regularization by a strong margin and yields the strongest performance
among all pruning methods in high sparsity regimes. These comparisons support the fact that even
though movement pruning (and its relaxed version, soft movement pruning) is simpler than L0 regularization,
it still yields stronger performance for the same compute budget.
Finally, movement pruning and soft movement pruning compare favorably to the other baselines, ex-
cept for QQP where RPP is on par with soft movement pruning. Movement pruning also outperforms
the fine-tuned mini-BERT models. This is coherent with [Li et al., 2020]: it is both more efficient and
more effective to train a large model and compress it afterward than training a smaller model from
scratch. We do note though that current hardware does not support optimized inference for sparse
models: from an inference speed perspective, it might often be desirable to use a small dense model
such as mini-BERT over a sparse alternative of the same size.
Distillation further boosts performance. Following previous work, we can further leverage knowledge
distillation [Bucila et al., 2006, Hinton et al., 2014] to boost performance for free in the pruned
domain [Jiao et al., 2019, Sanh et al., 2019], using our baseline fine-tuned BERT-base model as
teacher. The training objective is a linear combination of the training loss and a knowledge distillation
loss on the output distributions. Figure 3 shows the results on SQuAD, MNLI, and QQP for the three
pruning methods boosted with distillation. Overall, we observe that the relative comparisons of the
pruning methods remain unchanged while the performances are strictly increased. Table 3 shows for
instance that on SQuAD, movement pruning at 10% goes from 81.7 F1 to 84.3 F1. When combined
with distillation, soft movement pruning yields the strongest performances across all pruning methods
and studied datasets: it reaches 95% of BERT-base with only a fraction of the weights in the encoder
(5% on SQuAD and MNLI).

Figure 3: Comparisons between different pruning methods augmented with distillation. Distillation
improves the performance across all pruning methods and sparsity levels.

Table 3: Distillation-augmented performances for selected high sparsity levels. All pruning methods
benefit from the distillation signal, further enhancing the performance vs. model size trade-off.

Task (metric)             BERT base fine-tuned   Remaining Weights (%)   MaP         L0 Regu     MvP         Soft MvP
SQuAD - Dev (EM/F1)       80.4/88.1              10%                     70.2/80.1   72.4/81.9   75.6/84.3   76.6/84.9
                                                 3%                      45.5/59.6   65.5/75.9   67.5/78.0   72.9/82.4
MNLI - Dev (acc/MM acc)   84.5/84.9              10%                     78.3/79.3   78.7/79.8   80.1/80.4   81.2/81.8
                                                 3%                      69.4/70.6   76.2/76.5   76.5/77.4   79.6/80.2
QQP - Dev (acc/F1)        91.4/88.4              10%                     79.8/65.0   88.1/82.8   89.7/86.2   90.5/87.1
                                                 3%                      72.4/57.8   87.1/82.0   86.1/81.5   89.3/85.6

Figure 4: (a) Distribution of remaining weights. (b) Scores and weights learned by movement pruning.
Magnitude pruning and movement pruning lead to pruned models with radically different weight distributions.
7 Analysis
Movement pruning is adaptive. Figure 4a compares the distribution of the remaining weights for
the same matrix of a model pruned at the same sparsity using magnitude and movement pruning. We
observe that by definition, magnitude pruning removes all the weights that are close to zero, ending
up with two clusters. In contrast, movement pruning leads to a smoother distribution, which covers
the whole interval except for values close to 0.
Figure 4b displays each individual weight against its associated importance score in movement
pruning. We plot pruned weights in grey. We observe that movement pruning induces no simple
relationship between the scores and the weights. Both weights with high absolute value or low
absolute value can be considered important. However, high scores are systematically associated with
non-zero weights (and thus the “v-shape”). This is coherent with the interpretation we gave to the
scores (section 4): a high scoreSindicates that during fine-tuning, the associated weight moved away
from 0 and is thus non-null.
Local and global masks perform similarly. We study the influence of the locality of the pruning
decision. While local Top-v selects the v% most important weights matrix by matrix, global Top-v
uncovers non-uniform sparsity patterns in the network by selecting the v% most important weights in
the whole network.

Figure 5: Comparison of local and global selections of weights on SQuAD at different sparsity levels.
For magnitude and movement pruning, local and global Top-v perform similarly at all levels of sparsity.

Figure 6: Remaining weights per layer in the Transformer. Global magnitude pruning tends to prune
layers uniformly. Global 1st order methods allocate the weight to the lower layers while heavily
pruning the highest layers.

Previous work has shown that a non-uniform sparsity across layers is crucial to
the performance in high sparsity regimes [He et al., 2018]. In particular, Mallya and Lazebnik [2018]
found that the sparsity tends to increase with the depth of the network layer.
Figure 5 compares the performance of local selection (matrix by matrix) against global selection
(all the matrices) for magnitude pruning and movement pruning. Despite being able to find a
global sparsity structure, we found that global did not significantly outperform local, except in high
sparsity regimes (2.3 F1 points of difference with 3% of remaining weights for movement pruning).
Even though the distillation signal boosts the performance of pruned models, the end performance
difference between local and global selections remains marginal.
Figure 6 shows the remaining weights percentage obtained per layer when the model is pruned until
10% with global pruning methods. Global weight pruning is able to allocate sparsity non-uniformly
through the network, and it has been shown to be crucial for the performance in high sparsity regimes
[He et al., 2018]. We notice that except for global magnitude pruning, all the global pruning methods
tend to allocate a significant part of the weights to the lowest layers while heavily pruning in the
highest layers. Global magnitude pruning tends to prune similarly to local magnitude pruning, i.e.,
uniformly across layers.
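The distinction between local and global selection boils down to where the Top-v threshold is computed. A short illustrative sketch (function names are ours, not from the paper):

import torch


def local_topv_masks(score_matrices, keep_ratio):
    """Local Top-v: keep the top `keep_ratio` fraction of scores inside each matrix."""
    masks = []
    for s in score_matrices:
        k = max(1, int(keep_ratio * s.numel()))
        threshold = torch.topk(s.flatten(), k).values.min()
        masks.append((s >= threshold).float())
    return masks


def global_topv_masks(score_matrices, keep_ratio):
    """Global Top-v: one threshold over all matrices, so per-matrix sparsity can differ."""
    all_scores = torch.cat([s.flatten() for s in score_matrices])
    k = max(1, int(keep_ratio * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    return [(s >= threshold).float() for s in score_matrices]


scores = [torch.randn(64, 64), torch.randn(64, 256)]
local = local_topv_masks(scores, keep_ratio=0.10)     # ~10% kept in every matrix
global_ = global_topv_masks(scores, keep_ratio=0.10)  # 10% kept overall, unevenly distributed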
8 Conclusion
We consider the case of pruning of pretrained models for task-specific fine-tuning and compare
zeroth- and first-order pruning methods. We show that a simple method for weight pruning based on
straight-through gradients is effective for this task and that it adapts using a first-order importance
score. We apply this movement pruning to a transformer-based architecture and empirically show that
our method consistently yields strong improvements over existing methods in high-sparsity regimes.
The analysis demonstrates how this approach adapts to the fine-tuning regime in a way that magnitude
pruning cannot. In future work, it would also be interesting to leverage group-sparsity inducing
penalties [Bach et al., 2011] to remove entire columns or filters. In this setup, we would associate a
score to a group of weights (a column or a row for instance). In the transformer architecture, it would
give a systematic way to perform feature selection and remove entire columns of the embedding
matrix.
References
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer.ArXiv, abs/1910.10683, 2019.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep
learning in nlp. InACL, 2019.
Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for
efficient neural network. InNIPS, 2015.
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. InICLR, 2016.
Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. InNIPS,
2016.
Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks.ArXiv,
abs/1902.09574, 2019.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket
hypothesis at scale.ArXiv, abs/1903.01611, 2019.
Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients
through stochastic neurons for conditional computation.ArXiv, abs/1308.3432, 2013.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. InNAACL, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. InNIPS, 2017.
Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through
l0 regularization. InICLR, 2017.
Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for
sentence understanding through inference. InNAACL, 2018.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for
machine comprehension of text. InEMNLP, 2016.
Arun Mallya and Svetlana Lazebnik. Piggyback: Adding multiple tasks to a single, fixed network by
learning to mask.ArXiv, abs/1801.06519, 2018.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari.
What's hidden in a randomly weighted neural network? In CVPR, 2020.
Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. InNIPS, 1989.
Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon: Extensions and
performance comparisons. InNIPS, 1993.
Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with
dense networks and fisher pruning.ArXiv, abs/1801.05787, 2018.
Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Ji Liu, and Jungong Han. Global sparse
momentum sgd for pruning very deep neural networks. InNeurIPS, 2019.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
bert: smaller, faster, cheaper and lighter. InNeurIPS EMC2 Workshop, 2019.
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling
task-specific knowledge from bert into simple neural networks.ArXiv, abs/1903.12136, 2019.
Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with
structured dropout. InICLR, 2020a.
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? InNeurIPS,
2019.
Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and
multiple languages: lottery tickets in rl and nlp. InICLR, 2020.
Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou,
and Armand Joulin. Training with quantization noise for extreme model compression.ArXiv,
abs/2004.07320, 2020b.
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert.ArXiv,
abs/1910.06188, 2019.
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional
networks using vector quantization.ArXiv, abs/1412.6115, 2014.
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gon-
zalez. Train large, then compress: Rethinking model size for efficient training and inference of
transformers.ArXiv, abs/2002.11794, 2020.
Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model
compression. InICLR, 2018.
Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Compressing bert: Studying the effects of
weight pruning on transfer learning.ArXiv, abs/2002.08307, 2020.
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in
natural language processing. InNAACL, 2019.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach.ArXiv, abs/1907.11692, 2019.
Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First quora dataset release: Question pairs, 2017.
URLhttps://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lian Lin, and Yanzhi Wang. Reweighted proximal
pruning for large-scale language representation.ArXiv, abs/1909.12486, 2019.
Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1:127–239, 2014.
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better:
The impact of student initialization on knowledge distillation.ArXiv, abs/1908.08962, 2019.
Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. InKDD, 2006.
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
InNIPS, 2014.
Xiaoqi Jiao, Y. Yin, Lifeng Shang, Xin Jiang, Xusong Chen, Linlin Li, Fang Wang, and Qun Liu.
Tinybert: Distilling bert for natural language understanding.ArXiv, abs/1909.10351, 2019.
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model
compression and acceleration on mobile devices. InECCV, 2018.
Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity
through convex optimization.Statistical Science, 27, 09 2011. doi: 10.1214/12-STS394.
A Appendices
A.1 Guarantees on the decrease of the training loss
As the scores are updated, the relative order of the importances is likely shuffled, and some connections
will be replaced by more important ones. Under certain conditions, we are able to formally prove that
as these replacements happen, the training loss is guaranteed to decrease. Our proof is adapted from
[Ramanujan et al., 2020] to consider the case of a fine-tunable W.
We suppose that (a) the training loss L is smooth and admits a first-order Taylor development
everywhere it is defined and (b) the learning rate of W (α_W > 0) is small. We define the TopK
function as the analog of the Top-v function, where k is an integer instead of a proportion. We first
consider the case where k = 1 in the TopK masking, meaning that only one connection is remaining
(and the other weights are deactivated/masked). Let's denote W_{i,j} this sole remaining connection at
step t. Following Eq (1), it means that ∀ 1 ≤ u,v ≤ n, S_{u,v}^{(t)} ≤ S_{i,j}^{(t)}.
We suppose that at step t+1, connections are swapped and the only remaining connection at step
t+1 is (k,l). We have:

    At t:     ∀ 1 ≤ u,v ≤ n,  S_{u,v}^{(t)} ≤ S_{i,j}^{(t)}
    At t+1:   ∀ 1 ≤ u,v ≤ n,  S_{u,v}^{(t+1)} ≤ S_{k,l}^{(t+1)}        (6)

Eq (6) gives the following inequality: S_{k,l}^{(t+1)} − S_{k,l}^{(t)} ≥ S_{i,j}^{(t+1)} − S_{i,j}^{(t)}. After re-injecting the gradient
update in Eq (2), we have:

    −α_S (∂L/∂a_k) W_{k,l}^{(t)} x_l ≥ −α_S (∂L/∂a_i) W_{i,j}^{(t)} x_j        (7)
Moreover, the conditions in Eq (6) lead to the following inferences: a_i^{(t)} = W_{i,j}^{(t)} x_j and
a_k^{(t+1)} = W_{k,l}^{(t+1)} x_l.
Since α_W is small, ||(a_i^{(t+1)}, a_k^{(t+1)}) − (a_i^{(t)}, a_k^{(t)})||_2 is also small. Because the training loss L is
smooth, we can write the 1st order Taylor development of L in point (a_i^{(t)}, a_k^{(t)}):

    L(a_i^{(t+1)}, a_k^{(t+1)}) − L(a_i^{(t)}, a_k^{(t)})
      ≈ (∂L/∂a_k)(a_k^{(t+1)} − a_k^{(t)}) + (∂L/∂a_i)(a_i^{(t+1)} − a_i^{(t)})
      = (∂L/∂a_k) W_{k,l}^{(t+1)} x_l − (∂L/∂a_i) W_{i,j}^{(t)} x_j
      = ((∂L/∂a_k) W_{k,l}^{(t+1)} x_l − (∂L/∂a_k) W_{k,l}^{(t)} x_l) + ((∂L/∂a_k) W_{k,l}^{(t)} x_l − (∂L/∂a_i) W_{i,j}^{(t)} x_j)        (8)
      = (∂L/∂a_k) x_l (W_{k,l}^{(t+1)} − W_{k,l}^{(t)}) + ((∂L/∂a_k) W_{k,l}^{(t)} x_l − (∂L/∂a_i) W_{i,j}^{(t)} x_j)
The first term is null because of inequalities (6): connection (k,l) is masked at step t, so its weight
receives no gradient and W_{k,l}^{(t+1)} = W_{k,l}^{(t)}. The second term is negative because of inequality
(7). Thus L(a_i^{(t+1)}, a_k^{(t+1)}) ≤ L(a_i^{(t)}, a_k^{(t)}): when connection (k,l) becomes more important than
(i,j), the connections are swapped and the training loss decreases between steps t and t+1.
Similarly, we can generalize the proof to a set E = {(a_i, b_i), (c_i, d_i), i ≤ N} of N swapping
connections.
We note that this proof is not specific to the TopK masking function. In fact, we can extend the proof
using the Threshold masking function M := (S ≥ τ) [Mallya and Lazebnik, 2018]. Inequalities
(6) are still valid and the proof stays unchanged.
Last, we note that these guarantees do not hold if we consider the absolute value of the scores |S_{i,j}| (as
is done in Ding et al. [2019] for instance). We prove it by contradiction. If it were the case, it
would also be true in one specific case: the negative threshold masking function (M := (S < τ) where
τ < 0).
We suppose that at step t+1, the only remaining connection (i,j) is replaced by (k,l):

    At t:     ∀ 1 ≤ u,v ≤ n,  S_{i,j}^{(t)} ≤ S_{u,v}^{(t)}
    At t+1:   ∀ 1 ≤ u,v ≤ n,  S_{k,l}^{(t+1)} ≤ S_{u,v}^{(t+1)}        (9)
The inequality on the gradient update becomes: −α_S (∂L/∂a_k) W_{k,l}^{(t)} x_l < −α_S (∂L/∂a_i) W_{i,j}^{(t)} x_j, and
following the same development as in Eq (8), we have L(a_i^{(t+1)}, a_k^{(t+1)}) − L(a_i^{(t)}, a_k^{(t)}) ≥ 0: the loss increases.
We proved by contradiction that the guarantees on the decrease of the loss do not hold if we consider
the absolute value of the score as a proxy for importance.

View File

@ -0,0 +1,150 @@
Network Pruning
As one of the earliest works in network pruning, Yann LeCun's Optimal Brain
Damage (OBD) paper has been cited in many of the papers.
Some research focuses on module network designs. "These models, such as
SqueezeNet , MobileNet  and Shufflenet, are basically made up of low resolutions
convolution with lesser parameters and better performance."
Many recent papers I've read emphasize structured pruning (or sparsifying) as a
compression and regularization method, as opposed to other techniques such as
non-structured pruning (weight sparsifying and connection pruning), low rank
approximation and vector quantization (references to these approaches can be
found in the related work sections of the following papers). 
Difference between structured and non-structured pruning:
"Non-structured pruning aims to remove single parameters that have little
influence on the accuracy of networks". For example, L1-norm regularization on
weights is noted as non-structured pruning, since it's basically a weight
sparsifying method, i.e. it removes single parameters.
The term 'structure' refers to a structured unit in the network. So instead of
pruning individual weights or connections, structured pruning targets neurons,
filters, channels, layers etc. But the general implementation idea is the same as
penalizing individual weights: introducing a regularization term (mostly in the
form of L1-norm) to the loss function to penalize (sparsify) structures.
I focused on structured pruning and read through the following papers:
1. Structured Pruning of Convolutional Neural Networks via L1
Regularization (August 2019)
"(...) network pruning is useful to remove redundant parameters, filters,
channels or neurons, and address the over-fitting issue."
Provides a good review of previous work on non-structured and structured
pruning.
"This study presents a scheme to prune filters or neurons of fully-connected
layers based on L1 regularization to zero out the weights of some filters or
neurons."
Didn't quite understand the method and implementation. There are two key
elements: mask and threshold. "(...) the problem of zeroing out the values of
some filters can be transformed to zero some mask." || "Though the proposed
method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be
completely zeroed in practical application, because the objective function (7) is
non-convex and the global optimal solution may not be obtained. A strategy is
adopted in the proposed method to solve this problem. If the order of
magnitude of the mask value is small enough, it can be considered almost as
zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...)
The average value of the product of the mask and the weight is used to
determine whether the mask is exactly zero or not."
From what I understand they use L1 norm in the loss function to penalize
useless filters through penalizing masks. And a threshold value is introduced
to determine when the mask is small enough to be considered zero. 
They test on MNIST (models: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet-
32)
2. Learning Efficient Convolutional Networks through Network Slimming (August
2017) + Git repo
"Our approach imposes L1 regular- ization on the scaling factors in batch
normalization (BN) layers, thus it is easy to implement without introducing any
change to existing CNN architectures. Pushing the values of BN scaling factors
towards zero with L1 regularization enables us to identify insignificant channels
(or neurons), as each scaling factor corresponds to a specific con- volutional
channel (or a neuron in a fully-connected layer)."
They provide a good insight on advantages and disadvantages of other
computation reduction methods such as low rank approximation, vector
quantization etc. 
I believe here they use the word 'channel' to refer to filters (?).
"Our idea is introducing a scaling factor γ for each channel, which is multiplied
to the output of that channel. Then we jointly train the network weights and
these scaling factors, with sparsity regularization imposed on the latter. Finally
we prune those channels with small factors, and fine-tune the pruned network.
" --> so instead of 'mask' they use the 'scaling factor' and impose regularization
on that, but the idea is very similar.
"The way BN normalizes the activations motivates us to design a simple and
efficient method to incorporates the channel-wise scaling factors. Particularly,
BN layer normalizes the internal activa- tions using mini-batch statistics." || "
(...) we can directly leverage the γ parameters in BN layers as the scaling factors
we need for network slim- ming. It has the great advantage of introducing no
overhead to the network." They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40),
ImageNet (model: VGG-A) and MNIST (model: Lenet)
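A toy PyTorch sketch of how I understand the recipe (my own illustration, not the authors' code): add an L1 penalty on every BatchNorm scaling factor (gamma) to the task loss, then threshold the small factors after training to decide which channels to prune.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
)

def bn_scaling_l1(model, strength=1e-4):
    # L1 penalty on the BatchNorm scaling factors (gamma); pushing a gamma to zero
    # effectively switches off the corresponding channel.
    return strength * sum(m.weight.abs().sum()
                          for m in model.modules() if isinstance(m, nn.BatchNorm2d))

x = torch.randn(2, 3, 8, 8)
task_loss = model(x).mean()                 # stand-in for the real task loss
loss = task_loss + bn_scaling_l1(model)
loss.backward()

# After training, channels whose |gamma| falls below a threshold become pruning candidates.
with torch.no_grad():
    gammas = torch.cat([m.weight.abs() for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, 0.5)          # e.g. target pruning 50% of the channels
    num_prunable = int((gammas < threshold).sum())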
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo
" (...) we propose Structured Sparsity Learning (SSL) method to directly learn a
compressed structure of deep CNNs by group Lasso regularization during the
training. SSL is a generic regularization to adaptively adjust multiple structures
in DNN, including structures of filters, channels, and filter shapes within each
layer, and structure of depth beyond the layers." || " (...) offering not only well-
regularized big models with improved accuracy but greatly accelerated
computation."
 "Here W represents the collection of all weights in the DNN; ED(W) is the loss
on data; R(·) is non-structured regularization applying on every weight, e.g., L2-
norm; and Rg(·) is the structured sparsity regularization on each layer. Because
Group Lasso can effectively zero out all weights in some groups [14][15], we
adopt it in our SSL. The regularization of group Lasso on a set of weights w can
be represented as R_g(w) = Σ_{g=1}^{G} ||w^(g)||_2 (the sum of the Euclidean norms of the weight
groups), where w(g) is a group of partial weights in w and G is the total number of
groups. " || "In SSL, the learned “structure” is decided by the way of splitting
groups of w(g). We investigate and formulate the filter-wise, channel-wise,
shape-wise, and depth-wise structured sparsity (...)"
They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-
20) and ImageNet (model:AlexNet)
The authors also provide a visualization of filters after pruning, showing that
only important detectors of patterns remain after pruning.
In conclusions: "Moreover, a variant of SSL can be performed as structure
regularization to improve classification accuracy of state-of-the-art DNNs."
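A minimal sketch of a filter-wise group Lasso term as I understand it from the paper (my own toy code; the layer, input, and penalty strength are made up):

import torch
import torch.nn as nn

def filterwise_group_lasso(conv, strength=1e-4):
    # Group Lasso over output filters: sum of the L2 norms of each filter's weights.
    # Driving a whole group's norm to zero removes that filter, which is the
    # "structured" part of structured sparsity.
    group_norms = conv.weight.flatten(start_dim=1).norm(p=2, dim=1)   # one norm per output filter
    return strength * group_norms.sum()

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(4, 16, 8, 8)
data_loss = conv(x).pow(2).mean()                  # stand-in for E_D(W)
loss = data_loss + filterwise_group_lasso(conv)    # L = E_D(W) + R_g(W)
loss.backward()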
4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)
"After an initial training phase, we remove all connections whose weight is
lower than a threshold. This pruning converts a dense, fully-connected layer to
a sparse layer." || "We then retrain the sparse network so the remaining
connections can compensate for the connections that have been removed. The
phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network
connectivity in addition to the weights (...)"
Although the description above implies the pruning was done only for FC
layers, they also do pruning on convolutional layers - although they don't
provide much detail on this in the methods. But there's this statement when
they explain retraining: "(...) we fix the parameters for CONV layers and only
retrain the FC layers after pruning the FC layers, and vice versa.". The results
section also shows that convolutional layer connections were also
pruned on the tested models.
They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and
ImageNet (models: AlexNet, VGG-16)
The authors provide a visualization of the sparsity patterns of neurons after
pruning (for an FC layer) which shows that pruning can detect visual attention
regions.
The method used in this paper targets individual parameters (weights) to
prune. So, technically this should be considered as a non-structured pruning
method. However, the reason I think this is referenced as a structured pruning
method is that if all connections of a neuron are pruned (i.e. all input and output
weights were below the threshold), the neuron itself will be removed from the
network:  "After pruning connections, neurons with zero input connections or
zero output connections may be safely pruned."
SIDENOTE: They touch on the use of global average pooling instead of fully
connected layers in CNNs: "There have been other attempts to reduce the
number of parameters of neural networks by replacing the fully connected
layer with global average pooling."
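A toy sketch of that two-level effect (my own illustration, with made-up sizes and threshold): prune individual weights below a magnitude threshold, then check which neurons end up with no surviving input or output connections.

import torch

w_in = torch.randn(100, 50)     # weights into a hidden layer of 100 units
w_out = torch.randn(10, 100)    # weights out of that hidden layer
threshold = 0.8

mask_in = (w_in.abs() >= threshold).float()
mask_out = (w_out.abs() >= threshold).float()
w_in_pruned, w_out_pruned = w_in * mask_in, w_out * mask_out

# A hidden unit with no surviving input or no surviving output connection can be removed.
dead = (mask_in.sum(dim=1) == 0) | (mask_out.sum(dim=0) == 0)
print(f"{int(dead.sum())} of 100 hidden units can be safely removed")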
5. Many more can be picked from the references of these papers.
There's a paper on Bayesian compression for Deep Learning from 2017. Their
hypothesis is: "By employing sparsity inducing priors for hidden units (and not
individual weights) we can prune neurons including all their ingoing and outgoing
weights." However, the method is mathematically heavy and the related work
references are quite old (1990s, 2000s).

File diff suppressed because it is too large Load Diff

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,535 @@
The 4 Research Techniques to Train Deep Neural Network Models More Efficiently
James Le · Oct 29, 2019 · 9 min read
Deep learning and unsupervised feature learning have shown
great promise in many practical applications. State-of-the-art
performance has been reported in several domains, ranging
from speech recognition and image recognition to text
processing and beyond.
It's also been observed that increasing the scale of deep
learning—with respect to numbers of training examples, model
parameters, or both—can drastically improve accuracy. These
results have led to a surge of interest in scaling up the training
and inference algorithms used for these models and in
improving optimization techniques for both.
The use of GPUs is a significant advance in recent years that
makes the training of modestly-sized deep networks practical.
A known limitation of the GPU approach is that the training
speed-up is small when the model doesn't fit in a GPU's
memory (typically less than 6 gigabytes).
To use a GPU effectively, researchers often reduce the size of
the dataset or parameters so that CPU-to-GPU transfers are not
a significant bottleneck. While data and parameter reduction
work well for small problems (e.g. acoustic modeling for speech
recognition), they are less attractive for problems with a large
number of examples and dimensions (e.g., high-resolution
images).
In the previous post, we talked about 5 different algorithms for efficient deep learning inference.
In this article, we'll discuss the upper right part of the quadrant on the left. What are the best
research techniques to train deep neural networks more efficiently?
1 — Parallelization Training
Let's start with parallelization. As the figure below shows, the
number of transistors keeps increasing over the years. But
single-threaded performance and frequency are plateauing in
recent years. Interestingly, the number of cores is increasing.
So what we really need to know is how to parallelize the
problem to take advantage of parallel processing. There are a
lot of opportunities to do that in deep neural networks.
For example, we can do data parallelism: feeding 2 images
into the same model and running them at the same time. This
does not affect latency for any single input. It doesn't make it
shorter, but it makes the batch size larger. It also requires
coordinated weight updates during training.
For example, in Jeff Dean's paper “Large Scale Distributed Deep
Networks,” there's a parameter server (as a master) and a
couple of model workers (as slaves) running their own pieces of
training data and updating the gradient to the master.
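As a rough present-day illustration of data parallelism (a sketch, not the setup described in that paper), PyTorch's nn.DataParallel replicates a model across the visible GPUs and splits each batch between the replicas:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

# Replicate the model on every visible GPU; each replica gets a slice of the batch and the
# gradients are gathered back on the default device. Falls back to a single device otherwise.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

images = torch.randn(32, 3, 64, 64, device=device)          # this batch is split across replicas
labels = torch.randint(0, 10, (32,), device=device)
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()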
Another idea is model parallelism — splitting up the model
and distributing each part to different processors or different
threads. For example, imagine we want to run convolution in
the image below by doing a 6-dimension “for” loop. What we
can do is cut the input image by 2x2 blocks, so that each
thread/processor handles 1/4 of the image. Also, we can
parallelize the convolutional layers by the output or input
feature map regions, and the fully-connected layers by the
output activation.
...
2 — Mixed Precision Training
Larger models usually require more compute and memory
resources to train. These requirements can be lowered by using
reduced precision representation and arithmetic.
Performance (speed) of any program, including neural network
training and inference, is limited by one of three factors:
arithmetic bandwidth, memory bandwidth, or latency.
Reduced precision addresses two of these limiters. Memory
bandwidth pressure is lowered by using fewer bits to store the
same number of values. Arithmetic time can also be lowered on
processors that offer higher throughput for reduced precision
math. For example, half-precision math throughput in recent
GPUs is 2× to 8× higher than for single-precision. In addition
to speed improvements, reduced precision formats also reduce
the amount of memory required for training.
Modern deep learning training systems use a single-precision
(FP32) format. In their paper “Mixed Precision Training,”
researchers from NVIDIA and Baidu addressed training with
reduced precision while maintaining model accuracy.
Specifically, they trained various neural networks using the
IEEE half-precision format (FP16). Since FP16 format has a
narrower dynamic range than FP32, they introduced three
techniques to prevent model accuracy loss: maintaining a
master copy of weights in FP32, loss-scaling that minimizes
gradient values becoming zeros, and FP16 arithmetic with
accumulation in FP32.
Using these techniques, they
demonstrated that a wide
variety of network
architectures and
applications can be trained
to match the accuracy of
FP32 training. Experimental
results include convolutional
and recurrent network
architectures, trained for classification, regression, and
generative tasks.
Applications include image classification, image generation,
object detection, language modeling, machine translation, and
speech recognition. The proposed methodology requires no
changes to models or training hyperparameters.
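Those three ingredients map closely onto the automatic mixed precision utilities in current frameworks. The following is a hedged PyTorch sketch (not NVIDIA/Baidu's original setup; the model and hyperparameters are placeholders):

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(512, 10).to(device)                   # parameters stay in FP32 (the "master copy")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)     # loss scaling keeps tiny gradients from flushing to zero

x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):      # FP16 math where it is safe, FP32 elsewhere
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                              # unscales gradients, then updates the FP32 weights
    scaler.update()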
3 — Model Distillation
Model distillation refers to the idea of model compression by
teaching a smaller network exactly what to do, step-by-step,
using a bigger, already-trained network. The soft labels refer
to the output feature maps by the bigger network after every
convolution layer. The smaller network is then trained to learn
the exact behavior of the bigger network by trying to replicate
its outputs at every level (not just the final loss).
The method was first proposed by Bucila et al., 2006 and
generalized by Hinton et al., 2015. In distillation, knowledge is
transferred from the teacher model to the student by
minimizing a loss function in which the target is the
distribution of class probabilities predicted by the teacher
model. That is — the output of a softmax function on the
teacher model's logits.
So how do teacher-student
networks exactly work?
The highly-complex teacher
network is first trained
separately using the
complete dataset. This step
requires high computational
performance and thus can
only be done offline (on
high-performing GPUs).
While designing a student network, correspondence needs
to be established between intermediate outputs of the
student network and the teacher network. This
correspondence can involve directly passing the output of a
layer in the teacher network to the student network, or
performing some data augmentation before passing it to the
student network.
Next, the data are forward-passed through the teacher
network to get all intermediate outputs, and then data
augmentation (if any) is applied to the same.
Finally, the outputs from the teacher network are back-
propagated through the student network so that the student
network can learn to replicate the behavior of the teacher
network.
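A minimal sketch of the soft-target part of this training objective (the temperature and mixing weight are illustrative values, not taken from any specific paper):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Mix the usual supervised loss with a KL term on temperature-softened teacher outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                  # rescale so both terms keep comparable gradient magnitude
    return alpha * hard + (1.0 - alpha) * soft

student_logits = torch.randn(8, 5)
teacher_logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)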
...
4 — Dense-Sparse-Dense Training
The research paper “Dense-Sparse-Dense Training for Deep
Neural Networks” was published back in 2017 by researchers
from Stanford, NVIDIA, Baidu, and Facebook. Applying Dense-
Sparse-Dense (DSD) takes 3 sequential steps:
Dense: Normal neural net training…business as usual. It's
notable that even though DSD acts as a regularizer, the
usual regularization methods such as dropout and weight
regularization can be applied as well. The authors don't
mention batch normalization, but it would work as well.
Sparse: We regularize the
network by removing
connections with small
weights. From each layer in
the network, a percentage of
the layer's weights that are
closest to 0 in absolute value is selected to be pruned. This
means that they are set to 0 at each training iteration. It's
worth noting that the pruned weights are selected only
once, not at each SGD iteration. Eventually, the network
recovers the pruned weights' knowledge and condenses it in
the remaining ones. We train this sparse net until
convergence.
Dense: First, we re-enable the pruned weights from the
previous step. The net is again trained normally until
convergence. This step increases the capacity of the model.
It can use the recovered capacity to store new knowledge.
The authors note that the learning rate should be 1/10th of
the original. Since the model is already performing well, the
lower learning rate helps preserve the knowledge gained in
the previous step.
Removing pruning in the dense step allows the training to
escape saddle points to eventually reach a better minimum.
This lower minimum corresponds to improved training and
validation metrics.
Saddle points are areas in the multidimensional space of the
model that might not be a good solution but are hard to escape
from. The authors hypothesize that the lower minimum is
achieved because the sparsity in the network moves the
optimization problem to a lower-dimensional space. This space
is more robust to noise in the training data.
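A compressed sketch of the three DSD phases (layer sizes, pruning fraction, and function names are illustrative, not from the paper):

import torch
import torch.nn as nn

def make_dsd_masks(model, prune_fraction=0.3):
    """Sparse step: for each weight matrix, mask the fraction of weights closest to zero."""
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() < 2:                         # skip biases
                continue
            k = int(prune_fraction * p.numel())
            threshold = p.abs().flatten().kthvalue(k).values if k > 0 else -1.0
            masks[name] = (p.abs() > threshold).float()
    return masks

def apply_masks(model, masks):
    """Re-zero the pruned weights after every optimizer step during the Sparse phase."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
masks = make_dsd_masks(model, prune_fraction=0.3)   # Sparse: train, calling apply_masks() after each step
apply_masks(model, masks)
# Dense (re-dense) phase: stop applying the masks and keep training at ~1/10th of the learning rate.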
The authors tested DSD on image classification (CNN), caption
generation (RNN), and speech recognition (LSTM). The
proposed method improved accuracy across all three tasks. It's
quite remarkable that DSD works across domains.
DSD improved all CNN models tested — ResNet50, VGG,
and GoogLeNet. The improvement in absolute top-1
accuracy was respectively 1.12%, 4.31%, and 1.12%. This
corresponds to a relative improvement of 4.66%, 13.7%,
and 3.6%. These results are remarkable for such finely-
tuned models!
DSD was applied to
NeuralTalk, an amazing
model that generates a
description from an image.
To verify that the Dense-
Sparse-Dense method works
on an LSTM, the CNN part of
Neural Talk is frozen. Only
the LSTM layers are trained. Very high (80% deducted by
the validation set) pruning was applied at the Sparse step.
Still, this gives the Neural Talk BLEU score an average
relative improvement of 6.7%. It's fascinating that such a
minor adjustment produces this much improvement.
Applying DSD to speech recognition (Deep Speech 1)
achieves an average relative improvement of Word Error
Rate of 3.95%. On a similar but more advanced Deep
Speech 2 model Dense-Sparse-Dense is applied iteratively
two times. On the first iteration, pruning 50% of the
weights, then 25% of the weights are pruned. After these
two DSD iterations, the average relative improvement is
6.5%.
Conclusion
I hope that I've managed to explain these research techniques
for efficient training of deep neural networks in a transparent
way. Work on this post allowed me to grasp how novel and
clever these techniques are. A solid understanding of these
approaches will allow you to incorporate them into your model
training procedure when needed.
...

View File

@ -0,0 +1,678 @@
The State of Sparsity in Deep Neural Networks
Trevor Gale*¹† Erich Elsen*² Sara Hooker¹†
* Equal contribution. † This work was completed as part of the Google AI Residency. ¹ Google Brain ² DeepMind. Correspondence to: Trevor Gale <tgale@google.com>.
arXiv:1902.09574v1 [cs.LG] 25 Feb 2019

Abstract

We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. Additionally, we repeat the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.

1. Introduction

Deep neural networks achieve state-of-the-art performance in a variety of domains including image classification (He et al., 2016), machine translation (Vaswani et al., 2017), and text-to-speech (van den Oord et al., 2016; Kalchbrenner et al., 2018). While model quality has been shown to scale with model and dataset size (Hestness et al., 2017), the resources required to train and deploy large neural networks can be prohibitive. State-of-the-art models for tasks like image classification and machine translation commonly have tens of millions of parameters, and require billions of floating-point operations to make a prediction for a single input sample.

Sparsity has emerged as a leading approach to address these challenges. By sparsity, we refer to the property that a subset of the model parameters have a value of exactly zero². With zero valued weights, any multiplications (which dominate neural network computation) can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. It has been shown empirically that deep neural networks can tolerate high levels of sparsity (Han et al., 2015; Narang et al., 2017; Ullrich et al., 2017), and this property has been leveraged to significantly reduce the cost associated with the deployment of deep neural networks, and to enable the deployment of state-of-the-art models in severely resource constrained environments (Theis et al., 2018; Kalchbrenner et al., 2018; Valin & Skoglund, 2018).

² The term sparsity is also commonly used to refer to the proportion of a neural network's weights that are zero valued. Higher sparsity corresponds to fewer weights, and smaller computational and storage requirements. We use the term in this way throughout this paper.

Over the past few years, numerous techniques for inducing sparsity have been proposed and the set of models and datasets used as benchmarks has grown too large to reasonably expect new approaches to explore them all. In addition to the lack of standardization in modeling tasks, the distribution of benchmarks tends to slant heavily towards convolutional architectures and computer vision tasks, and the tasks used to evaluate new techniques are frequently not representative of the scale and complexity of real-world tasks where model compression is most useful. These characteristics make it difficult to come away from the sparsity literature with a clear understanding of the relative merits of different approaches.

In addition to practical concerns around comparing techniques, multiple independent studies have recently proposed that the value of sparsification in neural networks has been misunderstood (Frankle & Carbin, 2018; Liu et al., 2018). While both papers suggest that sparsification can be viewed as a form of neural architecture search, they disagree on what is necessary to achieve this. Specifically, Liu et al.
(2018) re-train learned sparse topologies with a random weight initialization, whereas Frankle & Carbin (2018) posit that the exact random weight initialization used when the sparse architecture was learned is needed to match the test set performance of the model sparsified during optimization.

In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and l0 regularization to models of this scale. While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification³.

³ https://bit.ly/2ExE8Yj

2. Sparsity in Neural Networks

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks.

Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Ström, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017).

Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b;a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.

Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).

Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).

A key feature of a sparsity inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3. Evaluating Sparsification Techniques at Scale

As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and l0 regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing the lower bound of compression-accuracy trade-off any method should be expected to achieve.

Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A.
3.1. Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model_pruning library [4]. This technique allows masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user-specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).

Table 1. Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

    Hyperparameter          Value
    dataset                 translate_wmt_ende_packed
    training iterations     500000
    batch size              2048 tokens
    learning rate schedule  standard transformer_base
    optimizer               Adam
    sparsity range          50% - 98%
    beam search             beam size 4; length penalty 0.6

[4] https://bit.ly/2T8hBGn
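For concreteness, here is a minimal NumPy sketch of the gradual, sorting-based magnitude pruning loop described above. It assumes the cubic sparsity ramp of Zhu & Gupta (2017); the helper names (`sparsity_schedule`, `prune_step`) and the toy loop are ours, not the TensorFlow model_pruning API.

```python
import numpy as np

def sparsity_schedule(step, begin, end, final_sparsity, initial_sparsity=0.0):
    # Cubic ramp from Zhu & Gupta (2017): sparsity grows from initial_sparsity
    # at `begin` to final_sparsity at `end`, then stays constant.
    if step < begin:
        return initial_sparsity
    if step >= end:
        return final_sparsity
    progress = (step - begin) / float(end - begin)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def prune_step(weights, target_sparsity):
    # Sorting-based threshold: mask the smallest-magnitude fraction of weights.
    # Because the mask is recomputed from the dense weights (which keep receiving
    # gradient updates in the real library), previously masked weights can reactivate.
    k = int(round(target_sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Toy usage: prune a single weight matrix to 90% sparsity between steps 200 and 800.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for step in range(0, 1001, 100):                      # pruning applied every 100 steps
    s = sparsity_schedule(step, begin=200, end=800, final_sparsity=0.9)
    mask = prune_step(w, s)
    effective_w = w * mask                            # what the forward pass would use
```

In the actual library the mask is applied inside the layer's forward pass while the underlying variable continues to receive gradients, which is what allows reactivation.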
3.2. Variational Dropout

Variational dropout was originally proposed as a re-interpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).

Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model.

3.3. l0 Regularization

l0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.

3.4. Random Pruning Baseline

For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned at each step at random rather than based on magnitude, and does not allow pruned weights to reactivate. This technique is intended to represent a lower bound of the accuracy-sparsity trade-off curve.

3.5. Experimental Framework

For magnitude pruning, we used the TensorFlow model_pruning library. We implemented variational dropout and l0 regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our l0 regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and l0 regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model_pruning library to randomly select weights as opposed to sorting them based on magnitude.

For each model, we kept the number of training steps constant across all techniques and performed extensive hyperparameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and l0 regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments.
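To make the contrast with the random baseline of Section 3.4 explicit, here is a small NumPy sketch (ours, reusing `sparsity_schedule` from the magnitude pruning sketch above): weights to prune are chosen uniformly at random and, unlike magnitude pruning, never reactivate.

```python
import numpy as np

def random_prune_step(mask, target_sparsity, rng):
    # Grow the pruned set to `target_sparsity`, choosing new victims uniformly
    # at random from the currently unpruned entries. The mask only shrinks,
    # so pruned weights never reactivate (Section 3.4).
    flat = mask.ravel().copy()
    target_pruned = int(round(target_sparsity * flat.size))
    extra = target_pruned - int(flat.size - flat.sum())
    if extra > 0:
        alive = np.flatnonzero(flat)
        flat[rng.choice(alive, size=extra, replace=False)] = 0.0
    return flat.reshape(mask.shape)

rng = np.random.default_rng(0)
mask = np.ones((64, 64))
for step in range(0, 1001, 100):
    s = sparsity_schedule(step, begin=200, end=800, final_sparsity=0.9)
    mask = random_prune_step(mask, s, rng)
```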
Figure 1. Sparsity-BLEU trade-off curves for the Transformer. Top: Pareto frontiers for each of the four sparsification techniques applied to the Transformer. Bottom: All experimental results with each technique. Despite the diversity of approaches, the relative performance of all three techniques is remarkably consistent. Magnitude pruning notably outperforms more complex techniques for high levels of sparsity.

Figure 2. Average sparsity in Transformer layers. Distributions calculated on the top performing model at 90% sparsity for each technique. l0 regularization and variational dropout are able to learn non-uniform distributions of sparsity, while magnitude pruning induces user-specified sparsity distributions (in this case, uniform).

4. Sparse Neural Machine Translation

We adapted the Transformer (Vaswani et al., 2017) model for neural machine translation to use these four sparsification techniques, and trained the model on the WMT 2014 English-German dataset. We sparsified all fully-connected layers and embeddings, which make up 99.87% of all of the parameters in the model (the other parameters coming from biases and layer normalization). The constant hyperparameters used for all experiments are listed in Table 1. We followed the standard training procedure used by Vaswani et al. (2017), but did not perform checkpoint averaging. This setup yielded a baseline BLEU score of 27.29 averaged across five runs.

We extensively tuned the remaining hyperparameters for each technique. Details on the hyperparameters we explored, and the settings that produced the best models, can be found in Appendix D.

4.1. Sparse Transformer Results & Analysis

All results for the Transformer are plotted in Figure 1. Despite the vast differences in these approaches, the relative performance of all three techniques is remarkably consistent. While l0 regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models. While all techniques were able to outperform the random pruning technique, randomly removing weights produces surprisingly reasonable results, which is perhaps indicative of the model's ability to recover from damage during optimization.

What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights for each layer. This is in stark contrast to variational dropout and l0 regularization, where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a non-uniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and l0 regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.

Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both l0 regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules, and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, l0 regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.

It is also important to note that these results maintain a constant number of training steps across all techniques, and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than l0 regularization and variational dropout respectively. While the standard Transformer training scheme produces excellent results for machine translation, it has been shown that training the model for longer can improve its performance by as much as 2 BLEU (Ott et al., 2018). Thus, when compared for a fixed training cost, magnitude pruning has a distinct advantage over these more complicated techniques.
Table 2. Constant hyperparameters for all ResNet-50 experiments.

    Hyperparameter          Value
    dataset                 ImageNet
    training iterations     128000
    batch size              1024 images
    learning rate schedule  standard
    optimizer               SGD with Momentum
    sparsity range          50% - 98%
5. Sparse Image Classification

To benchmark these four sparsity techniques on a large-scale computer vision task, we integrated each method into ResNet-50 and trained the model on the ImageNet large-scale image classification dataset. We sparsified all convolutional and fully-connected layers, which make up 99.79% of all of the parameters in the model (the other parameters coming from biases and batch normalization).

Figure 3. Sparsity-accuracy trade-off curves for ResNet-50. Top: Pareto frontiers for variational dropout, magnitude pruning, and random pruning applied to ResNet-50. Bottom: All experimental results with each technique. We observe large variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50. Magnitude pruning and variational dropout achieve comparable performance for most sparsity levels, with variational dropout achieving the best results for high sparsity levels.

The hyperparameters we used for all experiments are listed in Table 2. Each model was trained for 128000 iterations with a batch size of 1024 images, stochastic gradient descent with momentum, and the standard learning rate schedule (see Appendix E.1). This setup yielded a baseline top-1 accuracy of 76.69% averaged across three runs. We trained each model with 8-way data parallelism across 8 accelerators. Due to the extra parameters and operations required for variational dropout, the model was unable to fit into device memory in this configuration. For all variational dropout experiments, we used a per-device batch size of 32 images and scaled the model over 32 accelerators.

5.1. ResNet-50 Results & Analysis

Figure 3 shows results for magnitude pruning, variational dropout, and random pruning applied to ResNet-50. Surprisingly, we were unable to produce sparse ResNet-50 models with l0 regularization that did not significantly damage model quality. Across hundreds of experiments, our models were either able to achieve full test set performance with no sparsification, or sparsification with test set performance akin to random guessing. Details on all hyperparameter settings explored are included in Appendix E.

This result is particularly surprising given the success of l0 regularization on Transformer. One nuance of the l0 regularization technique of Louizos et al. (2017b) is that the model can have varying sparsity levels between the training and test-time versions of the model. At training time, a parameter with a dropout rate of 10% will be zero 10% of the time when sampled from the hard-concrete distribution. However, under the test-time parameter estimator, this weight will be non-zero [5]. Louizos et al. (2017b) reported results applying l0 regularization to a wide residual network (WRN) (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset, and noted that they observed small accuracy loss at as low as an 8% reduction in the number of parameters during training. Applying our weight-level l0 regularization implementation to WRN produces a model with comparable training-time sparsity, but with no sparsity in the test-time parameters. For models that achieve test-time sparsity, we observe significant accuracy degradation on CIFAR-10. This result is consistent with our observation for l0 regularization applied to ResNet-50 on ImageNet.

The variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50 is striking. While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind l0 regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces models on par with or better than magnitude pruning, and that l0 regularization is not able to produce sparse models at all.

[5] The fraction of time a parameter is set to zero during training depends on other factors, e.g. the β parameter of the hard-concrete distribution. However, the general point holds: the training and test-time sparsities are not necessarily equivalent, and there exists some dropout rate threshold below which a weight that is sometimes zero during training will be non-zero at test-time.
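To make footnote [5] concrete, a rough back-of-the-envelope calculation of ours (not from the paper), using the parameterization reproduced in Appendix A.3 and the default hard-concrete parameters from Appendix D.3 (β = 2/3, γ = −0.1, ζ = 1.1): a gate that is zero 10% of the time during training satisfies sigmoid(β log(−γ/ζ) − log α) = 0.1, which gives log α ≈ 0.60. The deterministic test-time estimator then yields ẑ = min(1, max(0, sigmoid(log α)(ζ − γ) + γ)) ≈ min(1, max(0, 0.645 × 1.2 − 0.1)) ≈ 0.67, so the weight is retained with a clearly non-zero gate at test time even though it was frequently zeroed during training.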
Figure 4. Average sparsity in ResNet-50 layers. Distributions calculated on the top performing model at 95% sparsity for each technique. Variational dropout is able to learn non-uniform distributions of sparsity, decreasing sparsity in the input and output layers that are known to be disproportionately important to model quality.

Figure 5. Sparsity-accuracy trade-off curves for ResNet-50 with modified sparsification scheme. Altering the distribution of sparsity across the layers and increasing training time yield significant improvement for magnitude pruning.

Variational dropout achieved particularly notable results in the high sparsity range, maintaining a top-1 accuracy over 70% with less than 4% of the parameters of a standard ResNet-50.

The distribution of sparsity across different layer types in the best variational dropout and magnitude pruning models at 95% sparsity is plotted in Figure 4. While we kept sparsity constant across all layers for magnitude and random pruning, variational dropout significantly reduces the amount of sparsity induced in the first and last layers of the model.

It has been observed that the first and last layers are often disproportionately important to model quality (Han et al., 2015; Bellec et al., 2017). In the case of ResNet-50, the first convolution comprises only .037% of all the parameters in the model. At 98% sparsity the first layer has only 188 non-zero parameters, for an average of less than 3 parameters per output feature map. With magnitude pruning uniformly sparsifying each layer, it is surprising that it is still able to achieve any test set performance at all with so few parameters in the input convolution.

While variational dropout is able to learn to distribute sparsity non-uniformly across the layers, it comes at a significant increase in resource requirements. For ResNet-50 trained with variational dropout we observed a greater than 2x increase in memory consumption. When scaled across 32 accelerators, ResNet-50 trained with variational dropout completed training in 9.75 hours, compared to ResNet-50 with magnitude pruning finishing in 12.50 hours on only 8 accelerators. Scaled to a 4096 batch size and 32 accelerators, ResNet-50 with magnitude pruning can complete the same number of epochs in just 3.15 hours.

5.2. Pushing the Limits of Magnitude Pruning

Given that a uniform distribution of sparsity is suboptimal, and given the significantly smaller resource requirements for applying magnitude pruning to ResNet-50, it is natural to wonder how well magnitude pruning could perform if we were to distribute the non-zero weights more carefully and increase training time.

To understand the limits of the magnitude pruning heuristic, we modify our ResNet-50 training setup to leave the first convolutional layer fully dense, and only prune the final fully-connected layer to 80% sparsity. This heuristic is reasonable for ResNet-50, as the first layer makes up a small fraction of the total parameters in the model and the final layer makes up only .03% of the total FLOPs. While tuning the magnitude pruning ResNet-50 models, we observed that the best models always started and ended pruning during the third learning rate phase, before the second learning rate drop. To take advantage of this, we increase the number of training steps by 1.5x by extending this learning rate region. Results for ResNet-50 trained with this scheme are plotted in Figure 5.

With these modifications, magnitude pruning outperforms variational dropout at all but the highest sparsity levels while still using fewer resources. However, variational dropout's performance in the high sparsity range is particularly notable. With very low amounts of non-zero weights, we find it likely that the model's performance on the test set is closely tied to the precise allocation of weights across the different layers, and that variational dropout's ability to learn this distribution enables it to better maintain accuracy at high sparsity levels. This result indicates that efficient sparsification techniques that are able to learn the distribution of sparsity across layers are a promising direction for future work.

It's also worth noting that these changes produced models at 80% sparsity with a top-1 accuracy of 76.52%, only .17% off our baseline ResNet-50 accuracy and .41% better than the results reported by He et al. (2018), without the
extra complexity and computational requirements of their
reinforcement learning approach. This represents a new
state-of-the-art sparsity-accuracy trade-off for ResNet-50
trained on ImageNet.
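As a purely illustrative rendering of the modified scheme from Section 5.2, the per-layer sparsity targets could be expressed as a simple lookup; the layer names below are hypothetical placeholders, not the actual variable names in the ResNet-50 implementation.

```python
def per_layer_sparsity(layer_name, global_target):
    # Modified magnitude pruning scheme (Section 5.2): keep the first
    # convolution dense, prune the final fully-connected layer to a fixed 80%,
    # and prune every other layer to the global target.
    if layer_name == "initial_conv":   # ~.037% of parameters, disproportionately important
        return 0.0
    if layer_name == "final_dense":    # final classifier, only ~.03% of total FLOPs
        return 0.80
    return global_target

targets = {name: per_layer_sparsity(name, 0.95)
           for name in ["initial_conv", "block1/conv1", "block4/conv3", "final_dense"]}
```

Extending the third learning rate region by 1.5x, as described above, is the other half of the recipe.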
6. Sparsification as Architecture Search
While sparsity is traditionally thought of as a model com-
pression technique, two independent studies have recently
suggested that the value of sparsification in neural net-
works is misunderstood, and that once a sparse topology
is learned it can be trained from scratch to the full perfor-
mance achieved when sparsification was performed jointly
with optimization.
Frankle & Carbin (2018) posited that over-parameterized neural networks contain small, trainable subsets of weights, deemed "winning lottery tickets". They suggest that sparsity inducing techniques are methods for finding these sparse topologies, and that once found, the sparse architectures can be trained from scratch with the same weight initialization that was used when the sparse architecture was learned. They demonstrated that this property holds across different convolutional neural networks and multi-layer perceptrons trained on the MNIST and CIFAR-10 datasets.

Liu et al. (2018) similarly demonstrated this phenomenon for a number of activation sparsity techniques on convolutional neural networks, as well as for weight level sparsity learned with magnitude pruning. However, they demonstrate this result using a random initialization during re-training.

Figure 6. Scratch and lottery ticket experiments with magnitude pruning. Top: results with Transformer. Bottom: results with ResNet-50. Across all experiments, training from scratch using a learned sparse architecture is unable to reproduce the performance of models trained with sparsification as part of the optimization process.

The implications of being able to train sparse architectures from scratch once they are learned are large: once a sparse topology is learned, it can be saved and shared as with any other neural network architecture. Re-training can then be done fully sparse, taking advantage of sparse linear algebra to greatly accelerate time-to-solution. However, the combination of these two studies does not clearly establish how this potential is to be realized.

Beyond the question of whether or not the original random weight initialization is needed, both studies only explore convolutional neural networks (and small multi-layer perceptrons in the case of Frankle & Carbin (2018)). The majority of experiments in both studies also limited their analyses to the MNIST, CIFAR-10, and CIFAR-100 datasets. While these are standard benchmarks for deep learning models, they are not indicative of the complexity of real-world tasks where model compression is most useful. Liu et al. (2018) do explore convolutional architectures on the ImageNet dataset, but only at two relatively low sparsity levels (30% and 60%). They also note that weight level sparsity on ImageNet is the only case where they are unable to reproduce the full accuracy of the pruned model.

To clarify the questions surrounding the idea of sparsification as a form of neural architecture search, we repeat the experiments of Frankle & Carbin (2018) and Liu et al. (2018) on ResNet-50 and Transformer. For each model, we explore the full range of sparsity levels (50% - 98%) and compare to our well-tuned models from the previous sections.

6.1. Experimental Framework

The experiments of Liu et al. (2018) encompass taking the final learned weight mask from a magnitude pruning model, randomly re-initializing the weights, and training the model with the normal training procedure (i.e., learning rate, number of iterations, etc.). To account for the presence of sparsity at the start of training, they scale the variance of the initial weight distribution by the number of non-zeros in the matrix. They additionally train a variant where they increase the number of training steps (up to a factor of 2x) such that the re-trained model uses approximately the same number of FLOPs during training as a model trained with sparsification as part of the optimization process. They refer to these two experiments as "scratch-e" and "scratch-b" respectively.

Frankle & Carbin (2018) follow a similar procedure, but use the same weight initialization that was used when the sparse weight mask was learned and do not perform the longer training time variant.
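The following sketch shows one way the "scratch" re-initialization could look under our reading of the description above: fresh random weights whose variance is rescaled for the reduced fan-in implied by the learned mask. The exact initializer and scaling direction used by Liu et al. (2018) may differ; `base_std` and the 1/density scaling are assumptions for illustration only.

```python
import numpy as np

def scratch_reinit(mask, rng, base_std=0.05):
    # Re-initialize a pruned layer for "scratch" training: draw fresh weights,
    # then rescale the standard deviation so the variance accounts for the
    # fraction of non-zero connections (i.e. the effective fan-in). base_std
    # stands in for whatever dense initializer the layer would normally use.
    density = mask.sum() / mask.size                  # fraction of weights kept by the mask
    std = base_std / np.sqrt(max(density, 1e-8))
    return rng.normal(scale=std, size=mask.shape) * mask   # only the learned topology is trainable

rng = np.random.default_rng(0)
mask = (rng.random((256, 256)) < 0.1).astype(np.float64)   # a 90%-sparse mask for illustration
w0 = scratch_reinit(mask, rng)
```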
For our experiments, we repeat the scratch-e, scratch-b and lottery ticket experiments with magnitude pruning on Transformer and ResNet-50. For scratch-e and scratch-b, we also train variants that do not alter the initial weight distribution. For the Transformer, we re-trained five replicas of the best magnitude pruning hyperparameter settings at each sparsity level and saved the weight initialization and final sparse weight mask. For each of the five learned weight masks, we train five identical replicas for the scratch-e, scratch-b, scratch-e with augmented initialization, scratch-b with augmented initialization, and the lottery ticket experiments. For ResNet-50, we followed the same procedure with three re-trained models and three replicas at each sparsity level for each of the five experiments. Figure 6 plots the averages and min/max of all experiments at each sparsity level [6].

6.2. Scratch and Lottery Ticket Results & Analysis

Across all of our experiments, we observed that training from scratch using a learned sparse architecture is not able to match the performance of the same model trained with sparsification as part of the optimization process.

Across both models, we observed that doubling the number of training steps did improve the quality of the results for the scratch experiments, but was not sufficient to match the test set performance of the magnitude pruning baseline. As sparsity increased, we observed that the deviation between the models trained with magnitude pruning and those trained from scratch increased. For both models, we did not observe a benefit from using the augmented weight initialization for the scratch experiments.

For ResNet-50, we experimented with four different learning rate schemes for the scratch-b experiments. We found that scaling each learning rate region to double the number of epochs produced the best results by a wide margin. These results are plotted in Figure 6. Results for the ResNet-50 scratch-b experiments with the other learning rate variants are included with our release of hyperparameter tuning results.

For the lottery ticket experiments, we were not able to replicate the phenomenon observed by Frankle & Carbin (2018). The key difference between our experiments is the complexity of the tasks and scale of the models, and it seems likely that this is the main factor contributing to our inability to train these architectures from scratch.

For the scratch experiments, our results are consistent with the negative result observed by Liu et al. (2018) for ImageNet and ResNet-50 with unstructured weight pruning. By replicating the scratch experiments at the full range of sparsity levels, we observe that the quality of the models degrades relative to the magnitude pruning baseline as sparsity increases. For unstructured weight sparsity, it seems likely that the phenomenon observed by Liu et al. (2018) was produced by a combination of low sparsity levels and small-to-medium sized tasks. We'd like to emphasize that this result is only for unstructured weight sparsity, and that prior work (Liu et al., 2018) provides strong evidence that activation pruning behaves differently.

7. Limitations of This Study

Hyperparameter exploration. For all techniques and models, we carefully hand-tuned hyperparameters and performed extensive sweeps encompassing thousands of experiments over manually identified ranges of values. However, the number of possible settings vastly outnumbers the set of values that can be practically explored, and we cannot eliminate the possibility that some techniques significantly outperform others under settings we did not try.

Neural architectures and datasets. Transformer and ResNet-50 were chosen as benchmark tasks to represent a cross section of large-scale deep learning tasks with diverse architectures. We can't exclude the possibility that some techniques achieve consistently high performance across other architectures. More models and tasks should be thoroughly explored in future work.

8. Conclusion

In this work, we performed an extensive evaluation of three state-of-the-art sparsification techniques on two large-scale learning tasks. Notwithstanding the limitations discussed in Section 7, we demonstrated that complex techniques shown to yield state-of-the-art compression on small datasets perform inconsistently, and that simple heuristics can achieve comparable or better results on a reduced computational budget. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 with only magnitude pruning and highlight promising directions for research in sparsity inducing techniques.

Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression. As such, we open-source our code, checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work.

[6] Two of the 175 Transformer experiments failed to train from scratch at all and produced BLEU scores less than 1.0. We omit these outliers in Figure 6.
Acknowledgements

We would like to thank Benjamin Caine, Jonathan Frankle, Raphael Gontijo Lopes, Sam Greydanus, and Keren Gu for helpful discussions and feedback on drafts of this paper.

References

Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. Deep Rewiring: Training Very Sparse Deep Networks. CoRR, abs/1711.05136, 2017.

Collins, M. D. and Kohli, P. Memory Bounded Deep Convolutional Networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442.

Dai, B., Zhu, C., and Wipf, D. P. Compressing Neural Networks using the Variational Information Bottleneck. CoRR, abs/1802.10399, 2018.

Frankle, J. and Carbin, M. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. CoRR, abs/1803.03635, 2018. URL http://arxiv.org/abs/1803.03635.

Gray, S., Radford, A., and Kingma, D. P. Block-sparse GPU kernels. https://blog.openai.com/block-sparse-gpu-kernels/, 2017.

Guo, Y., Yao, A., and Chen, Y. Dynamic Network Surgery for Efficient DNNs. In NIPS, 2016.

Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both Weights and Connections for Efficient Neural Network. In NIPS, pp. 1135-1143, 2015.

Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, pp. 164-171. Morgan Kaufmann, 1992.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770-778, 2016.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815-832, 2018.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017.

Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2415-2424, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal Brain Damage. In NIPS, pp. 598-605. Morgan Kaufmann, 1989.

Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In NIPS, pp. 2178-2188, 2017.

Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2755-2763, 2017.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the Value of Network Pruning. CoRR, abs/1810.05270, 2018.

Louizos, C., Ullrich, K., and Welling, M. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 3290-3300, 2017a.

Louizos, C., Welling, M., and Kingma, D. P. Learning Sparse Neural Networks through L0 Regularization. CoRR, abs/1712.01312, 2017b.

Luo, J., Wu, J., and Lin, W. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5068-5076, 2017.

Mitchell, T. J. and Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science. Nature Communications, 2018.

Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2498-2507, 2017.

Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR, abs/1611.06440, 2016.

Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. Exploring Sparsity in Recurrent Neural Networks. CoRR, abs/1704.05119, 2017.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 1-9, 2018.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pp. 1278-1286. JMLR.org, 2014.

Ström, N. Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks. In EUROSPEECH, 1997.

Theis, L., Korshunova, I., Tejani, A., and Huszár, F. Faster gaze prediction with dense networks and Fisher pruning. CoRR, abs/1801.05787, 2018. URL http://arxiv.org/abs/1801.05787.

Ullrich, K., Meeds, E., and Welling, M. Soft Weight-Sharing for Neural Network Compression. CoRR, abs/1702.04008, 2017.

Valin, J. and Skoglund, J. LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR, abs/1810.11846, 2018. URL http://arxiv.org/abs/1810.11846.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, pp. 125, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000-6010, 2017.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.

Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017. URL http://arxiv.org/abs/1710.01878.
The State of Sparsity in Deep Neural Networks: Appendix

A. Overview of Sparsity Inducing Techniques

Here we provide a more detailed review of the three sparsity techniques we benchmarked.

A.1. Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. Many variants have been proposed (Collins & Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017), with the key differences lying in when weights are removed, whether weights should be sorted to remove a precise proportion or thresholded based on a fixed or decaying value, and whether or not weights that have been pruned still receive gradient updates and have the potential to return after being pruned.

Han et al. (2015) use iterative magnitude pruning and re-training to progressively sparsify a model. The target model is first trained to convergence, after which a portion of weights are removed and the model is re-trained with these weights fixed to zero. This process is repeated until the target sparsity is achieved. Guo et al. (2016) improve on this approach by allowing masked weights to still receive gradient updates, enabling the network to recover from incorrect pruning decisions during optimization. They achieve higher compression rates and interleave pruning steps with gradient update steps to avoid expensive re-training. Zhu & Gupta (2017) similarly allow gradient updates to masked weights, and make use of a gradual sparsification schedule with sorting-based weight thresholding to maintain accuracy while achieving a user-specified level of sparsification.

It's worth noting that magnitude pruning can easily be adapted to induce block or activation level sparsity by removing groups of weights based on their p-norm, average, max, or other statistics. Variants have also been proposed that maintain a constant level of sparsity during optimization to enable accelerated training (Mocanu et al., 2018).

A.2. Variational Dropout

Consider the setting of a dataset D of N i.i.d. samples (x, y) and a standard classification problem where the goal is to learn the parameters w of the conditional probability p(y | x, w). Bayesian inference combines some initial belief over the parameters w in the form of a prior distribution p(w) with observed data D into an updated belief over the parameters in the form of the posterior distribution p(w | D). In practice, computing the true posterior using Bayes' rule is computationally intractable, and good approximations are needed. In variational inference, we optimize the parameters θ of some parameterized model q_θ(w) such that q_θ(w) is a close approximation to the true posterior distribution p(w | D) as measured by the Kullback-Leibler divergence between the two distributions. The divergence of our approximate posterior from the true posterior is minimized in practice by maximizing the variational lower-bound

    L(\theta) = -D_{KL}(q_\theta(w) \,\|\, p(w)) + L_D(\theta),
    where L_D(\theta) = \sum_{(x,y) \in D} \mathbb{E}_{q_\theta(w)}[\log p(y \mid x, w)].

Using the Stochastic Gradient Variational Bayes (SGVB) (Kingma et al., 2015) algorithm to optimize this bound, L_D(θ) reduces to the standard cross-entropy loss, and the KL divergence between our approximate posterior and prior over the parameters serves as a regularizer that enforces our initial belief about the parameters w.

In the standard formulation of variational dropout, we assume the weights are drawn from a fully-factorized Gaussian approximate posterior:

    w_{ij} \sim q_\theta(w_{ij}) = \mathcal{N}(\theta_{ij}, \alpha_{ij}\theta_{ij}^2)

where θ and α are neural network parameters. For each training step, we sample weights from this distribution and use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) to differentiate the loss w.r.t. the parameters through the sampling operation. Given that the weights are normally distributed, the distribution of the activations B after a linear operation like matrix multiplication or convolution is also Gaussian and can be calculated in closed form [7]:

    q_\theta(b_{mj} \mid A) \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}),
    with \gamma_{mj} = \sum_{i=1}^{K} a_{mi}\theta_{ij} and \delta_{mj} = \sum_{i=1}^{K} a_{mi}^2 \alpha_{ij} \theta_{ij}^2,

where a_{mi} ∈ A are the inputs to the layer.

[7] We ignore correlation in the activations, as is done by Molchanov et al. (2017).
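A minimal NumPy sketch (ours) of sampling activations from the closed-form Gaussian above, which is the basis of the local reparameterization trick discussed next; the function and variable names, and details such as the small epsilon, are illustrative assumptions.

```python
import numpy as np

def sample_activations(a, theta, log_alpha, rng):
    # b ~ N(gamma, delta) with gamma = A @ theta and
    # delta = A^2 @ (alpha * theta^2), per the equations above.
    alpha = np.exp(log_alpha)
    gamma = a @ theta
    delta = (a ** 2) @ (alpha * theta ** 2)
    eps = rng.standard_normal(gamma.shape)
    return gamma + np.sqrt(delta + 1e-8) * eps   # one independent sample per example

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 32))                 # batch of 8 inputs with K = 32 features
theta = 0.1 * rng.standard_normal((32, 16))      # posterior means
log_alpha = np.full((32, 16), -3.0)              # small per-weight dropout rates to start
b = sample_activations(a, theta, log_alpha, rng)
```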
Thus, rather than sample weights, we can directly sample the activations at each layer. This step is known as the local reparameterization trick, and was shown by Kingma et al. (2015) to reduce the variance of the gradients relative to the standard formulation, in which a single set of sampled weights must be shared for all samples in the input batch for efficiency. Molchanov et al. (2017) showed that the variance of the gradients could be further reduced by using an additive noise reparameterization, where we define a new parameter

    \sigma_{ij}^2 = \alpha_{ij} \theta_{ij}^2.

Under this parameterization, we directly optimize the mean and variance of the neural network parameters.

Under the assumption of a log-uniform prior on the weights w, the KL divergence component of our objective function, D_{KL}(q_\theta(w_{ij}) \,\|\, p(w_{ij})), can be accurately approximated (Molchanov et al., 2017):

    D_{KL}(q_\theta(w_{ij}) \,\|\, p(w_{ij})) \approx -k_1\,\sigma(k_2 + k_3 \log \alpha_{ij}) + 0.5 \log(1 + \alpha_{ij}^{-1}) + k_1,
    k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695.

After training a model with variational dropout, the weights with the highest α values can be removed. For all their experiments, Molchanov et al. (2017) removed weights with log α larger than 3.0, which corresponds to a dropout rate greater than 95%. Although they demonstrated good results, it is likely that the optimal α threshold varies across different models and even different hyperparameter settings of the same model. We address this question in our experiments.

A.3. l0 Regularization

To optimize the l0-norm, we reparameterize the model weights θ as the product of a weight and a random variable drawn from the hard-concrete distribution:

    \theta_j = \tilde{\theta}_j z_j,
    where z_j = \min(1, \max(0, \bar{s})), \quad \bar{s} = s(\zeta - \gamma) + \gamma,
    s = \mathrm{sigmoid}\big((\log u - \log(1 - u) + \log \alpha_j)/\beta\big), \quad u \sim U(0, 1).

In this formulation, the α parameter that controls the position of the hard-concrete distribution (and thus the probability that z_j is zero) is optimized with gradient descent. β, γ, and ζ are fixed parameters that control the shape of the hard-concrete distribution: β controls the curvature or temperature of the hard-concrete probability density function, and γ and ζ stretch the distribution s.t. z_j takes value 0 or 1 with non-zero probability.

On each training iteration, z_j is sampled from this distribution and multiplied with the standard neural network weights. The expected l0-norm L_C can then be calculated using the cumulative distribution function of the hard-concrete distribution and optimized directly with stochastic gradient descent:

    L_C = \sum_{j=1}^{|\theta|} \big(1 - Q_{\bar{s}_j}(0 \mid \phi)\big) = \sum_{j=1}^{|\theta|} \mathrm{sigmoid}\Big(\log \alpha_j - \beta \log \frac{-\gamma}{\zeta}\Big).

At test-time, Louizos et al. (2017b) use the following estimate for the model parameters:

    \theta = \tilde{\theta} \odot \hat{z},
    \hat{z} = \min\big(1, \max\big(0, \mathrm{sigmoid}(\log \alpha)(\zeta - \gamma) + \gamma\big)\big).

Interestingly, Louizos et al. (2017b) showed that their objective function under the l0 penalty is a special case of a variational lower-bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior.
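The hard-concrete machinery above is compact enough to sketch directly. The following NumPy fragment is our own illustration (using the default β, γ, ζ reported in Appendix D.3) of the training-time gate sample, the expected-L0 penalty, and the deterministic test-time estimator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_gates(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    # Training-time sample of the hard-concrete gates z in [0, 1].
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # Expected number of non-zero gates: the penalty L_C above.
    return np.sum(sigmoid(log_alpha - beta * np.log(-gamma / zeta)))

def test_time_gates(log_alpha, gamma=-0.1, zeta=1.1):
    # Deterministic estimator z_hat used at evaluation time.
    return np.clip(sigmoid(log_alpha) * (zeta - gamma) + gamma, 0.0, 1.0)

rng = np.random.default_rng(0)
log_alpha = rng.normal(size=(4, 4))
z = sample_gates(log_alpha, rng)      # multiplied elementwise with the weights during training
penalty = expected_l0(log_alpha)      # added to the loss, weighted by the l0-norm coefficient
z_hat = test_time_gates(log_alpha)    # fixed gates applied at test time
```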
B. Variational Dropout Implementation Verification

To verify our implementation of variational dropout, we applied it to LeNet-300-100 and LeNet-5-Caffe on MNIST and compared our results to the original paper (Molchanov et al., 2017). We matched our hyperparameters to those used in the code released with the paper [8]. All results are listed in Table 3.

Table 3. Variational Dropout MNIST Reproduction Results.

    Network         Experiment                          Sparsity (%)   Accuracy (%)
    LeNet-300-100   original (Molchanov et al., 2017)   98.57          98.08
                    ours (log α = 3.0)                  97.52          98.42
                    ours (log α = 2.0)                  98.50          98.40
                    ours (log α = 0.1)                  99.10          98.13
    LeNet-5-Caffe   original (Molchanov et al., 2017)   99.60          99.25
                    ours (log α = 3.0)                  99.29          99.26
                    ours (log α = 2.0)                  99.50          99.25

Our baseline LeNet-300-100 model achieved a test set accuracy of 98.42%, slightly higher than the baseline of 98.36% reported in (Molchanov et al., 2017). Applying our variational dropout implementation to LeNet-300-100 with these hyperparameters produced a model with 97.52% global sparsity and 98.42% test accuracy. The original paper produced a model with 98.57% global sparsity and 98.08% test accuracy. While our model achieves .34% higher test accuracy with 1% lower sparsity, we believe the discrepancy is mainly due to differences in our software packages: the authors of (Molchanov et al., 2017) used Theano and Lasagne for their experiments, while we use TensorFlow.

Given that our model achieves the highest accuracy, we can decrease the log α threshold to trade accuracy for more sparsity. With a log α threshold of 2.0, our model achieves 98.5% global sparsity with a test set accuracy of 98.40%. With a log α threshold of 0.1, our model achieves 99.1% global sparsity with 98.13% test set accuracy, exceeding the sparsity and accuracy of the originally published results.

On LeNet-5-Caffe, our implementation achieved a global sparsity of 99.29% with a test set accuracy of 99.26%, versus the originally published results of 99.6% sparsity with 99.25% accuracy. Lowering the log α threshold to 2.0, our model achieves 99.5% sparsity with 99.25% test accuracy.

[8] https://github.com/ars-ashuha/variational-dropout-sparsifies-dnn

C. l0 Regularization Implementation Verification

The original l0 regularization paper uses a modified version of the proposed technique for inducing group sparsity in models, so our weight-level implementation is not directly comparable. However, to verify our implementation we trained a Wide ResNet (WRN) (Zagoruyko & Komodakis, 2016) on CIFAR-10 and compared results to those reported in the original publication for group sparsity.

As done by Louizos et al. (2017b), we apply l0 to the first convolutional layer in the residual blocks (i.e., where dropout would normally be used). We use the weight decay formulation for the re-parameterized weights, and scale the weight decay coefficient to maintain the same initial length scale of the parameters. We use the same batch size of 128 samples and the same initial log α, and train our model on a single GPU.

Our baseline WRN-28-10 implementation trained on CIFAR-10 achieved a test set accuracy of 95.45%. Using our l0 regularization implementation and an l0-norm weight of .0003, we trained a model that achieved 95.34% accuracy on the test set while achieving a consistent training-time FLOPs reduction comparable to that reported by Louizos et al. (2017b). The floating-point operations (FLOPs) required to compute the forward pass over the course of training WRN-28-10 with l0 are plotted in Figure 7.

Figure 7. Forward pass FLOPs for WRN-28-10 trained with l0 regularization. Our implementation achieves FLOPs reductions comparable to those reported in Louizos et al. (2017b).

During our re-implementation of the WRN experiments from Louizos et al. (2017b), we identified errors in the original publication's FLOP calculations that caused the number of floating-point operations in WRN-28-10 to be miscalculated. We've contacted the authors, and hope to resolve this issue to clarify their performance results.

D. Sparse Transformer Experiments

D.1. Magnitude Pruning Details

For our magnitude pruning experiments, we tuned four key hyperparameters: the starting iteration of the sparsification process, the ending iteration of the sparsification process, the frequency of pruning steps, and the combination of other regularizers (dropout and label smoothing) used during training. We trained models with 7 different target sparsities: 50%, 60%, 70%, 80%, 90%, 95%, and 98%. At each of these sparsity levels, we tried pruning frequencies of 1000 and 10000 steps. During preliminary experiments we identified that the best settings for the training step to stop pruning at were typically closer to the end of training. Based on this insight, we explored every possible combination of start and end points for the sparsity schedule in increments of 100000 steps with an ending step of 300000 or greater.

By default, the Transformer uses dropout with a dropout rate of 10% on the input to the encoder, decoder, and before each layer, and performs label smoothing with a smoothing parameter of .1. We found that decreasing these other regularizers produced higher quality models in the mid to high sparsity range. For each hyperparameter combination, we tried three different regularization settings: standard label smoothing and dropout, label smoothing only, and no regularization.

D.2. Variational Dropout Details

For the Transformer trained with variational dropout, we extensively tuned the coefficient for the KL divergence component of the objective function to find models that achieved high accuracy with sparsity levels in the target range. We found that KL divergence weights in the range [0.1/N, 1/N], where N is the number of samples in the training set, produced models in our target sparsity range.
(Molchanov et al., 2017) noted difficulty training some models from scratch with variational dropout, as large portions of the model adopt high dropout rates early in training before the model can learn a useful representation from the data. To address this issue, they use a gradual ramp-up of the KL divergence weight, linearly increasing the regularizer coefficient until it reaches the desired value.

For our experiments, we explored using a constant regularizer weight, linearly increasing the regularizer weight, and also increasing the regularizer weight following the cubic sparsity function used with magnitude pruning. For the linear and cubic weight schedules, we tried each combination of possible start and end points in increments of 100000 steps. For each hyperparameter combination, we also tried the three different combinations of dropout and label smoothing as with magnitude pruning. For each trained model, we evaluated the model with 11 log α thresholds in the range [0, 5]. For all experiments, we initialized all log σ² parameters to the constant value -10.

D.3. l0 Regularization Details

For Transformers trained with l0 regularization, we similarly tuned the coefficient for the l0-norm in the objective function. We observed that much higher magnitude regularization coefficients were needed to produce models with the same sparsity levels relative to variational dropout. We found that l0-norm weights in the range [1/N, 100/N] produced models in our target sparsity range.

For all experiments, we used the default settings for the parameters of the hard-concrete distribution: β = 2/3, γ = -0.1, and ζ = 1.1. We initialized the log α parameters to 2.197, corresponding to a 10% dropout rate.

For each hyperparameter setting, we explored the three regularizer coefficient schedules used with variational dropout and each of the three combinations of dropout and label smoothing.

D.4. Random Pruning Details

We identified in preliminary experiments that random pruning typically produces the best results by starting and ending pruning early and allowing the model to finish the rest of the training steps with the final sparse weight mask. For our experiments, we explored all hyperparameter combinations that we explored with magnitude pruning, and also included start/end pruning step combinations with an end step of less than 300000.

E. Sparse ResNet-50

E.1. Learning Rate

For all experiments, we used the learning rate scheme used by the official TensorFlow ResNet-50 implementation [9]. With our batch size of 1024, this includes a linear ramp-up for 5 epochs to a learning rate of .4, followed by learning rate drops by a factor of 0.1 at epochs 30, 60, and 80.

[9] https://bit.ly/2Wd2Lk0

E.2. Magnitude Pruning Details

For magnitude pruning on ResNet-50, we trained models with a target sparsity of 50%, 70%, 80%, 90%, 95%, and 98%. At each sparsity level, we tried starting pruning at steps 8k, 20k, and 40k. For each potential starting point, we tried ending pruning at steps 68k, 76k, and 100k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k steps and explored training with and without label smoothing. During preliminary experiments, we observed that removing weight decay from the model consistently caused significant decreases in test accuracy. Thus, for all hyperparameter combinations, we left weight decay on with the standard coefficient.

For a target sparsity of 98%, we observed that very few hyperparameter combinations were able to complete training without failing due to numerical issues. Out of all the hyperparameter configurations we tried, only a single model was able to complete training without erroring from the presence of NaNs. As explained in the main text, at high sparsity levels the first layer of the model has very few non-zero parameters, leading to instability during training and low test set performance. Pruned ResNet-50 models with the first layer left dense did not exhibit these issues.

E.3. Variational Dropout Details

For variational dropout applied to ResNet-50, we explored the same combinations of start and end points for the KL divergence weight ramp-up as we did for the start and end points of magnitude pruning. For all Transformer experiments, we did not observe a significant gain from using a cubic KL divergence weight ramp-up schedule, and thus only explored the linear ramp-up for ResNet-50. For each combination of start and end points for the KL divergence weight, we explored 9 different coefficients for the KL divergence loss term: .01/N, .03/N, .05/N, .1/N, .3/N, .5/N, 1/N, 10/N, and 100/N.

Contrary to our experience with Transformer, we found ResNet-50 with variational dropout to be highly sensitive to the initialization of the log σ² parameters. With the standard setting of -10, we couldn't match the baseline accuracy, and with an initialization of -20 our models achieved
good test performance but no sparsity. After some experimentation, we were able to produce good results with an initialization of -15.

While with Transformer we saw a reasonable amount of variance in test set performance and sparsity for the same model evaluated at different log α thresholds, we did not observe the same phenomenon for ResNet-50. Across a range of log α values, we saw consistent accuracy and nearly identical sparsity levels. For all of the results reported in the main text, we used a log α threshold of 0.5, which we found to produce slightly better results than the standard threshold of 3.0.

E.4. l0 Regularization Details

For l0 regularization, we explored four different initial log α values corresponding to dropout rates of 1%, 5%, 10%, and 30%. For each dropout rate, we extensively tuned the l0-norm weight to produce models in the desired sparsity range. After identifying the proper range of l0-norm coefficients, we ran experiments with 20 different coefficients in that range. For each combination of these hyperparameters, we tried all four combinations of other regularizers: standard weight decay and label smoothing, only weight decay, only label smoothing, and no regularization. For weight decay, we used the formulation for the reparameterized weights provided in the original paper, and followed their approach of scaling the weight decay coefficient based on the initial dropout rate to maintain a constant length-scale between the l0 regularized model and the standard model.

Across all of these experiments, we were unable to produce ResNet models that achieved a test set performance better than random guessing. For all experiments, we observed that training proceeded reasonably normally until the l0-norm loss began to drop, at which point the model incurred severe accuracy loss. We include the results of all hyperparameter combinations in our data release.

Additionally, we tried a number of tweaks to the learning process to improve the results, to no avail. We explored training the model for twice the number of epochs, training with much higher initial dropout rates, modifying the β parameter for the hard-concrete distribution, and a modified test-time parameter estimator.

E.5. Random Pruning Details

For random pruning on ResNet-50, we shifted the set of possible start and end points for pruning earlier in training relative to those we explored for magnitude pruning. At each of the sparsity levels tried with magnitude pruning, we tried starting pruning at steps 0, 8k, and 20k. For each potential starting point, we tried ending pruning at steps 40k, 68k, and 76k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k and explored training with and without label smoothing.

E.6. Scratch-B Learning Rate Variants

For the scratch-b (Liu et al., 2018) experiments with ResNet-50, we explored four different learning rate schemes for the extended training time (2x the default number of epochs).

The first learning rate scheme we explored was uniformly scaling each of the five learning rate regions to last for double the number of epochs. This setup produced the best results by a wide margin. We report these results in the main text.

The second learning rate scheme was to keep the standard learning rate, and maintain the final learning rate for the extra training steps, as is common when fine-tuning deep neural networks. The third learning rate scheme was to maintain the standard learning rate, and continually drop the learning rate by a factor of 0.1 every 30 epochs. The last scheme we explored was to skip the learning rate warm-up, and drop the learning rate by 0.1 every 30 epochs. This learning rate scheme is closest to the one used by Liu et al. (2018). We found that this scheme underperformed relative to the scaled learning rate scheme with our training setup.

Results for all learning rate schemes are included with the released hyperparameter tuning data.
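As a closing illustration, here is a small helper of our own (not the official implementation referenced in E.1) that reproduces the stepwise schedule from E.1 and, with `scale=2`, the uniformly stretched variant that E.6 reports as the best scratch-b scheme.

```python
def resnet50_lr(epoch, base_lr=0.4, warmup_epochs=5, drops=(30, 60, 80), scale=1.0):
    # E.1 schedule: linear warm-up to base_lr over 5 epochs, then 10x drops at
    # epochs 30, 60 and 80. scale=2 stretches every region to twice as many
    # epochs, which is the scratch-b variant that worked best (E.6).
    warmup = warmup_epochs * scale
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    lr = base_lr
    for d in drops:
        if epoch >= d * scale:
            lr *= 0.1
    return lr

standard = [resnet50_lr(e) for e in range(90)]                # the default schedule
scratch_b = [resnet50_lr(e, scale=2.0) for e in range(180)]   # doubled regions for scratch-b
```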

Binary file not shown.