diff --git a/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt b/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt new file mode 100644 index 0000000..0c2f968 --- /dev/null +++ b/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt @@ -0,0 +1,555 @@ + IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 1 + + + + A Survey of Model Compression and Acceleration + + for Deep Neural Networks + + Yu Cheng, Duo Wang, Pan Zhou,Member, IEEE,and Tao Zhang,Senior Member, IEEE + + + + + Abstract—Deep convolutional neural networks (CNNs) have [2], [3]. It is also very time-consuming to train such a model + recently achieved great success in many visual recognition tasks. to get reasonable performance. In architectures that rely only However, existing deep neural network models are computation- on fully-connected layers, the number of parameters can grow ally expensive and memory intensive, hindering their deployment + in devices with low memory resources or in applications with to billions [4]. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + arXiv:1710.09282v7 [cs.LG] 7 Feb 2019 strict latency requirements. Therefore, a natural thought is to As larger neural networks with more layers and nodes + perform model compression and acceleration in deep networks are considered, reducing their storage and computational cost + without significantly decreasing the model performance. During becomes critical, especially for some real-time applications the past few years, tremendous progress has been made in such as online learning and incremental learning. In addi- this area. In this paper, we survey the recent advanced tech- + niques for compacting and accelerating CNNs model developed. tion, recent years witnessed significant progress in virtual + These techniques are roughly categorized into four schemes: reality, augmented reality, and smart wearable devices, cre- + parameter pruning and sharing, low-rank factorization, trans- ating unprecedented opportunities for researchers to tackle + ferred/compact convolutional filters, and knowledge distillation. fundamental challenges in deploying deep learning systems to Methods of parameter pruning and sharing will be described at portable devices with limited resources (e.g. memory, CPU, the beginning, after that the other techniques will be introduced. + For each scheme, we provide insightful analysis regarding the energy, bandwidth). Efficient deep learning methods can have + performance, related applications, advantages, and drawbacks significant impacts on distributed systems, embedded devices, + etc. Then we will go through a few very recent additional and FPGA for Artificial Intelligence. For example, the ResNet- + successful methods, for example, dynamic capacity networks and 50 [5] with 50 convolutional layers needs over 95MB memory stochastic depths networks. After that, we survey the evaluation for storage and over 3.8 billion floating number multiplications matrix, the main datasets used for evaluating the model per- + formance and recent benchmarking efforts. Finally, we conclude when processing an image. After discarding some redundant + this paper, discuss remaining challenges and possible directions weights, the network still works as usual but saves more than + on this topic. 75% of parameters and 50% computational time. 
For devices + Index Terms—Deep Learning, Convolutional Neural Networks, like cell phones and FPGAs with only several megabyte + Model Compression and Acceleration, resources, how to compact the models used on them is also + important. + Achieving these goal calls for joint solutions from manyI. I NTRODUCTION disciplines, including but not limited to machine learning, op- + In recent years, deep neural networks have recently received timization, computer architecture, data compression, indexing, + lots of attention, been applied to different applications and and hardware design. In this paper, we review recent works + achieved dramatic accuracy improvements in many tasks. on compressing and accelerating deep neural networks, which + These works rely on deep networks with millions or even attracted a lot of attention from the deep learning community + billions of parameters, and the availability of GPUs with and already achieved lots of progress in the past years. + very high computation capability plays a key role in their We classify these approaches into four categories: pa- + success. For example, the work by Krizhevskyet al.[1] rameter pruning and sharing, low-rank factorization, trans- + achieved breakthrough results in the 2012 ImageNet Challenge ferred/compact convolutional filters, and knowledge distil- + using a network containing 60 million parameters with five lation. The parameter pruning and sharing based methods + convolutional layers and three fully-connected layers. Usually, explore the redundancy in the model parameters and try to + it takes two to three days to train the whole model on remove the redundant and uncritical ones. Low-rank factor- + ImagetNet dataset with a NVIDIA K40 machine. Another ization based techniques use matrix/tensor decomposition to + example is the top face verification results on the Labeled estimate the informative parameters of the deep CNNs. The + Faces in the Wild (LFW) dataset were obtained with networks approaches based on transferred/compact convolutional filters + containing hundreds of millions of parameters, using a mix design special structural convolutional filters to reduce the + of convolutional, locally-connected, and fully-connected layers parameter space and save storage/computation. The knowledge + distillation methods learn a distilled model and train a more Yu Cheng is a Researcher from Microsoft AI & Research, One Microsoft + Way, Redmond, WA 98052, USA. compact neural network to reproduce the output of a larger + Duo Wang and Tao Zhang are with the Department of Automation, network. + Tsinghua University, Beijing 100084, China. In Table I, we briefly summarize these four types of Pan Zhou is with the School of Electronic Information and Communi- methods. Generally, the parameter pruning & sharing, low- cations, Huazhong University of Science and Technology, Wuhan 430074, + China. rank factorization and knowledge distillation approaches can IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 2 + + + TABLE I + SUMMARIZATION OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION . 
+ Theme Name Description Applications More details + Parameter pruning and sharing Reducing redundant parameters which Convolutional layer and Robust to various settings, can achieve + are not sensitive to the performance fully connected layer good performance, can support both train + from scratch and pre-trained model + Low-rank factorization Using matrix/tensor decomposition to Convolutional layer and Standardized pipeline, easily to be + estimate the informative parameters fully connected layer implemented, can support both train + from scratch and pre-trained model + Transferred/compact convolutional Designing special structural convolutional Convolutional layer Algorithms are dependent on applications, + filters filters to save parameters only usually achieve good performance, + only support train from scratch + Knowledge distillation Training a compact neural network with Convolutional layer and Model performances are sensitive + distilled knowledge of a large model fully connected layer to applications and network structure + only support train from scratch + + + be used in DNN models with fully connected layers and + convolutional layers, achieving comparable performances. On + the other hand, methods using transferred/compact filters are + designed for models with convolutional layers only. Low-rank + factorization and transfered/compact filters based approaches + provide an end-to-end pipeline and can be easily implemented + in CPU/GPU environment, which is straightforward. while + parameter pruning & sharing use different methods such as + vector quantization, binary coding and sparse constraints to + perform the task. Generally it will take several steps to achieve + the goal. Fig. 1. The three-stage compression method proposed in [10]: pruning, Regarding the training protocols, models based on param- quantization and encoding. The input is the original model and the output + eter pruning/sharing low-rank factorization can be extracted is the compression model. + from pre-trained ones or trained from scratch. While the + transferred/compact filter and knowledge distillation models + can only support train from scratch. These methods are inde- memory usage and float point operations with little loss in + pendently designed and complement each other. For example, classification accuracy. + transferred layers and parameter pruning & sharing can be The method proposed in [10] quantized the link weights + used together, and model quantization & binarization can be using weight sharing and then applied Huffman coding to the + used together with low-rank approximations to achieve further quantized weights as well as the codebook to further reduce + speedup. We will describe the details of each theme, their the rate. As shown in Figure 1, it started by learning the con- + properties, strengths and drawbacks in the following sections. nectivity via normal network training, followed by pruning the + small-weight connections. Finally, the network was retrained + II. P to learn the final weights for the remaining sparse connections. ARAMETER PRUNING AND SHARING This work achieved the state-of-art performance among allEarly works showed that network pruning is effective in parameter quantization based methods. It was shown in [11] reducing the network complexity and addressing the over- that Hessian weight could be used to measure the importancefitting problem [6]. 
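To make the scalar quantization idea above concrete, the following is a minimal NumPy sketch (not the implementation of [6] or [10]) of k-means weight sharing for a single layer: the weights are clustered into 2^b shared values, and only the cluster indices plus a small codebook need to be stored. The layer size, bit-width, and initialization are arbitrary choices made for illustration.

import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    """Cluster weights into 2**bits shared values with a simple Lloyd iteration."""
    flat = weights.ravel()
    k = 2 ** bits
    # Initialize centroids linearly over the weight range (a common heuristic).
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return idx.reshape(weights.shape), centroids

W = np.random.randn(256, 256).astype(np.float32)        # a toy layer
idx, codebook = kmeans_quantize(W, bits=4)
W_shared = codebook[idx]                                 # shared-weight reconstruction

original_bits = W.size * 32
compressed_bits = W.size * 4 + codebook.size * 32        # 4-bit indices + codebook
print("mean quantization error:", float(np.abs(W - W_shared).mean()))
print("layer compression rate: %.1fx" % (original_bits / compressed_bits))

In the full pipeline of [10] the shared weights are fine-tuned and the index stream is further compressed with Huffman coding, which this sketch omits.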
After that researcher found pruning orig- of network parameters, and proposed to minimize Hessian-inally introduced to reduce the structure in neural networks weighted quantization errors in average for clustering networkand hence improve generalization, it has been widely studied parameters.to compress DNN models, trying to remove parameters which In the extreme case of the 1-bit representation of eachare not crucial to the model performance. These techniques can weight, that is binary weight neural networks. There arebe further classified into three sub-categories: quantization and many works that directly train CNNs with binary weights, forbinarization, parameter sharing, and structural matrix. instance, BinaryConnect [12], BinaryNet [13] and XNORNet- + works [14]. The main idea is to directly learn binary weights orA. Quantization and Binarization activation during the model training. The systematic study in + Network quantization compresses the original network by [15] showed that networks trained with back propagation could + reducing the number of bits required to represent each weight. be resilient to specific weight distortions, including binary + Gonget al.[6] and Wu et al. [7] appliedk-means scalar weights. + quantization to the parameter values. Vanhouckeet al.[8] Drawbacks: the accuracy of the binary nets is significantly + showed that 8-bit quantization of the parameters can result lowered when dealing with large CNNs such as GoogleNet. + in significant speed-up with minimal loss of accuracy. The Another drawback of such binary nets is that existing bina- + work in [9] used 16-bit fixed-point representation in stochastic rization schemes are based on simple matrix approximations + rounding based CNN training, which significantly reduced and ignore the effect of binarization on the accuracy loss. IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 3 + + + To address this issue, the work in [16] proposed a proximal connected layers, which is often the bottleneck in terms of + Newton algorithm with diagonal Hessian approximation that memory consumption. These network layers use the nonlinear + directly minimizes the loss with respect to the binary weights. transformsf(x;M) =(Mx), where()is an element-wise + The work in [17] reduced the time on float point multiplication nonlinear operator,xis the input vector, andMis themn + in the training stage by stochastically binarizing weights and matrix of parameters [29]. WhenMis a large general dense + converting multiplications in the hidden state computation to matrix, the cost of storingmnparameters and computing + significant changes. matrix-vector products inO(mn)time. Thus, an intuitive + way to prune parameters is to imposexas a parameterizedB. Pruning and Sharing structural matrix. Anmnmatrix that can be described + Network pruning and sharing has been used both to reduce using much fewer parameters thanmnis called a structured + network complexity and to address the over-fitting issue. An matrix. Typically, the structure should not only reduce the + early approach to pruning was the Biased Weight Decay memory cost, but also dramatically accelerate the inference + [18]. The Optimal Brain Damage [19] and the Optimal Brain and training stage via fast matrix-vector multiplication and + Surgeon [20] methods reduced the number of connections gradient computations. 
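Before turning to structured matrices in detail, the connection-pruning step described above (pruning the small-weight connections, as in [10], [22]) can be illustrated with a short sketch. This is a hedged illustration rather than the authors' procedure: it thresholds weights by magnitude for an assumed target sparsity and keeps a binary mask; in the full pipeline the surviving connections are retrained with the mask held fixed and stored in a sparse format.

import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]      # k-th smallest magnitude
    mask = np.abs(weights) >= threshold       # True = keep, False = pruned
    return weights * mask, mask

W = np.random.randn(512, 512).astype(np.float32)
W_pruned, mask = magnitude_prune(W, sparsity=0.9)

kept = int(mask.sum())
print("kept %d of %d weights (%.1f%% pruned)"
      % (kept, W.size, 100.0 * (1 - kept / W.size)))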
+ based on the Hessian of the loss function, and their work sug- Following this direction, the work in [30], [31] proposed a + gested that such pruning gave higher accuracy than magnitude- simple and efficient approach based on circulant projections, + while maintaining competitive error rates. Given a vectorr=based pruning such as the weight decay method. The training (rprocedure of those methods followed the way training from 0 ;r 1 ;;r d1 ), a circulant matrixR2Rdd is defined + as: + scratch manner. 2 3 r A recent trend in this direction is to prune redundant, 0 rd1 ::: r 2 r1 6r6 1 r0 rd1 r2 77 non-informative weights in a pre-trained CNN model. For 6 .. . 7 + example, Srinivas and Babu [21] explored the redundancy R= circ(r) :=66 . r . .. . 71 r0 . 7: (1)6 . 7 among neurons, and proposed a data-free pruning method to 4r . .. .. 5d2 rd1 + remove redundant neurons. Hanet al.[22] proposed to reduce rd1 rd2 ::: r 1 r0 + the total number of parameters and operations in the entire thus the memory cost becomesO(d)instead ofO(d2 ).network. Chenet al.[23] proposed a HashedNets model that This circulant structure also enables the use of Fast Fourierused a low-cost hash function to group weights into hash Transform (FFT) to speed up the computation. Given ad-buckets for parameter sharing. The deep compression method dimensional vectorr, the above 1-layer circulant neural net-in [10] removed the redundant connections and quantized the work in Eq. 1 has time complexity ofO(dlogd).weights, and then used Huffman coding to encode the quan- In [32], a novel Adaptive Fastfood transform was introducedtized weights. In [24], a simple regularization method based to reparameterize the matrix-vector multiplication of fullyon soft weight-sharing was proposed, which included both connected layers. The Adaptive Fastfood transform matrixquantization and pruning in one simple (re-)training procedure. R2Rnd was defined as:The above pruning schemes typically produce connections + pruning in CNNs. R=SHGHB (2) + There is also growing interest in training compact CNNs whereS,GandBare random diagonal matrices. 2 + with sparsity constraints. Those sparsity constraints are typ- f0;1gdd is a random permutation matrix, andHdenotes + ically introduced in the optimization problem asl0 orl1 - the Walsh-Hadamard matrix. Reparameterizing a fully con- + norm regularizers. The work in [25] imposed group sparsity nected layer withdinputs andnoutputs using the Adaptive + constraint on the convolutional filters to achieve structured Fastfood transform reduces the storage and the computational + brain Damage, i.e., pruning entries of the convolution kernels costs fromO(nd)toO(n)and fromO(nd)toO(nlogd), + in a group-wise fashion. In [26], a group-sparse regularizer respectively. + on neurons was introduced during the training stage to learn The work in [29] showed the effectiveness of the new + compact CNNs with reduced filters. Wenet al.[27] added a notion of parsimony in the theory of structured matrices. Their + structured sparsity regularizer on each layer to reduce trivial proposed method can be extended to various other structured + filters, channels or even layers. In the filter-level pruning, all matrix classes, including block and multi-level Toeplitz-like + the above works usedl2;1 -norm regularizers. The work in [28] [33] matrices related to multi-dimensional convolution [34]. + usedl1 -norm to select and prune unimportant filters. 
Following this idea, [35] proposed a general structured effi- + Drawbacks: there are some potential issues of the pruning cient linear layer for CNNs. + and sharing. First, pruning withl1 orl2 regularization requires Drawbacks: one problem of this kind of approaches is that + more iterations to converge than general. In addition, all the structural constraint will hurt the performance since the + pruning criteria require manual setup of sensitivity for layers, constraint might bring bias to the model. On the other hand, + which demands fine-tuning of the parameters and could be how to find a proper structural matrix is difficult. There is no + cumbersome for some applications. theoretical way to derive it out. + + C. Designing Structural Matrix III. L OW -RANK FACTORIZATION AND SPARSITY + In architectures that contain fully-connected layers, it is Convolution operations contribute the bulk of most com- + critical to explore this redundancy of parameters in fully- putations in deep CNNs, thus reducing the convolution layer IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 4 + + + TABLE II + COMPARISONS BETWEEN THE LOW -RANK MODELS AND THEIR BASELINES + ON ILSVRC-2012. + Model TOP-5 Accuracy Speed-up Compression Rate + AlexNet 80.03% 1. 1. + BN Low-rank 80.56% 1.09 4.94 + CP Low-rank 79.66% 1.82 5. + VGG-16 90.60% 1. 1. + Fig. 2. A typical framework of the low-rank regularization method. The left BN Low-rank 90.47% 1.53 2.72 + is the original convolutional layer and the right is the low-rank constraint CP Low-rank 90.31% 2.05 2.75 + convolutional layer with rank-K. GoogleNet 92.21% 1. 1. + BN Low-rank 91.88% 1.08 2.79 + CP Low-rank 91.79% 1.20 2.84 + would improve the compression rate as well as the overall + speedup. For the convolution kernels, it can be viewed as a + 4D tensor. Ideas based on tensor decomposition is derived by For instance, Mishaet al.[41] reduced the number of dynamic + the intuition that there is a significant amount of redundancy parameters in deep models using the low-rank method. [42] + in the 4D tensor, which is a particularly promising way to explored a low-rank matrix factorization of the final weight + remove the redundancy. Regarding the fully-connected layer, layer in a DNN for acoustic modeling. In [3], Luet al.adopted + it can be view as a 2D matrix and the low-rankness can also truncated SVD (singular value decomposition) to decompsite + help. the fully connected layer for designing compact multi-task + It has been a long time for using low-rank filters to acceler- deep learning architectures. + ate convolution, for example, high dimensional DCT (discrete Drawbacks: low-rank approaches are straightforward for + cosine transform) and wavelet systems using tensor products model compression and acceleration. The idea complements + to be constructed from 1D DCT transform and 1D wavelets recent advances in deep learning, such as dropout, recti- + respectively. Learning separable 1D filters was introduced fied units and maxout. However, the implementation is not + by Rigamontiet al.[36], following the dictionary learning that easy since it involves decomposition operation, which + idea. Regarding some simple DNN models, a few low-rank is computationally expensive. Another issue is that current + approximation and clustering schemes for the convolutional methods perform low-rank approximation layer by layer, and + kernels were proposed in [37]. 
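A common way to exploit this low-rankness in a fully connected layer is truncated SVD, as used for example in [3], [42]: an m-by-n weight matrix W is replaced by two rank-k factors, so a single layer becomes two thinner layers with k(m+n) parameters instead of mn. The sketch below uses arbitrary sizes and a random matrix purely to show the bookkeeping; trained weight matrices are far more redundant than a random one, so the approximation error reported here is pessimistic.

import numpy as np

m, n, k = 1024, 1024, 64                 # layer size and target rank (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n)).astype(np.float32)
x = rng.standard_normal(n).astype(np.float32)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                     # m x k factor
B = Vt[:k, :]                            # k x n factor

y_full = W @ x                           # m*n multiply-adds
y_lowrank = A @ (B @ x)                  # k*(m+n) multiply-adds

print("relative output error: %.3f"
      % (np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full)))
print("parameter reduction: %.1fx" % ((m * n) / (k * (m + n))))
# In practice W comes from a trained network and the two factors are fine-tuned
# afterwards to recover accuracy, as described above.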
They achieved 2speedup thus cannot perform global parameter compression, which + for a single convolutional layer with 1% drop in classification is important as different layers hold different information. + accuracy. The work in [38] proposed using different tensor Finally, factorization requires extensive model retraining to + decomposition schemes, reporting a 4.5speedup with 1% achieve convergence when compared to the original model. + drop in accuracy in text recognition. + The low-rank approximation was done layer by layer. The IV. T RANSFERRED /COMPACT CONVOLUTIONAL FILTERS + parameters of one layer were fixed after it was done, and the CNNs are parameter efficient due to exploring the trans-layers above were fine-tuned based on a reconstruction error lation invariant property of the representations to the inputcriterion. These are typical low-rank methods for compressing image, which is the key to the success of training very deep2D convolutional layers, which is described in Figure 2. Fol- models without severe over-fitting. Although a strong theorylowing this direction, Canonical Polyadic (CP) decomposition is currently missing, a large number of empirical evidenceof was proposed for the kernel tensors in [39]. Their work support the notion that both the translation invariant propertyused nonlinear least squares to compute the CP decomposition. and the convolutional weight sharing are important for good In [40], a new algorithm for computing the low-rank tensor predictive performance. The idea of using transferred convolu-decomposition for training low-rank constrained CNNs from tional filters to compress CNN models is motivated by recentscratch were proposed. It used Batch Normalization (BN) to works in [43], which introduced the equivariant group theory.transform the activation of the internal hidden units. In general, Letxbe an input,()be a network or layer andT()be theboth the CP and the BN decomposition schemes in [40] (BN transform matrix. The concept of equivalence is defined as:Low-rank) can be used to train CNNs from scratch. However, + there are few differences between them. For example, finding T‘ (x) = (Tx) (3)the best low-rank approximation in CP decomposition is an ill- + posed problem, and the best rank-K(Kis the rank number) indicating that transforming the inputxby the transformT() + approximation may not exist sometimes. While for the BN and then passing it through the network or layer()should + scheme, the decomposition always exists. We perform a simple give the same result as first mappingxthrough the network + comparison of both methods shown in Table II. The actual and then transforming the representation. Note that in Eq. + speedup and the compression rates are used to measure their (10), the transformsT()andT0 ()are not necessarily the + performances. same as they operate on different objects. According to this + As we mentioned before, the fully connected layers can theory, it is reasonable applying transform to layers or filters + be viewed as a 2D matrix and thus the above mentioned ()to compress the whole network models. From empirical + methods can also be applied there. There are several classical observation, deep CNNs also benefit from using a large set of + works on exploiting low-rankness in fully connected layers. 
convolutional filters by applying certain transformT()to a IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 5 + + + small set of base filters since it acts as a regularizer for the TABLE III + model. ASIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND + Following this direction, there are many recent reworks CIFAR-100. + proposed to build a convolutional layer from a set of base Model CIFAR-100 CIFAR-10 Compression Rate + filters [43]–[46]. What they have in common is that the VGG-16 34.26% 9.85% 1. + transformT()lies in the family of functions that only operate MBA [46] 33.66% 9.76% 2. + CRELU [45] 34.57% 9.92% 2. in the spatial domain of the convolutional filters. For example, CIRC [43] 35.15% 10.23% 4. + the work in [45] found that the lower convolution layers of DCNN [44] 33.57% 9.65% 1.62 + CNNs learned redundant filters to extract both positive and + negative phase information of an input signal, and definedT() Drawbacks: there are few issues to be addressed for ap-to be the simple negation function: proaches that apply transform constraints to convolutional fil- + T(Wx ) =W (4) ters. First, these methods can achieve competitive performance x for wide/flat architectures (like VGGNet) but not thin/deepwhereWx is the basis convolutional filter andW is the filter x ones (like GoogleNet, Residual Net). Secondly, the transferconsisting of the shifts whose activation is opposite to that assumptions sometimes are too strong to guide the learning,ofWx and selected after max-pooling operation. By doing making the results unstable in some cases.this, the work in [45] can easily achieve 2compression Using a compact filter for convolution can directly reducerate on all the convolutional layers. It is also shown that the the computation cost. The key idea is to replace the loosenegation transform acts as a strong regularizer to improve and over-parametric filters with compact blocks to improve the classification accuracy. The intuition is that the learning the speed, which significantly accelerate CNNs on severalalgorithm with pair-wise positive-negative constraint can lead benchmarks. Decomposing33convolution into two11to useful convolutional filters instead of redundant ones. convolutions was used in [48], which achieved significantIn [46], it was observed that magnitudes of the responses acceleration on object recognition. SqueezeNet [49] was pro-from convolutional kernels had a wide diversity of pattern posed to replace33convolution with11convolu-representations in the network, and it was not proper to discard tion, which created a compact neural network with about 50weaker signals with a single threshold. Thus a multi-bias non- fewer parameters and comparable accuracy when compared tolinearity activation function was proposed to generates more AlexNet.patterns in the feature space at low computational cost. The + transformT()was define as: V. K NOWLEDGE DISTILLATION T‘ (x) =Wx + (5) To the best of our knowledge, exploiting knowledge transfer + wherewere the multi-bias factors. The work in [47] con- (KT) to compress model was first proposed by Caruanaet + sidered a combination of rotation by a multiple of90 and al.[50]. They trained a compressed/ensemble model of strong + horizontal/vertical flipping with: classifiers with pseudo-data labeled, and reproduced the output + of the original larger network. But the work is limited toT‘ (x) =WT (6) shallow models. 
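Before continuing with knowledge distillation, the negation transform of Eq. (4) is worth a concrete sketch. In the construction of [45], each stored base filter also contributes its negation, so the layer produces twice as many response maps as it has learned filters, giving the 2x compression of convolutional parameters mentioned above. The PyTorch snippet below illustrates only this parameter sharing (filter and input sizes are arbitrary), not the full CReLU architecture.

import torch
import torch.nn.functional as F

# A small base filter bank: 16 filters, 3 input channels, 3x3 kernels.
base_filters = torch.randn(16, 3, 3, 3)
x = torch.randn(1, 3, 32, 32)

# Eq. (4): the transferred set contains each base filter and its negation,
# so 32 output channels are produced while only 16 filters are stored.
transferred = torch.cat([base_filters, -base_filters], dim=0)

y = F.conv2d(x, transferred, padding=1)
print(y.shape)            # torch.Size([1, 32, 32, 32])
# The negated responses become informative after a one-sided nonlinearity such
# as ReLU, which is why [45] pairs this sharing with CReLU-style activations.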
The idea has been recently adopted in [51] + whereWT was the transformation matrix which rotated the as knowledge distillation (KD) to compress deep and wide + original filters with angle2 f90;180;270g. In [43], the networks into shallower ones, where the compressed model + transform was generalized to any angle learned from data, and mimicked the function learned by the complex model. The + was directly obtained from data. Both works [47] and [43] main idea of KD based approaches is to shift knowledge from + can achieve good classification performance. a large teacher model into a small one by learning the class + The work in [44] definedT()as the set of translation distributions output via softmax. + functions applied to 2D filters: The work in [52] introduced a KD compression framework, + which eased the training of deep networks by following aT‘ (x) =T(;x;y)x;y2fk;:::;kg;(x;y)6=(0;0) (7) student-teacher paradigm, in which the student was penalized + whereT(;x;y)denoted the translation of the first operand by according to a softened version of the teacher’s output. The + (x;y)along its spatial dimensions, with proper zero padding framework compressed an ensemble of teacher networks into + at borders to maintain the shape. The proposed framework a student network of similar depth. The student was trained + can be used to 1) improve the classification accuracy as a to predict the output and the classification labels. Despite + regularized version of maxout networks, and 2) to achieve its simplicity, KD demonstrates promising results in various + parameter efficiency by flexibly varying their architectures to image classification tasks. The work in [53] aimed to address + compress networks. the network compression problem by taking advantage of + Table III briefly compares the performance of different depth neural networks. It proposed an approach to train thin + methods with transferred convolutional filters, using VGGNet but deep networks, called FitNets, to compress wide and + (16 layers) as the baseline model. The results are reported shallower (but still deep) networks. The method was extended + on CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is the idea to allow for thinner and deeper student models. In + observed that they can achieve reduction in parameters with order to learn from the intermediate representations of teacher + little or no drop in classification accuracy. network, FitNet made the student mimic the full feature maps IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 6 + + + of the teacher. However, such assumptions are too strict since layer with global average pooling [44], [62]. Network architec- + the capacities of teacher and student may differ greatly. ture such as GoogleNet or Network in Network, can achieve + All the above approaches are validated on MNIST, CIFAR- state-of-the-art results on several benchmarks by adopting + 10, CIFAR-100, SVHN and AFLW benchmark datasets, and this idea. However, these architectures have not been fully + experimental results show that these methods match or outper- optimized the utilization of the computing resources inside + form the teacher’s performance, while requiring notably fewer the network. This problem was noted by Szegedyet al.[62] + parameters and multiplications. and motivated them to increase the depth and width of the + There are several extension along this direction of dis- network while keeping the computational budget constant. 
+ tillation knowledge. The work in [54] trained a parametric The work in [63] targeted the Residual Network based + student model to approximate a Monte Carlo teacher. The model with a spatially varying computation time, called + proposed framework used online training, and used deep stochastic depth, which enabled the seemingly contradictory + neural networks for the student model. Different from previous setup to train short networks and used deep networks at test + works which represented the knowledge using the soften label time. It started with very deep networks, while during training, + probabilities, [55] represented the knowledge by using the for each mini-batch, randomly dropped a subset of layers + neurons in the higher hidden layer, which preserved as much and bypassed them with the identity function. Following this + information as the label probabilities, but are more compact. direction, thew work in [64] proposed a pyramidal residual + The work in [56] accelerated the experimentation process by networks with stochastic depth. In [65], Wuet al.proposed + instantaneously transferring the knowledge from a previous an approach that learns to dynamically choose which layers + network to each new deeper or wider network. The techniques of a deep network to execute during inference so as to best + are based on the concept of function-preserving transfor- reduce total computation. Veitet al.exploited convolutional + mations between neural network specifications. Zagoruyko networks with adaptive inference graphs to adaptively define + et al.[57] proposed Attention Transfer (AT) to relax the their network topology conditioned on the input image [66]. + assumption of FitNet. They transferred the attention maps that Other approaches to reduce the convolutional overheads in-are summaries of the full activations. clude using FFT based convolutions [67] and fast convolutionDrawbacks: KD-based approaches can make deeper models using the Winograd algorithm [68]. Zhaiet al.[69] proposed athinner and help significantly reduce the computational cost. strategy call stochastic spatial sampling pooling, which speed-However, there are a few disadvantages. One of those is that up the pooling operations by a more general stochastic version.KD can only be applied to classification tasks with softmax Saeedanet al.presented a novel pooling layer for convolu-loss function, which hinders its usage. Another drawback is tional neural networks termed detail-preserving pooling (DPP),the model assumptions sometimes are too strict to make the based on the idea of inverse bilateral filters [70]. Those worksperformance competitive with other type of approaches. only aim to speed up the computation but not reduce the + memory storage.VI. O THER TYPES OF APPROACHES + We first summarize the works utilizing attention-based + methods. Note that attention-based mechanism [58] can reduce VII. B ENCHMARKS , E VALUATION AND DATABASES + computations significantly by learning to selectively focus or In the past five years the deep learning community had“attend” to a few, task-relevant input regions. The work in made great efforts in benchmark models. One of the most[59] introduced the dynamic capacity network (DCN) that well-known model used in compression and acceleration forcombined two types of modules: the small sub-networks with CNNs is Alexnet [1], which has been occasionally usedlow capacity, and the large ones with high capacity. The low- for assessing the performance of compression. 
Other popularcapacity sub-networks were active on the whole input to first standard models include LeNets [71], All-CNN-nets [72] andfind the task-relevant areas, and then the attention mechanism many others. LeNet-300-100 is a fully connected networkwas used to direct the high-capacity sub-networks to focus on with two hidden layers, with 300 and 100 neurons each.the task-relevant regions. By dong this, the size of the CNNs LeNet-5 is a convolutional network that has two convolutionalmodel has been significantly reduced. layers and two fully connected layers. Recently, more andFollowing this direction, the work in [60] introduced the more state-of-the-art architectures are used as baseline modelsconditional computation idea, which only computes the gra- in many works, including network in networks (NIN) [73],dient for some important neurons. It proposed a sparsely- VGG nets [74] and residual networks (ResNet) [75]. Table IVgated mixture-of-experts Layer (MoE). The MoE module summarizes the baseline models commonly used in severalconsisted of a number of experts, each a simple feed-forward typical compression methods.neural network, and a trainable gating network that selected + a sparse combination of the experts to process each input. In The standard criteria to measure the quality of model + [61], dynamic deep neural networks (D2NN) were introduced, compression and acceleration are the compression and the + which were a type of feed-forward deep neural network that speedup rates. Assume thatais the number of the parameters + selected and executed a subset of D2NN neurons based on the in the original modelManda is that of the compressed + input. modelM , then the compression rate(M;M )ofM over + There have been other attempts to reduce the number of Mis aparameters of neural networks by replacing the fully connected (M;M ) = : (8)a IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 7 + + + TABLE IV or low rank factorization based methods. If you need + SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT end-to-end solutions for your problem, the low rank REPRESENTATIVE WORKS OF NETWORK COMPRESSION . and transferred convolutional filters approaches could be + Baseline Models Representative Works considered. + Alexnet [1] structural matrix [29], [30], [32] For applications in some specific domains, methods with low-rank factorization [40] human prior (like the transferred convolutional filters, Network in network [73] low-rank factorization [40] + VGG nets [74] transferred filters [44] structural matrix) sometimes have benefits. For example, + low-rank factorization [40] when doing medical images classification, transferred Residual networks [75] compact filters [49], stochastic depth [63] convolutional filters could work well as medical images parameter sharing [24] + All-CNN-nets [72] transferred filters [45] (like organ) do have the rotation transformation property. + LeNets [71] parameter sharing [24] Usually the approaches of pruning & sharing could give parameter pruning [20], [22] reasonable compression rate while not hurt the accuracy. + Thus for applications which requires stable model accu- + Another widely used measurement is the index space saving racy, it is better to utilize pruning & sharing. + defined in several papers [30], [35] as If your problem involves small/medium size datasets, you + can try the knowledge distillation approaches. 
The com-aa + (M;M ) = ; (9) pressed student model can take the benefit of transferringa knowledge from teacher model, making it robust datasets + whereaandaare the number of the dimension of the index which are not large. + space in the original model and that of the compressed model, As we mentioned before, techniques of the four groups + respectively. are orthogonal. It is reasonable to combine two or three + Similarly, given the running timesofMands ofM , of them to maximize the performance. For some spe- + the speedup rate(M;M )is defined as: cific applications, like object detection, which requires + s both convolutional and fully connected layers, you can(M;M ) = : (10)s compress the convolutional layers with low rank based + Most work used the average training time per epoch to measure method and the fully connected layers with a pruning + the running time, while in [30], [35], the average testing time technique. + was used. Generally, the compression rate and speedup rate B. Technique Challengesare highly correlated, as smaller models often results in faster + computation for both the training and the testing stages. Techniques for deep model compression and acceleration + Good compression methods are expected to achieve almost are still in the early stage and the following challenges still + the same performance as the original model with much smaller need to be addressed. + parameters and less computational time. However, for different Most of the current state-of-the-art approaches are built + applications with different CNN designs, the relation between on well-designed CNN models, which have limited free- + parameter size and computational time may be different. dom to change the configuration (e.g., network structural, + For example, it is observed that for deep CNNs with fully hyper-parameters). To handle more complicated tasks, + connected layers, most of the parameters are in the fully it should provide more plausible ways to configure the + connected layers; while for image classification tasks, float compressed models. + point operations are mainly in the first few convolutional layers Pruning is an effective way to compress and acceler- + since each filter is convolved with the whole image, which is ate CNNs. The current pruning techniques are mostly + usually very large at the beginning. Thus compression and designed to eliminate connections between neurons. On + acceleration of the network should focus on different type of the other hand, pruning channel can directly reduce the + layers for different applications. feature map width and shrink the model into a thinner + one. It is efficient but also challenging because removing + VIII. D ISCUSSION AND CHALLENGES channels might dramatically change the input of the + following layer.In this paper, we summarized recent efforts on compressing + and accelerating deep neural networks (DNNs). Here we dis- As we mentioned before, methods of structural matrix + and transferred convolutional filters impose prior humancuss more details about how to choose different compression knowledge to the model, which could significantly affectapproaches, and possible challenges/solutions on this area. the performance and stability. It is critical to investigate + how to control the impact of those prior knowledge.A. General Suggestions The methods of knowledge distillation provide many ben- + There is no golden rule to measure which approach is the efits such as directly accelerating model without special + best. 
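As a side note on the evaluation criteria of Section VII: the three measurements in Eqs. (8)-(10) are simple ratios and can be computed directly from parameter counts and running times, as in the small sketch below. The ResNet-50-style numbers are only illustrative, assuming roughly 25.6M original parameters reduced by 75% and a halved running time, in the spirit of the example quoted in the introduction.

def compression_rate(a, a_star):
    """Eq. (8): parameters of the original model over those of the compressed model."""
    return a / a_star

def index_space_saving(a, a_star):
    """Eq. (9): relative reduction of the index space, as used in [30], [35]."""
    return (a - a_star) / a_star

def speedup_rate(s, s_star):
    """Eq. (10): running time of the original model over that of the compressed model."""
    return s / s_star

print(compression_rate(25.6e6, 6.4e6))    # 4.0
print(index_space_saving(25.6e6, 6.4e6))  # 3.0
print(speedup_rate(1.0, 0.5))             # 2.0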
How to choose the proper method is really depending hardware or implementations. It is still worthy developing + on the applications and requirements. Here are some general KD-based approaches and exploring how to improve their + guidance we can provide: performances. + If the applications need compacted models from pre- Hardware constraints in various of small platforms (e.g., + trained models, you can choose either pruning & sharing mobile, robotic, self-driving car) are still a major problem IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 8 + + + to hinder the extension of deep CNNs. How to make full see more work for applications with larger deep nets (e.g., + use of the limited computational source and how to design video and image frames [88], [89]). + special compression methods for such platforms are still + challenges that need to be addressed. IX. ACKNOWLEDGMENTS + Despite the great achievements of these compression ap- + proaches, the black box mechanism is still the key barrier The authors would like to thank the reviewers and broader + to the adoption. Exploring the knowledge interpret-ability community for their feedback on this survey. In particular, + is still an important problem. we would like to thank Hong Zhao from the Department of + Automation of Tsinghua University for her help on modifying + C. Possible Solutions the paper. This research is supported by National Science + Foundation of China with Grant number 61401169.To solve the hyper-parameters configuration problem, we + can rely on the recent learning-to-learn strategies [76], [77]. + This framework provides a mechanism allowing the algorithm REFERENCES + to automatically learn how to exploit structure in the problem [1]A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with of interest. Very recently, leveraging reinforcement learning deep convolutional neural networks,” inNIPS, 2012. + to efficiently sample the design space and improve the model [2]Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the + compression has also been tried [78]. gap to human-level performance in face verification,” inCVPR, 2014. + [3]Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, “Fully- Channel pruning provides the efficiency benefit on both adaptive feature sharing in multi-task networks with applications in + CPU and GPU because no special implementation is required. person attribute classification,”CoRR, vol. abs/1611.05377, 2016. + But it is also challenging to handle the input configuration. [4]J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, + M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale One possible solution is to use the training-based channel distributed deep networks,” inNIPS, 2012. + pruning methods [79], which focus on imposing sparse con- [5]K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image + straints on weights during training. However, training from recognition,”CoRR, vol. abs/1512.03385, 2015. + [6]Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing scratch for such method is costly for very deep CNNs. In deep convolutional networks using vector quantization,”CoRR, vol. + [80], the authors provided an iterative two-step algorithm to abs/1412.6115, 2014. + effectively prune channels in each layer. [7]Y. W. Q. H. Jiaxiang Wu, Cong Leng and J. 
Cheng, “Quantized + convolutional neural networks for mobile devices,” inIEEE Conference Exploring new types of knowledge in the teacher models on Computer Vision and Pattern Recognition (CVPR), 2016. + and transferring it to the student models is useful for the [8]V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of + knowledge distillation (KD) approaches. Instead of directly re- neural networks on cpus,” inDeep Learning and Unsupervised Feature + Learning Workshop, NIPS 2011, 2011. ducing and transferring parameters, passing selectivity knowl- [9]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep + edge of neurons could be helpful. One can derive a way to learning with limited numerical precision,” inProceedings of the + select essential neurons related to the task [81], [82]. The 32Nd International Conference on International Conference on Machine + Learning - Volume 37, ser. ICML’15, 2015, pp. 1737–1746. intuition is that if a neuron is activated in certain regions [10]S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing + or samples, that implies these regions or samples share some deep neural networks with pruning, trained quantization and huffman + common properties that may relate to the task. coding,”International Conference on Learning Representations (ICLR), + 2016. For methods with the convolutional filters and the structural [11]Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network + matrix, we can conclude that the transformation lies in the quantization,”CoRR, vol. abs/1612.01543, 2016. + family of functions that only operations on the spatial dimen- [12]M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training deep + neural networks with binary weights during propagations,” inAdvances sions. Hence to address the imposed prior issue, one solution is in Neural Information Processing Systems 28: Annual Conference on + to provide a generalization of the aforementioned approaches Neural Information Processing Systems 2015, December 7-12, 2015, + in two aspects: 1) instead of limiting the transformation to Montreal, Quebec, Canada, 2015, pp. 3123–3131. + [13]M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural net- belong to a set of predefined transformations, let it be the works with weights and activations constrained to +1 or -1,”CoRR, vol. + whole family of spatial transformations applied on 2D filters abs/1602.02830, 2016. + or matrix, and 2) learn the transformation jointly with all the [14]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: + Imagenet classification using binary convolutional neural networks,” in model parameters. ECCV, 2016. + Regarding the use of CNNs in small platforms, proposing [15]P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, + some general/unified approaches is one direction. Wanget al. “Deep neural networks are robust to weight binarization and other non- + [83] presented a feature map dimensionality reduction method linear distortions,”CoRR, vol. abs/1606.01981, 2016. + [16]L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep by excavating and removing redundancy in feature maps gen- networks,”CoRR, vol. abs/1611.01600, 2016. + erated from different filters, which could also preserve intrinsic [17]Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks + information of the original network. The idea can be applied with few multiplications,”CoRR, vol. abs/1510.03009, 2015. + [18]S. J. Hanson and L. Y. 
Pratt, “Comparing biases for minimal network to make CNNs more applicable for different platforms. The construction with back-propagation,” inAdvances in Neural Information + work in [84] proposed a one-shot whole network compression Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177–185. + scheme consisting of three components: rank selection, low- [19]Y. L. Cun, J. S. Denker, and S. A. Solla, “Advances in neural information + processing systems 2,” D. S. Touretzky, Ed., 1990, ch. Optimal Brain rank tensor decomposition, and fine-tuning to make deep Damage, pp. 598–605. + CNNs work in mobile devices. [20]B. Hassibi, D. G. Stork, and S. C. R. Com, “Second order derivatives + Despite the classification task, people are also adapting the for network pruning: Optimal brain surgeon,” inAdvances in Neural + Information Processing Systems 5. Morgan Kaufmann, 1993, pp. 164– compacted models in other tasks [85]–[87]. We would like to 171. IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 9 + + + + [21]S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural [43]T. S. Cohen and M. Welling, “Group equivariant convolutional net- + networks,” inProceedings of the British Machine Vision Conference works,”arXiv preprint arXiv:1602.07576, 2016. + 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. [44]S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural + 31.1–31.12. networks,” inAdvances In Neural Information Processing Systems, 2016, + [22]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and pp. 1082–1090. + connections for efficient neural networks,” inProceedings of the 28th [45]W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and + International Conference on Neural Information Processing Systems, ser. improving convolutional neural networks via concatenated rectified + NIPS’15, 2015. linear units,”arXiv preprint arXiv:1603.05201, 2016. + [23]W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Com- [46]H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in + pressing neural networks with the hashing trick.” JMLR Workshop and deep neural networks,”arXiv preprint arXiv:1604.00676, 2016. + Conference Proceedings, 2015. [47]S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic + [24]K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural symmetry in convolutional neural networks,” inProceedings of the + network compression,”CoRR, vol. abs/1702.04008, 2017. 33rd International Conference on International Conference on Machine + [25]V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain Learning - Volume 48, ser. ICML’16, 2016. + damage,” in2016 IEEE Conference on Computer Vision and Pattern [48]C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception- + Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, resnet and the impact of residual connections on learning.”CoRR, vol. + pp. 2554–2564. abs/1602.07261, 2016. + [26]H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact [49]B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, + cnns,” inEuropean Conference on Computer Vision, Amsterdam, the small, low power fully convolutional neural networks for real-time object + Netherlands, October 2016, pp. 662–677. detection for autonomous driving,”CoRR, vol. abs/1612.01051, 2016. + [27]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured [50]C. 
Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,”ˇ + sparsity in deep neural networks,” inAdvances in Neural Information inProceedings of the 12th ACM SIGKDD International Conference on + Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, Knowledge Discovery and Data Mining, ser. KDD ’06, 2006, pp. 535– + I. Guyon, and R. Garnett, Eds., 2016, pp. 2074–2082. 541. + [28]H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning [51]J. Ba and R. Caruana, “Do deep nets really need to be deep?” in + filters for efficient convnets,”CoRR, vol. abs/1608.08710, 2016. Advances in Neural Information Processing Systems 27: Annual Confer- + [29]V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for ence on Neural Information Processing Systems 2014, December 8-13 + small-footprint deep learning,” inAdvances in Neural Information Pro- 2014, Montreal, Quebec, Canada, 2014, pp. 2654–2662. + cessing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, [52]G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a + and R. Garnett, Eds., 2015, pp. 3088–3096. neural network,”CoRR, vol. abs/1503.02531, 2015. + [30]Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. [53]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and + Chang, “An exploration of parameter redundancy in deep networks with Y. Bengio, “Fitnets: Hints for thin deep nets,”CoRR, vol. abs/1412.6550, + circulant projections,” inInternational Conference on Computer Vision 2014. + (ICCV), 2015. [54]A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, + [31]Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and “Bayesian dark knowledge,” inAdvances in Neural Information Process- + S. Chang, “Fast neural networks with circulant projections,”CoRR, vol. ing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, + abs/1502.03436, 2015. and R. Garnett, Eds., 2015, pp. 3420–3428. + [32]Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, [55]P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression + and Z. Wang, “Deep fried convnets,” inInternational Conference on by distilling knowledge from neurons,” inProceedings of the Thirtieth + Computer Vision (ICCV), 2015. AAAI Conference on Artificial Intelligence, February 12-17, 2016, + [33]J. Chun and T. Kailath,Generalized Displacement Structure for Block- Phoenix, Arizona, USA., 2016, pp. 3560–3566. + Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidel- [56]T. Chen, I. J. Goodfellow, and J. Shlens, “Net2net: Accelerating learning + berg: Springer Berlin Heidelberg, 1991, pp. 215–236. via knowledge transfer,”CoRR, vol. abs/1511.05641, 2015. + [34]M. V. Rakhuba and I. V. Oseledets, “Fast multidimensional convolution [57]S. Zagoruyko and N. Komodakis, “Paying more attention to attention: + in low-rank tensor formats via cross approximation,”SIAM J. Scientific Improving the performance of convolutional neural networks via atten- + Computing, vol. 37, no. 2, 2015. tion transfer,”CoRR, vol. abs/1612.03928, 2016. + [35]M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, “Acdc: [58]D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by + A structured efficient linear layer,” inInternational Conference on jointly learning to align and translate,”CoRR, vol. abs/1409.0473, 2014. + Learning Representations (ICLR), 2016. [59]A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and + [36]R. Rigamonti, A. Sironi, V. Lepetit, and P. 
Fua, “Learning separable A. C. Courville, “Dynamic capacity networks,” inProceedings of the + filters,” in2013 IEEE Conference on Computer Vision and Pattern 33nd International Conference on Machine Learning, ICML 2016, New + Recognition, Portland, OR, USA, June 23-28, 2013, 2013, pp. 2754– York City, NY, USA, June 19-24, 2016, 2016, pp. 2549–2558. + 2761. [60]N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, + [37]E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, and J. Dean, “Outrageously large neural networks: The sparsely-gated + “Exploiting linear structure within convolutional networks for efficient mixture-of-experts layer,” 2017. + evaluation,” inAdvances in Neural Information Processing Systems 27, [61]D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and + Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. J. Odobez, “Deep dynamic neural networks for multimodal gesture + Weinberger, Eds., 2014, pp. 1269–1277. segmentation and recognition,”IEEE Trans. Pattern Anal. Mach. Intell., + [38]M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional vol. 38, no. 8, pp. 1583–1597, 2016. + neural networks with low rank expansions,” inProceedings of the British [62]C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, + Machine Vision Conference. BMVA Press, 2014. V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” + [39]V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempit- inComputer Vision and Pattern Recognition (CVPR), 2015. + sky, “Speeding-up convolutional neural networks using fine-tuned cp- [63]G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger,Deep + decomposition,”CoRR, vol. abs/1412.6553, 2014. Networks with Stochastic Depth, 2016. + [40]C. Tai, T. Xiao, X. Wang, and W. E, “Convolutional neural networks [64]Y. Yamada, M. Iwamura, and K. Kise, “Deep pyramidal residual + with low-rank regularization,” vol. abs/1511.06067, 2015. networks with separated stochastic depth,”CoRR, vol. abs/1612.01230, + [41]M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, 2016. + “Predicting parameters in deep learning,” in Advances in Neural [65]Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and + Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, R. Feris, “Blockdrop: Dynamic inference paths in residual networks,” + Z. Ghahramani, and K. Weinberger, Eds., 2013, pp. 2148–2156. inCVPR, 2018. + [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper [66]A. Veit and S. Belongie, “Convolutional networks with adaptive infer- + files/nips26/1053.pdf ence graphs,” 2018. + [42]T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramab- [67]M. Mathieu, M. Henaff, and Y. Lecun,Fast training of convolutional + hadran, “Low-rank matrix factorization for deep neural network training networks through FFTs, 2014. + with high-dimensional output targets,” inin Proc. IEEE Int. Conf. on [68]A. Lavin and S. Gray, “Fast algorithms for convolutional neural net- + Acoustics, Speech and Signal Processing, 2013. works,” in2016 IEEE Conference on Computer Vision and Pattern IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 10 + + + + Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, [89]L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong, + pp. 4013–4021. M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X. + [69]S. Zhai, H. Wu, A. 
Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he was a Research Staff Member at IBM T.J. Watson Research Center. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in deep generative models, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR, and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. His research interests are in deep learning, particularly few-shot learning and deep generative models. He also works on applications in computer vision and robotic vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of Electronic Information and Communications, HUST, Wuhan, China. He received his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interests include big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University, where he serves as Associate Dean of the School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.
\ No newline at end of file
diff --git a/Corpus/A guide to convolution arithmetic for deep learning.txt b/Corpus/A guide to convolution arithmetic for deep learning.txt new file mode 100644 index 0000000..a47ff7f Binary files /dev/null and b/Corpus/A guide to convolution arithmetic for deep learning.txt differ
diff --git a/Corpus/Analysis and Design of Echo State Networks.txt b/Corpus/Analysis and Design of Echo State Networks.txt new file mode 100644 index 0000000..ec72712 --- /dev/null +++ b/Corpus/Analysis and Design of Echo State Networks.txt @@ -0,0 +1,1298 @@
 LETTER Communicated by Herbert Jaeger

 Analysis and Design of Echo State Networks

 Mustafa C.
Ozturk + can@cnel.ufl.edu + Dongming Xu + dmxu@cnel.ufl.edu + JoseC.Pr´ ´ıncipe + principe@cnel.ufl.edu + Computational NeuroEngineering Laboratory, Department of Electrical and + Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A. + + + The design of echo state network (ESN) parameters relies on the selec- + tion of the maximum eigenvalue of the linearized system around zero + (spectral radius). However, this procedure does not quantify in a sys- + tematic manner the performance of the ESN in terms of approximation + error. This article presents a functional space approximation framework + to better understand the operation of ESNs and proposes an information- + theoretic metric, the average entropy of echo states, to assess the richness + of the ESN dynamics. Furthermore, it provides an interpretation of the + ESN dynamics rooted in system theory as families of coupled linearized + systems whose poles move according to the input signal dynamics. With + this interpretation, a design methodology for functional approximation + is put forward where ESNs are designed with uniform pole distributions + covering the frequency spectrum to abide by the richness metric, irre- + spective of the spectral radius. A single bias parameter at the ESN input, + adapted with the modeling error, configures the ESN spectral radius to + the input-output joint space. Function approximation examples compare + the proposed design methodology versus the conventional design. + + + 1 Introduction + + Dynamic computational models require the ability to store and access the + time history of their inputs and outputs. The most common dynamic neural + architecture is the time-delay neural network (TDNN) that couples delay + lines with a nonlinear static architecture where all the parameters (weights) + are adapted with the backpropagation algorithm. The conventional delay + line utilizes ideal delay operators, but delay lines with local first-order re- + cursive filters have been proposed by Werbos (1992) and extensively stud- + ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, + 1993). Chains of first-order integrators are interesting because they effec- + tively decrease the number of delays necessary to create time embeddings + + + Neural Computation19, 111–138(2007) C 2006 Massachusetts Institute of Technology 112 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + (Principe, 2001). Recurrent neural networks (RNNs) implement a differ- + ent type of embedding that is largely unexplored. RNNs are perhaps the + most biologically plausible of the artificial neural network (ANN) models + (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), + but are not well understood theoretically (Siegelmann & Sontag, 1991; + Siegelmann, 1993; Kremer, 1995). One of the main practical problems with + RNNs is the difficulty to adapt the system weights. Various algorithms, + such as backpropagation through time (Werbos, 1990) and real-time recur- + rent learning (Williams & Zipser, 1989), have been proposed to train RNNs; + however, these algorithms suffer from computational complexity, resulting + in slow training, complex performance surfaces, the possibility of instabil- + ity, and the decay of gradients through the topology and time (Haykin, + 1998). The problem of decaying gradients has been addressed with spe- + cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). 
Alter- + native second-order training methods based on extended Kalman filtering + (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, + Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp + et al., 1998) provide more reliable performance and have enabled practical + applications in identification and control of dynamical systems (Kechri- + otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, + Kambhampati, & Warwick, 1995). + Recently,twonewrecurrentnetworktopologieshavebeenproposed:the + echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and + the liquid state machine (LSM) by Maass (Maass, Natschlager, & Markram,¨ + 2002). ESNs possess a highly interconnected and recurrent topology of + nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) + and contain information about the history of input and output patterns. + The outputs of these internal PEs (echo states) are fed to a memoryless but + adaptive readout network (generally linear) that produces the network out- + put. The interesting property of ESN is that only the memoryless readout is + trained, whereas the recurrent topology has fixed connection weights. This + reduces the complexity of RNN training to simple linear regression while + preserving a recurrent topology, but obviously places important constraints + in the overall architecture that have not yet been fully studied. Similar ideas + have been explored independently by Maass and formalized in the LSM + architecture. LSMs, although formulated quite generally, are mostly im- + plemented as neural microcircuits of spiking neurons (Maass et al., 2002), + whereas ESNs are dynamical ANN models. Both attempt to model biolog- + ical information processing using similar principles. We focus on the ESN + formulation in this letter. + The echo state condition is defined in terms of the spectral radius (the + largest among the absolute values of the eigenvalues of a matrix, denoted + by·) of the reservoir’s weight matrix (W<1). This condition states + that the dynamics of the ESN is uniquely controlled by the input, and the + effect of the initial states vanishes. The current design of ESN parameters Analysis and Design of Echo State Networks 113 + + + relies on the selection of spectral radius. However, there are many possible + weight matrices with the same spectral radius, and unfortunately they do + not all perform at the same level of mean square error (MSE) for functional + approximation. A similar problem exists in the design of the LSM. LSMs + have been shown to possess universal approximation given the separation + property (SP) for the liquid (reservoir in ESNs) and the approximation + property (AP) for the readout (Maass et al., 2002). SP is quantified by a + kernel-quality measure proposed in Maass, Legenstein, and Bertschinger + (2005) that is based on the rank of a matrix formed by the system states + corresponding to different input signals. The kernel quality is a measure + for the complexity and diversity of nonlinear operations carried out by the + liquid on its input stream in order to boost the classification power of a + subsequent linear decision hyperplane (Maass et al., 2005). A variation of + SP has been proposed in Bertschinger and Natschlager (2004), and it has¨ + been argued that complex calculations can be best carried out by networks + on the boundary between ordered and chaotic dynamics. 
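As a concrete illustration of the echo state condition discussed above, the following minimal NumPy sketch (ours, not the authors'; the sparsity and target radius values are arbitrary assumptions) builds a random reservoir matrix and rescales it so that its spectral radius stays below one:

```python
import numpy as np

def make_reservoir(n_units, sparsity=0.1, spectral_radius=0.9, seed=0):
    """Build a sparse random reservoir matrix and rescale it so that its
    largest absolute eigenvalue (spectral radius) equals `spectral_radius`."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    W *= rng.random((n_units, n_units)) < sparsity    # sparsify the connections
    radius = np.max(np.abs(np.linalg.eigvals(W)))     # current spectral radius
    return W * (spectral_radius / radius)             # enforce the echo state condition

W = make_reservoir(100)
print(np.max(np.abs(np.linalg.eigvals(W))))           # ~0.9
```

Scaling by the ratio of the desired to the measured spectral radius is the usual way to enforce the condition for a randomly generated reservoir; as the letter argues below, however, the spectral radius alone does not determine approximation performance.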
+ Inthisletter,weareinterestedinstudyingtheESNforfunctionalapprox- + imation (filters that map input functionsu(·) of time on output functionsy(·) + of time). We see two major shortcomings with the current ESN approach + that uses echo state condition as a design principle. First, the impact of fixed + reservoir parameters for function approximation means that the informa- + tion about the desired response is conveyed only to the output projection. + This is not optimal, and strategies to select different reservoirs for different + applications have not been devised. Second, imposing a constraint only on + the spectral radius is a weak condition to properly set the parameters of + the reservoir, as experiments show (different randomizations with the same + spectral radius perform differently for the same problem; see Figure 2). + This letter aims to address these two problems by proposing a frame- + work, a metric, and a design principle for ESNs. The framework is a signal + processing interpretation of basis and projections in functional spaces to + describe and understand the ESN architecture. According to this interpre- + tation, the ESN states implement a set of basis functionals (representation + space) constructed dynamically by the input, while the readout simply + projects the desired response onto this representation space. The metric + to describe the richness of the ESN dynamics is an information-theoretic + quantity, the average state entropy (ASE). Entropy measures the amount of + information contained in a given random variable (Shannon, 1948). Here, + the random variable is the instantaneous echo state from which the en- + tropy for the overall state (vector) is estimated. The probability density + function (pdf) in a differential geometric framework should be thought of + as a volume form; that is, in our case, the pdf of the state vector describes + the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) + established information as a coordinate free metric in the state manifold. + Therefore, entropy becomes a global descriptor of information that quanti- + fies the volume of the manifold defined by the random variable. Due to the 114 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + time dependency of the states, the state entropy averaged over time (ASE) + is an appropriate estimate of the volume of the state manifold. + The design principle specifies that one should consider independently + thecorrelationamongthebasisandthespectralradius.Intheabsenceofany + information about the desired response, the ESN states should be designed + with the highest ASE, independent of the spectral radius. We interpret the + ESN dynamics as a combination of time-varying linear systems obtained + from the linearization of the ESN nonlinear PE in a small, local neighbor- + hood of the current state. The design principle means that the poles of the + linearized ESN reservoir should have uniform pole distributions to gener- + ate echo states with the most diverse pole locations (which correspond to + the uniformity of time constants). Effectively, this will create the least cor- + related bases for a given spectral radius, which corresponds to the largest + volume spanned by the basis set. When the designer has no other informa- + tion about the desired response to set the basis, this principle distributes + the system’s degrees of freedom uniformly in space. It approximates for + ESNs the well-known property of orthogonal basis. 
The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space in the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.

2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u_1(n), u_2(n), ..., u_M(n)]^T, of the internal units x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T, and of the output units y(n) = [y_1(n), y_2(n), ..., y_L(n)]^T. The connection weights are given in an N x M weight matrix W^in = (w^in_ij) for connections between the input and the internal PEs, in an N x N matrix W = (w_ij) for connections between the internal PEs, in an L x N matrix W^out = (w^out_ij) for connections from the PEs to the output units, and in an N x L matrix W^back = (w^back_ij) for the connections that project back from the output to the internal PEs (Jaeger, 2001). The activation of the internal PEs (echo state) is updated according to

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),     (2.1)

where f = (f_1, f_2, ..., f_N) are the internal PEs' activation functions. Here, all f_i's are hyperbolic tangent functions, f_i(x) = (e^x - e^-x) / (e^x + e^-x). The output from the readout network is computed according to

y(n+1) = f^out(W^out x(n+1)),     (2.2)

where f^out = (f^out_1, f^out_2, ..., f^out_L) are the output units' nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so f^out is the identity.

[Figure 1: An echo state network (ESN), drawn as an input layer (W^in), a dynamical reservoir (W, with feedback W^back), and a read-out (W^out). The ESN is composed of two parts: a fixed-weight (||W|| < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to produce the output.]

ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights.
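For concreteness, equations 2.1 and 2.2 can be written as a short NumPy routine. This is a minimal sketch under the simplifications used in most of the experiments below (tanh PEs, identity readout nonlinearity, and no output feedback, i.e., W^back = 0); the function names are ours, not the authors'.

```python
import numpy as np

def run_esn(W_in, W, inputs, x0=None):
    """Iterate the echo states x(n+1) = tanh(W_in u(n+1) + W x(n))  (eq. 2.1
    with W_back = 0) and return the state trajectory, one row per time step."""
    N = W.shape[0]
    x = np.zeros(N) if x0 is None else x0
    states = []
    for u in inputs:                      # inputs: array of shape (T, M)
        x = np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.asarray(states)             # shape (T, N)

def readout(W_out, states):
    """Linear readout y(n+1) = W_out x(n+1)  (eq. 2.2 with identity f_out)."""
    return states @ W_out.T
```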
We submit + that the ideas of approximation theory in functional spaces (bases and pro- + jections), so useful in adaptive signal processing (Principe, 2001), should + be utilized to understand the ESN architecture. Leth(u(t)) be a real-valued + function of a real-valued vector + + u(t)=[u1 (t),u2 (t),...,uM (t)] T . + + In functional approximation, the goal is to estimate the behavior ofh(u(t)) + as a combination of simpler functionsϕi (t), called the basis functionals, + such that its approximant,hˆ(u(t)), is given by + + N + hˆ(u(t))= ai ϕi (t). + i=1 + + Here,ai ’s are the projections ofh(u(t)) onto each basis function. One of + the central questions in practical functional approximation is how to choose + the set of bases to approximate a given desired signal. In signal processing, + thechoicenormallygoesforacompletesetoforthogonalbasis,independent + of the input. When the basis set is complete and can be made as large + as required, fixed bases work wonders (e.g., Fourier decompositions). In + neural computing, the basic idea is to derive the set of bases from the + input signal through a multilayered architecture. For instance, consider a + single hidden layer TDNN withNPEs and a linear output. The hidden- + layer PE outputs can be considered a set of nonorthogonal basis functionals + dependent on the input, +   + + ϕi (u(t))=g bij uj (t). + j + + bij ’s are the input layer weights, andgis the PE nonlinearity. The approxi- + mation produced by the TDNN is then + + N + h ˆ(u(t))= ai ϕi (u(t)), (2.3) + i=1 + + whereai ’s are the weights of the output layer. Notice that thebij ’s adapt + the bases and theai ’s adapt the projection in the projection space. Here the + goal is to restrict the number of bases (number of hidden layer PEs) because + their number is coupled with the number of parameters to adapt, which + has an impact on generalization and training set size, for example. Usually, Analysis and Design of Echo State Networks 117 + + + since all of the parameters of the network are adapted, the best basis in the + joint (input and desired signals) space as well as the best projection can be + achieved and represents the optimal solution. The output of the TDNN is + a linear combination of its internal representations, but to achieve a basis + set (even if nonorthogonal), linear independence among theϕi (u(t))’s must + be enforced. Ito, Shah and Pon, and others have shown that this is indeed + the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside + the scope of this article. + The ESN (and the RNN) architecture can also be studied in this frame- + work. The states of equation 2.1 correspond to the basis set, which are + recursively computed from the input, output, and previous states through + Win ,W,andWback . Notice, however, that none of these weight matrices is + adapted, that is, the functional bases in the ESN are uniquely defined by the + input and the initial selection of weights. In a sense, ESNs are trading the + adaptive connections in the RNN hidden layer by a brute force approach + of creating fixed diversified dynamics in the hidden layer. + For an ESN with a linear readout network, the output equation (y(n+ + 1)=Wout x(n+1)) has the same form of equation 2.3, where theϕi ’s and + ai ’s are replaced by the echo states and the readout weights, respectively. + The readout weights are adapted in the training data, which means that the + ESN is able to find the optimal projection in the projection space, just like + the RNN or the TDNN. 
A similar perspective of basis and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands".

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or -1 with equal probabilities, and W^back is set to zero. Input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout. One method to determine the optimal output weight matrix, W^out, in the mean square error (MSE) sense (where MSE is defined by O = (1/2)(d - y)^T (d - y)) is to use the Wiener solution given by Haykin (2001):

W^out = E[x x^T]^{-1} E[x d] ≈ [ (1/N) Σ_n x(n) x(n)^T ]^{-1} [ (1/N) Σ_n x(n) d(n) ].     (2.4)

Here, E[.] denotes the expected value operator, and d denotes the desired signal.

[Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9 x 10^-9 to 8.9 x 10^-5. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.]

Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is 5.9 x 10^-9, whereas the maximum MSE is 8.9 x 10^-5. This experiment demonstrates that a design strategy that is based solely on the spectral radius is not sufficient to specify the system architecture for function approximation.
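The readout training of equation 2.4 and one realization of the Figure 2 experiment can be sketched as follows (a rough reconstruction, not the authors' code; the random seed, the reading of the input as sin(2πn/(10π)), and the small ridge term added for numerical stability are our own choices):

```python
import numpy as np

def train_readout(states, desired, ridge=1e-8):
    """Wiener solution W_out = E[x x^T]^{-1} E[x d]  (eq. 2.4), estimated
    from time averages; the ridge term regularizes the inversion (our addition)."""
    R = states.T @ states / len(states)      # ~ E[x x^T]
    p = states.T @ desired / len(states)     # ~ E[x d]
    return np.linalg.solve(R + ridge * np.eye(R.shape[0]), p)

# One realization of the Figure 2 experiment (values taken from the text).
rng = np.random.default_rng(0)
N, T = 100, 300
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])  # spectral radius ~ 0.88
W_in = rng.choice([1.0, -1.0], size=(N, 1))
n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))     # input signal (as we read it from the text)
d = u ** 7                                   # desired: seventh power of the input
states = run_esn(W_in, W, u[:, None])        # run_esn from the sketch above
w_out = train_readout(states, d)
print(np.mean((states @ w_out - d) ** 2))    # training MSE for this realization
```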
This shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by

x(n+1) = f(W^in u(n+1) + W x(n)).

Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n+1), defined by

J(n+1) = [ f'(net_1(n)) w_11   f'(net_1(n)) w_12   ...   f'(net_1(n)) w_1N
           f'(net_2(n)) w_21   f'(net_2(n)) w_22   ...   f'(net_2(n)) w_2N
           ...                 ...                 ...   ...
           f'(net_N(n)) w_N1   f'(net_N(n)) w_N2   ...   f'(net_N(n)) w_NN ]

       = diag( f'(net_1(n)), f'(net_2(n)), ..., f'(net_N(n)) ) · W = F(n) · W.     (2.5)

Here, net_i(n) is the ith entry of the vector (W^in u(n+1) + W x(n)), and w_ij denotes the (i,j)th entry of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the Jacobian matrix J(n+1).^1 As the amplitude of each PE changes, the local slope changes, and so the poles of the linearized system are time varying, although the parameters of the ESN are fixed.

^1 The transfer function of a linear system x(n+1) = A x(n) + B u(n) is X(z)/U(z) = (zI - A)^{-1} B = Adjoint(zI - A) B / det(zI - A). The poles of the transfer function can be obtained by solving det(zI - A) = 0. The solution corresponds to the eigenvalues of A.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and -0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or -1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to their linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems.
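The linearization of equation 2.5 is straightforward to compute numerically: at every time step the local slopes f'(net_i(n)) form the diagonal matrix F(n), and the eigenvalues of J(n+1) = F(n)W are the instantaneous poles. A minimal sketch, assuming tanh PEs and no output feedback:

```python
import numpy as np

def pole_track(W_in, W, inputs, x0=None):
    """Return the eigenvalues (poles) of the linearized ESN, J(n+1) = F(n) W,
    at every time step.  For tanh PEs, f'(net) = 1 - tanh(net)^2."""
    N = W.shape[0]
    x = np.zeros(N) if x0 is None else x0
    poles = []
    for u in inputs:
        net = W_in @ u + W @ x                    # net_i(n)
        F = np.diag(1.0 - np.tanh(net) ** 2)      # local slopes f'(net_i(n))
        poles.append(np.linalg.eigvals(F @ W))    # poles of the linearized system
        x = np.tanh(net)                          # advance the state (eq. 2.1)
    return np.asarray(poles)                      # shape (T, N), complex-valued
```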
+ Similar results can be obtained using signals of different shapes at the ESN + input. + A key corollary of the above analysis is that the spectral radius of an + ESN can be adjusted using a constant bias signal at the ESN input without + changing the recurrent connection matrix,W. The application of a nonzero + constant bias will move the operating point to regions of the sigmoid func- + tion closer to saturation and always decrease the spectral radius due to the + shape of the nonlinearity. 2 The relevance of bias in terms of overall system + performance has also been discussed in Jaeger (2002b) and Bertschinger + and Natschlager (2004), but here we approach it from a system theory per-¨ + spective and explain its effect on reservoir dynamics. + + 3 Average State Entropy as a Measure of the Richness of ESN Reservoir + + Previous research was aware of the influence of diversity of the recurrent + layer outputs on the overall performance of ESNs and LSMs. Several met- + rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al., + + + 2 AssumeWhas nondegenerate eigenvalues and corresponding linearly independent + eigenvectors. Then consider the eigendecomposition ofW,whereW=PDP −1 ,Pis the + eigenvectormatrixandDisthediagonalmatrixofeigenvalues(Dii )ofW.SinceF(n)andD + are diagonal,J(n+1)=F(n)W=F(n)(PDP −1 )=P(F(n)D)P−1 is the eigendecomposition + ofJ(n+1). Here, each entry ofF(n)D,f (net(n))Dii , is an eigenvalue ofJ. Therefore, + |f (net(n))Dii |≤|Dii |sincef (net i )≤f (0). Analysis and Design of Echo State Networks 121 + + + (A) 1 (B) 1 + D0.8 0.8 + 0.6 C 0.6 + 0.4 0.4 + + + + Imaginary + Amplitude 0.2 0.2 + 0 B E 0 + -0.2 -0.2 + -0.4 -0.4 + -0.6 -0.6 + -0.8 -0.8 + -1 -1 0 20 40 60 80 100 -1 -0.5 Real 0 0.5 1 Time + (C) 1 (D) 1 + 0.8 0.8 + 0.6 0.6 + 0.4 0.4 + + + + Imaginary 0.2 + + + Imaginary 0.2 + 0 0 + -0.2 -0.2 + -0.4 -0.4 + -0.6 -0.6 + -0.8 -0.8 + -1 -1-1 -0.5 Real 0 0.5 1 -1 -0.5 Real 0 0.5 1 + + (E) 1 (F) 1 + 0.8 0.8 + 0.6 0.6 + 0.4 0.4 + + + + Imaginary 0.2 + + + Imaginary 0.2 + 0 0 + -0.2 -0.2 + -0.4 -0.4 + -0.6 -0.6 + -0.8 -0.8 + -1 -1-1 -0.5 Real 0 0.5 1 -1 -0.5 Real 0 0.5 1 + + Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input + goes through a cycle. An ESN with fixed parameters implements a combination + of linear systems with varying pole locations. (A) One cycle of sinusoidal signal + with a period of 100. (B–E) The positions of poles of the linearized systems + when the input values are at B, C, D, and E in Figure 5A. (F) The cumulative + pole locations show the movement of the poles as the input changes. Due to + the varying pole locations, different time constants modulate the richness of + the reservoir of dynamics as a function of input amplitude. Higher-amplitude + signals tend to saturate the nonlinear function and cause the poles to shrink + toward the origin of thez-plane (decreases the spectral radius), which results in + a system with a large stability margin. When the input is close to zero, the poles + ofthelinearizedESNareclosetothemaximalspectralradiuschosen,decreasing + the stability margin. An ESN with more states results in a detailed coverage of + thez-plane dynamics, which illustrates the power of nonlinear systems, when + compared to their linear counterpart. 122 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + 2005). Here, our approach of bases and projections leads to a new metric. + We propose the instantaneous state entropy to quantify the distribution of + instantaneous amplitudes across the ESN states. 
Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous value of the ESN states. If the echo states' instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi's entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable X with a pdf f_X(x) is given by Renyi (1970):

H_γ(X) = (1 / (1 - γ)) log E[f_X^{γ-1}(X)].

Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

f_X(x) = (1/N) Σ_{i=1}^{N} K_σ(x - x_i),

where K_σ is the kernel function with the kernel size σ. Then Renyi's quadratic entropy can be estimated by (Principe et al., 2000)

H_2(X) = -log( (1/N^2) Σ_j Σ_i K_σ(x_j - x_i) ).     (3.1)

The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T of an ESN with N internal PEs. Results will be shown with a gaussian kernel with kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter to quantify the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii and even with the same spectral radius display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states as we would expect, since state entropy is dependent on the input signal that also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states.
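Equation 3.1 with a gaussian kernel translates directly into the following estimator sketch; the kernel size is 0.3 of the standard deviation of the state entries, as stated above, while the gaussian normalization constant is our own choice (it only shifts the entropy estimate by a constant):

```python
import numpy as np

def renyi_quadratic_entropy(samples, kernel_frac=0.3):
    """Estimate H2(X) = -log( (1/N^2) sum_ij K_sigma(x_j - x_i) )  (eq. 3.1)
    with a gaussian kernel of size `kernel_frac` * std of the samples."""
    x = np.asarray(samples, dtype=float)
    sigma = max(kernel_frac * np.std(x), 1e-12)     # guard against a degenerate state vector
    diffs = x[:, None] - x[None, :]
    K = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(np.mean(K))                      # np.mean(K) = (1/N^2) * double sum

def average_state_entropy(states):
    """ASE: instantaneous state entropy (across the N PEs) averaged over time."""
    return np.mean([renyi_quadratic_entropy(x_n) for x_n in states])
```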
+ In practice, to quantify the overall representation ability over time, we will + use ASE, which takes values−0.735,−0.007, and 0.335 for the spectral + radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral + radius, several ASEs are possible. Figure 4C shows ASEs from 50 different + realizations of ESNs with the same spectral radius of 0.5, which means that + ASE is a finer descriptor of the dynamics of the reservoir. Although we + have presented an experiment with sinusoidal signal, similar results are + obtained for other inputs as long as the input dynamic range is properly + selected. + Maximizing ASE means that the diversity of the states over time is the + largest and should provide a basis set that is as uncorrelated as possible. + This condition is unfortunately not a guarantee that the ESN so designed + will perform the best, because the basis set in ESNs is created independent + of the desired response and the application may require a small spectral + radius. However, we maintain that when the desired response is not ac- + cessible for the design of the ESN bases or when the same reservoir is + to be used for a number of problems, the default strategy should be to + maximize the ASE of the state vector. The following section addresses + the design of ESNs with high ASE values and a simple mechanism to + adjust the reservoir dynamics without changing the recurrent connection + weights. + + 4 Designing Echo State Networks + + 4.1 Design of the Echo State Recurrent Connections.According to the + interpretation of ESNs as coupled linear systems, the design of the internal 124 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + connection matrix,W, will be based on the distribution of the poles of the + linearized system around zero state. Our proposal is to design the ESN + such that the linearized system has uniform pole distribution inside the + unit circle of thez-plane. With this design scenario, the system dynamics + will include uniform coverage of time constants arising from the uniform + distribution of the poles, which also decorrelates as much as possible the + basis functionals. This principle was chosen by analogy to the identification + oflinearsystemsusingKautzfilters(Kautz,1954),whichshowsthatthebest + approximation of a given transfer function by a linear system with finite + order is achieved when poles are placed in the neighborhood of the spectral + resonances. When no information is available about the desired response, + we should uniformly spread the poles to anticipate good approximation to + arbitrary mappings. + We again use a maximum entropy principle to distribute the poles inside + the unit circle uniformly. The constraints of a circle as boundary conditions + for discrete linear systems and complex conjugate locations are easy to + include for the pole distribution (Thogula, 2003). The poles are first initial- + ized at random locations; the quadratic Renyi’s entropy is calculated by + equation 3.1, and poles are moved such that the entropy of the new dis- + tribution is increased over iterations (Erdogmus & Principe, 2002). This + method is efficient to find uniform coverage of the unit circle with an arbi- + trary number of poles. The system with the uniform pole locations can be + interpreted using linear system theory. The poles that are close to the unit + circle correspond to many sharp bandpass filters specializing in different + frequency regions, whereas the inner poles realize filters of larger frequency + support. 
Moreover, different orientations (angles) of the poles create filters + of different center frequencies. + Now the problem is to construct an internal weight matrix from the pole + locations (eigenvalues ofW). In principle, we would like to create a sparse + + + + + Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs + ofechostates(100PEs)producedbyESNswithspectralradiusof0.2,0.5,and0.8, + from top to bottom, respectively. The diversity of echo states increases when the + spectral radius increases. Within the dynamic range of the echo states, systems + with smaller spectral radius can generate only uneven representations, while + forW=0.8, outputs of echo states almost uniformly distribute within their + dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1. + Information contained in the echo states is changing over time according to the + input amplitude. Therefore, the richness of representation is controlled by the + input amplitude. Moreover, the value of ASE increases with spectral radius. + (C) ASEs from 50 different realizations of ESNs with the same spectral radius + of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the + reservoir than the spectral radius. Analysis and Design of Echo State Networks 125 + + + (A) Echo States1 + 0 + - 10 20 40 60 801001201401601802001 + 0 + - 10 20 40 60 801001201401601802001 + 0 + - 10 20 40 60 80100120140160180200Time + (B) State Entropy1.5 Spectral Radius = 0.2 + 1 Spectral Radius = 0.5 Spectral Radius = 0.8 + 0.5 + 0 + - 0.5 + - 1 + - 1.5 + - 2 + - 2.50 50 100 150 200Time + (C) Different ASEs for the same spectral radius0.3 + + 0.25 + + 0.2 + + ASE0.15 + + 0.1 + + 0.050 10 20 30 40 50 + Trials 126 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + matrix, so we started with the sparsest matrix (with an inverse), which is + the direct canonical structure given by (Kailath, 1980) + +  −a1 −a2 ···−aN−1 −aN +  10··· 00  W= 01··· 00   . (4.1) + ··· ··· ··· ··· ··· + 00··· 10 + + The characteristic polynomial ofWis + + l(s)=det(sI−W)=sN +a N−11 s +a2 sN−2 +aN + =(s−p1 )(s−p2 )···(s−pN ), (4.2) + + wherepi ’s are the eigenvalues andai ’s are the coefficients of the character- + istic polynomial ofW. Here, we know the pole locations of the linear system + obtained from the linearization of the ESN, so using equation 4.2, we can + obtain the characteristic polynomial and constructWmatrix in the canon- + ical form using equation 4.1. We will call the ESN constructed based on + the uniform pole principle ASE-ESN. All other possible solutions with the + same eigenvalues can be obtained byQ−1 WQ,whereQis any nonsingular + matrix. + To corroborate our hypothesis, we would like to show that the linearized + ESN designed with the recurrent weight matrix having the eigenvalues + uniformly distributed inside the unit circle creates higher ASE values for a + given spectral radius compared to other ESNs with random internal con- + nection weight matrices. We will consider an ESN with 30 states and use our + procedure to create theWmatrix for ASE-ESN for different spectral radii + between [0.1, 0.95]. Similarly, we constructed ESNs with sparse randomW + matrices with different sparseness constraints. This corresponds to a weight + distribution having the values 0,cand−cwith probabilitiesp1 ,(1−p1 )/2, + and (1−p1 )/2, wherep1 defines the sparseness ofWandcis a constant + that takes a specific value depending on the spectral radius. 
We also created + Wmatrices with values uniformly distributed between−1 and 1 (U-ESN) + and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then, + for differentWin matrices, we run the ASE-ESNs with the sinusoidal input + given in section 3 and calculate ASE. Figure 5 compares the ASE values + averaged over 1000 realizations. As observed from the figure, the ASE-ESN + with uniform pole distribution generates higher ASE on average for all + spectral radii compared to ESNs with sparse and uniform random connec- + tions. This approach is indeed conceptually similar to Jeffreys’ maximum + entropy prior (Jeffreys, 1946): it will provide a consistently good response + for the largest class of problems. Concentrating the poles of the linearized Analysis and Design of Echo State Networks 127 + + + 1 + ASEESN + 0.8 UESN + sparseness=0.2 + 0.6 sparseness=0.1 + sparseness=0.07 + 0.4 + + ASE 0.2 + + 0 + + - 0.2 + + - 0.40 0.2 0.4 0.6 0.8 1 + Spectral radius + + Figure 5: Comparison of ASE values obtained for ASE-ESN havingWwith + uniform eigenvalue distribution, ESNs with randomWmatrix, and U-ESN + with uniformly distributed weights between−1 and 1. Randomly generated + weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the + networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole + distribution generates a higher ASE on average for all spectral radii compared + to ESNs with random connections. + + + system in certain regions of the space provides good performance only if + the desired response has energy in this part of the space, as is well known + from the theory of Kautz filters (Kautz, 1954). + + 4.2 Design of the Adaptive Bias.In conventional ESNs, only the out- + put weights are trained, optimizing the projections of the desired response + onto the basis functions (echo states). Since the dynamical reservoir is fixed, + the basis functions are only input dependent. However, since function ap- + proximation is a problem in the joint space of the input and desired signals, + a penalty in performance will be incurred. From the linearization analysis + that shows the crucial importance of the operating point of the PE non- + linearity in defining the echo state dynamics, we propose to use a single + external adaptive bias to adjust the effective spectral radius of an ESN. No- + tice that according to linearization analysis, bias can reduce only spectral + radius. The information for adaptation of bias is the MSE in training, which + modulates the spectral radius of the system with the information derived + from the approximation error. With this simple mechanism, some informa- + tionfromtheinput-outputjointspaceisincorporatedinthedefinitionofthe + projection space of the ESN. The beauty of this method is that the spectral 128 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + radius can be adjusted by a single parameter that is external to the system + without changing reservoir weights. + The training of bias can be easily accomplished. Indeed, since the pa- + rameter space is only one-dimensional, a simple line search method can be + efficiently employed to optimize the bias. Among different line search al- + gorithms, we will use a search that uses Fibonacci numbers in the selection + of points to be evaluated (Wilde, 1964). The Fibonacci search method min- + imizes the maximum number of evaluations needed to reduce the interval + of uncertainty to within the prescribed length. 
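To make the two design steps concrete, the sketch below (our reconstruction, not the authors' code) first builds W in the direct canonical form of equations 4.1 and 4.2 from a chosen set of conjugate-symmetric poles, with uniform sampling over the disk standing in for the entropy-maximizing placement of section 4.1, and then shows how a single candidate bias value could be scored inside such a line search by retraining only the readout:

```python
import numpy as np

def reservoir_from_poles(poles):
    """Direct canonical (companion) form W (eq. 4.1) whose eigenvalues are the
    given poles; the poles must come in complex-conjugate pairs (plus real poles)
    so that the characteristic polynomial (eq. 4.2) has real coefficients."""
    a = np.real(np.poly(poles))       # coefficients [1, a1, ..., aN] of eq. 4.2
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -a[1:]                  # first row: -a1, -a2, ..., -aN
    W[1:, :-1] = np.eye(N - 1)        # shifted identity below the first row
    return W

def sample_uniform_disk_poles(N, radius=0.95, seed=0):
    """Stand-in for the maximum-entropy placement: conjugate pairs drawn
    uniformly (by area) from the disk of the given radius; N assumed even."""
    rng = np.random.default_rng(seed)
    r = radius * np.sqrt(rng.random(N // 2))
    theta = np.pi * rng.random(N // 2)
    p = r * np.exp(1j * theta)
    return np.concatenate([p, np.conj(p)])

def bias_mse(b, W_in, W, inputs, desired):
    """Score one candidate bias b: run x(n+1) = f(W_in u(n+1) + W_in b + W x(n)),
    train only the readout (eq. 2.4), and return the training MSE for the search."""
    states = run_esn(W_in, W, inputs + b)       # run_esn / train_readout are the
    w_out = train_readout(states, desired)      # helpers from the earlier sketches
    return np.mean((states @ w_out - desired) ** 2)
```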
In our problem, a bias value + is picked according to Fibonacci search. For each value of bias, training + data are applied to the ESN, and the echo states are calculated. Then the + corresponding optimal output weights and the objective function (MSE) + are evaluated to pick the next bias value. + Alternatively, gradient-based methods can be utilized to optimize the + bias, due to simplicity and low computational cost. System update equation + with an external bias signal,b,isgivenby + + x(n+1)=f(Win u(n+1)+Win b+Wx(n)). + + The update equation forbis given by + + ∂O(n+1) ∂x(n+1)=−e·Wout × (4.3)∂b ∂b ∂x(n)=−e·Wout × f˙(net n+1 )· W× +Win . (4.4)∂b + + Here,Ois the MSE defined previously. This algorithm may suffer from + similar problems observed in gradient-based methods in recurrent net- + works training. However, we observed that the performance surface is + rather simple. Moreover, since the search parameter is one-dimensional, + the gradient vector can assume only one of the two directions. Hence, im- + precision in the gradient estimation should affect the speed of convergence + but normally not change the correct gradient direction. + + 5 Experiments + + This section presents a variety of experiments in order to test the validity + of the ESN design scheme proposed in the previous section. + + 5.1 Short-TermMemoryCapacity.Thisexperimentcomparestheshort- + term memory (STM) capacity of ESNs with the same spectral radius using + the framework presented in Jaeger (2002a). Consider an ESN with a sin- + gle input signal,u(n), optimally trained with the desired signalu(n−k), + for a given delayk. Denoting the optimal output signalyk (n), thek-delay Analysis and Design of Echo State Networks 129 + + + STM capacity of a network,MC k , is defined as a squared correlation coef- + ficient betweenu(n−k)andyk (n) (Jaeger, 2002a). The STM capacity,MC, + of the network is defined as ∞ MC k=1 k . STM capacity measures how accu- + rately the delayed versions of the input signal are recovered with optimally + trained output units. Jaeger (2002a) has shown that the memory capacity + for recalling an independent and identically distributed (i.i.d.) input by an + Nunit RNN with linear output units is bounded byN. + We use ESNs with 20 PEs and a single input unit. ESNs are driven + by an i.i.d. random input signal,u(n), that is uniformly distributed over + [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions + of the input,u(n−1),...,u(n−40). We used four different ESNs: R-ESN, + U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN + used in Jaeger (2002a) where the entries ofWmatrix are set to 0, 0.47, + −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a + sparse connectivity of 20% and a spectral radius of 0.9. The entries ofWof + U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spec- + tral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed + with uniform poles. BASE-ESN has the same recurrent weight matrix as + ASE-ESN and an adaptive bias at its input. In each ESN, the input weights + are set to 0.1 or−0.1 with equal probability, and direct connections from the + input to the output are allowed, whereasWback is set to0(Jaeger, 2002a). + The echo states are calculated using equation 2.1 for 200 samples of the + input signal, and the first 100 samples corresponding to initial transient + are eliminated. Then the output weight matrix is calculated using equation + 2.4. 
For the BASE-ESN, the bias is trained for each task. All networks are + run with a test input signal, and the corresponding output andMC k are + calculated. Figure 6 shows thek-delay STM capacity (averaged over 100 + trials) of each ESN for delays 1,...,40 for the test signal. The STM capac- + ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, + and 16.90, respectively. First, ESNs with uniform pole distribution (ASE- + ESN and BASE-ESN) haveMCs that are much longer than the randomly + generated ESN given in Jaeger (2002a) in spite of all having the same spec- + tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical + maximumvalueofN=20.AcloserlookatthefigureshowsthatR-ESNper- + forms slightly better than ASE-ESN for delays less than 9. In fact, for small + k, large ASE degrades the performance because the tasks do not need long + memory depth. However, the drawback of high ASE for smallkis recov- + ered in BASE-ESN, which reduces the ASE to the appropriate level required + for the task. Overall, the addition of the bias to the ASE-ESN increases the + STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly + better STM compared to R-ESN with only three different weight values, + although it has more distinct weight values compared to R-ESN. It is also + significant to note that theMCwill be very poor for an ESN with smaller + spectral radius even with an adaptive bias, since the problem requires large + ASE and bias can only reduce ASE. This experiment demonstrates the 130 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + 1 RESN + UESN + ASEESN0.8 BASEESN + + + + + + + Memory Capacity 0.6 + + + 0.4 + + + 0.2 + + + 0 + 0 10 20 30 40 + Delay + + Figure 6: Thek-delay STM capacity of each ESN for delays 1,...,40 computed + using the test signal. The results are averaged over 100 different realizations of + each ESN type with the specifications given in the text for differentWandWin + matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are + 13.09, 13.55, 16.70, and 16.90, respectively. + + + suitability of maximizing ASE in tasks that require a substantial memory + length. + + 5.2 Binary Parity Check.The effect of the adaptive bias was marginal + in the previous experiment since the nature of the problem required large + ASE values. However, there are tasks in which the optimal solutions re- + quire smaller ASE values and smaller spectral radius. Those are the tasks + where the adaptive bias becomes a crucial design parameter in our design + methodology. + Consider an ESN with 100 internal units and a single input unit. ESN is + drivenbyabinaryinputsignal,u(n),thatassumesthevalues0or1.Thegoal + is to train an ESN to generate them-bit parity corresponding to lastmbits + received, wheremis 3,...,8. Similar to the previous experiments, we used + the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly + connected ESN where the entries ofWmatrix are set to 0, 0.06,−0.06 + with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse + connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN + are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 + with equal probability, and direct connections from the input to the output + are allowed whereasWback is set to 0. 
The echo states are calculated using + equation 2.1 for 1000 samples of the input signal, and the first 100 samples + correspondingtotheinitialtransientareeliminated.Thentheoutputweight Analysis and Design of Echo State Networks 131 + + + 350 + + 300 + + 250 + + + + + + + Wrong Decisions 200 + + 150 + + 100 + ASEESN50 RESN + BASEESN0 + 3 4 5 6 7 8 + m + + Figure 7: The number of wrong decisions made by each ESN form=3,...,8 + in the binary parity check problem. The results are averaged over 100 differ- + ent realizations of R-ESN, ASE-ESN, and BASE-ESN for differentWandWin + matrices with the specifications given in the text. The total numbers of wrong + decisions form=3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and + 699. + + + + matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias + is trained for each task. The binary decision is made by a threshold detector + that compares the output of the ESN to 0.5. Figure 7 shows the number of + wrong decisions (averaged over 100 different realizations) made by each + ESN form=3,...,8. + The total numbers of wrong decisions form=3,...,8 of R-ESN, ASE- + ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs + poorly since the nature of the problem requires a short time constant for + fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the + R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. + BASE-ESN performs a lot better than ASE-ESN and slightly better than + the R-ESN since the adaptive bias reduces the spectral radius effectively. + Note that form=7 and 8, the ASE-ESN performs similar to the R-ESN, + since the task requires access to longer input history, which compromises + the need for fast response. Indeed, the bias in the BASE-ESN takes effect + when there are errors (m>4) and when the task benefits from smaller + spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and + 2.7 form=3, 4, 5, and 6, respectively. Form=7 or 8, there is a wide + range of bias values that result in similar MSE values (between 0 and 3). In 132 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + summary, this experiment clearly demonstrates the power of the bias signal + to configure the ESN reservoir according to the mapping task. + + 5.3 System Identification.This section presents a function approxima- + tion task where the aim is to identify a nonlinear dynamical system. The + unknown system is defined by the difference equation + + y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n)), + + where + + f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu). + + The input to the system is chosen to be sin(2πn/25). + We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with + 30 internal units and a single input unit. TheWmatrix of each ESN is scaled + suchthatithasaspectralradiusof0.95.R-ESNisarandomlyconnectedESN + where the entries ofWmatrix are set to 0, 0.35,−0.35 with probabilities 0.8, + 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or−1 with + equal probability, and direct connections from the input to the output are + allowed,whereasWback issetto0.Theoptimaloutputweightsarecalculated + using equation 2.4. The MSE values (averaged over 100 realizations) for R- + ESN and ASE-ESN are 1.23x10 −5 and 1.83x10 −6 , respectively. The addition + of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10 −6 + to 3.27x10 −9 . 
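The plant of this system identification task is easy to reproduce; the following sketch generates the input-output data that would be used to train the readout (the sequence length and the zero initial conditions are our assumptions):

```python
import numpy as np

def nonlinear_plant(u_seq):
    """Unknown system of section 5.3:
    y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n)), with
    f(u) = 0.6 sin(pi u) + 0.3 sin(3 pi u) + 0.1 sin(5 pi u)."""
    f = lambda u: 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) + 0.1 * np.sin(5 * np.pi * u)
    y = np.zeros(len(u_seq) + 1)                 # zero initial conditions (assumption)
    for n in range(1, len(u_seq)):
        y[n + 1] = 0.3 * y[n] + 0.6 * y[n - 1] + f(u_seq[n])
    return y[1:]

n = np.arange(1000)                              # sequence length is our assumption
u = np.sin(2 * np.pi * n / 25)                   # input used in the letter
d = nonlinear_plant(u)                           # desired signal for readout training
```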
6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structures without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, with training limited to the linear output layer. However, the literature did not elucidate how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be the largest to achieve the smallest correlation among the bases and be able to cope with arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method when applied to ESNs built from sigmoidal nonlinearities is that it allows the fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that the design of the ESN using the maximization of ASE with the adaptation of the spectral radius by the bias has provided consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferred to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.
Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems "at the edge of chaos" (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with "critical" parameter values, which correlate with a phase transition between ordered and chaotic regimes.
Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either the output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.
Our emphasis here is mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specify the bases to create the projection space. At the same time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of output feedback in the network is to bias the system to different regions of dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs but also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem.
There are many interesting issues to be researched in this exciting new area. Besides an evaluation tool, ASE may also be utilized to train the ESN's representation layer in an unsupervised fashion. In fact, we can easily adapt, with the SIG (stochastic information gradient) described in Erdogmus, Hild, and Principe (2003), extra weights linking the outputs of recurrent states to maximize output entropy. Output entropy maximization is a well-known metric to create independent components (Bell & Sejnowski, 1995), and here it means that the echo states will become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would fine-tune continuously, in an unsupervised manner, the parameters of the ESN among different inputs. However, it goes against the idea of a fixed ESN reservoir.
The reservoir of recurrent PEs can be thought of as a new form of time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs to the peaks of the input, which is very appealing for signal processing and seems to be used in biology. However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the ESN correlated basis is in the design of the readout.
Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1 norm penalty in the LMS (Rao et al., 2005) show great promise of solving this problem.
Finally, we would like to briefly comment on the implications of these models to neurobiology and computational neuroscience. The work by Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (output of the biological system) needs to be generated, this simple computation to read out the neuronal activity is done. There is an intriguing similarity between the interpretation of the neuronal activity by Pouget and Sejnowski and our interpretation of echo states in ESNs. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally low-pass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESNs with sigmoid PEs.

Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.

References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in the time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997).
Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89).
New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received December 28, 2004; accepted June 1, 2006.
\ No newline at end of file
diff --git a/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt b/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt
new file mode 100644
index 0000000..430d70b
Binary files /dev/null and b/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt differ
diff --git a/Corpus/Neural_Ordinary_Differential_Equations.txt b/Corpus/CORPUS.txt
similarity index 100%
rename from Corpus/Neural_Ordinary_Differential_Equations.txt
rename to Corpus/CORPUS.txt
diff --git a/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt b/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt
new file mode 100644
index 0000000..9906917
--- /dev/null
+++ b/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt
@@ -0,0 +1,391 @@
Channel Pruning for Accelerating Very Deep Neural Networks

Yihui He* (Xi'an Jiaotong University, Xi'an, 710049, China), heyihui@stu.xjtu.edu.cn
Xiangyu Zhang (Megvii Inc., Beijing, 100190, China), zhangxiangyu@megvii.com
Jian Sun (Megvii Inc., Beijing, 100190, China), sunjian@megvii.com
* This work was done when Yihui He was an intern at Megvii Inc.

Abstract

In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves the state-of-the-art results by 5× speed-up along with only 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet, Xception and suffers only 1.4%, 1.0% accuracy loss under 2× speed-up respectively, which is significant.

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).

1. Introduction

Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one.
Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48].
Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, the feature map width (number of channels) could not be reduced, which makes it difficult to decompose the 1×1 convolutional layers favored by modern networks (e.g., GoogleNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.
Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1,48] have focused on imposing a sparse constraint on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have been rarely reported. Inference-time attempts [31,3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited.

Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation that two channels are pruned for feature map B. Thus the corresponding channels of the filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C; kh × kw: kernel size.

In this paper, we propose a new inference-time approach for channel pruning, utilizing redundancy between channels. Inspired by tensor factorization improvement by feature map reconstruction [52], instead of analyzing filter weights [22,31], we fully exploit the redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing the reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternative steps: channel selection and feature map reconstruction.
In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps. Further, we approximate the network layer-by-layer, with the accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]).
For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms previous state-of-the-arts. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4%, 1.0% accuracy loss respectively.

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22].
Optimized implementation based methods [35,47,27,4] accelerate convolution with special convolution algorithms like FFT [47]. Quantization [8,40] reduces floating point computational complexity.
Sparse connection eliminates connections between neurons [17,32,29,15,14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers up to 50×. However, in practice, the actual speed-up may be very related to the implementation.
Tensor factorization [22,28,13,24] decomposes weights into several pieces. [50,10,12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into a 3×3 and 1×1 combination, driven by feature map redundancy.
Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1,48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely exploited.
Inference-time channel pruning is challenging, as reported by previous works [2,39]. Some works [44,34,19] focus on model size compression, which mainly operates on the fully connected layers. For data-free approaches [31,3], results for speed-up ratios (e.g., 5×) have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible for very deep models and large datasets. [31] is sometimes even worse than the naive solution from our observation (Sec. 4.1.1).

3. Approach

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.

3.1. Formulation

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining the outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, the filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction: we need to reconstruct the following feature maps using the selected channels.
Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps.
Formally, to prune a feature map with c channels, we consider applying n × c × kh × kw convolutional filters W on N × c × kh × kw input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and kh, kw are the kernel size. For simple representation, the bias term is not included in our formulation.
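The channel-wise decomposition that this formulation relies on can be checked numerically as below. The shapes and arrays are illustrative, not taken from any particular network; the point is only that the response Y splits into one term per input channel, so zeroing a channel's coefficient removes its slice from both X and W.

```python
import numpy as np

# Illustrative shapes: N sampled positions, c input channels, n output
# channels, and kh x kw kernels, following the notation above.
N, c, n_out, kh, kw = 128, 16, 32, 3, 3
X = np.random.randn(N, c, kh * kw)       # X_i is the N x (kh*kw) slice X[:, i, :]
W = np.random.randn(n_out, c, kh * kw)   # W_i is the n x (kh*kw) slice W[:, i, :]

# Response of the layer at the N sampled positions (an N x n matrix Y).
Y = np.einsum('acq,bcq->ab', X, W)

# The same response decomposes channel-wise as Y = sum_i X_i W_i^T (Eqn. 1
# with all beta_i = 1).  Setting beta_i = 0 simply deletes the i-th term,
# which is why a pruned channel can be cut from both X and W.
Y_channelwise = sum(X[:, i, :] @ W[:, i, :].T for i in range(c))
assert np.allclose(Y, Y_channelwise)
```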
To prune the input channels from c to a desired c' (0 ≤ c' ≤ c), while minimizing the reconstruction error, we formulate our problem as follows:

    arg min_{β,W}  (1/(2N)) || Y − Σ_{i=1}^{c} β_i X_i W_i^T ||_F^2
    subject to  ||β||_0 ≤ c'                                              (1)

||·||_F is the Frobenius norm. X_i is the N × k_h k_w matrix sliced from the i-th channel of the input volumes X, i = 1, ..., c. W_i is the n × k_h k_w filter weights sliced from the i-th channel of W. β is a coefficient vector of length c for channel selection, and β_i is its i-th entry. Notice that, if β_i = 0, X_i is no longer useful and can be safely pruned from the feature map; W_i can also be removed.
Optimization. Solving this ℓ0 minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the ℓ0 to ℓ1 regularization:

    arg min_{β,W}  (1/(2N)) || Y − Σ_{i=1}^{c} β_i X_i W_i^T ||_F^2 + λ ||β||_1
    subject to  ||β||_0 ≤ c',   ∀i ||W_i||_F = 1                          (2)

λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add the constraint ∀i ||W_i||_F = 1 to this formulation, which avoids the trivial solution.
Now we solve this problem in two folds. First, we fix W and solve β for channel selection. Second, we fix β and solve W to reconstruct the error.
(i) The subproblem of β. In this case, W is fixed and we solve β for channel selection. This problem can be solved by LASSO regression [46,5], which is widely used for model selection:

    β̂^LASSO(λ) = arg min_{β}  (1/(2N)) || Y − Σ_{i=1}^{c} β_i Z_i ||_F^2 + λ ||β||_1
    subject to  ||β||_0 ≤ c'                                              (3)

Here Z_i = X_i W_i^T (of size N × n). We ignore the i-th channel if β_i = 0.
(ii) The subproblem of W. In this case, β is fixed. We utilize the selected channels to minimize the reconstruction error. The optimized solution can be found by least squares:

    arg min_{W'}  || Y − X' (W')^T ||_F^2                                  (4)

Here X' = [β_1 X_1  β_2 X_2  ...  β_i X_i  ...  β_c X_c] (of size N × c k_h k_w). W' is the n × c k_h k_w reshaped W, W' = [W_1 W_2 ... W_i ... W_c]. After the result W' is obtained, it is reshaped back to W. Then we assign β_i ← β_i ||W_i||_F and W_i ← W_i / ||W_i||_F, so the constraint ∀i ||W_i||_F = 1 is satisfied.
We alternatively optimize (i) and (ii). In the beginning, W is initialized from the trained model, λ = 0, namely no penalty, and ||β||_0 = c. We gradually increase λ. For each change of λ, we iterate these two steps until ||β||_0 is stable. After ||β||_0 ≤ c' is satisfied, we obtain the final solution W from {β_i W_i}. In practice, we found that the two-step iteration is time consuming, so we apply (i) multiple times until ||β||_0 ≤ c' is satisfied, then apply (ii) just once to obtain the final result. From our observation, this result is comparable with the two-step iteration's. Therefore, in the following experiments, we adopt this approach for efficiency.
Discussion: Some recent works [48,1,17] (though training based) also introduce the ℓ1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduced sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.
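A minimal sketch of one pass of this two-step procedure is given below, using scikit-learn's Lasso (the experiments in Section 4 report using scikit-learn for the solvers). The gradual increase of λ, the renormalization of W, and the stopping rule ||β||_0 ≤ c' are omitted; sklearn's objective matches Eqn. 3 only up to a rescaling of λ, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_and_reconstruct(X, W, Y, lam):
    """One pass of the two-step procedure of Sec. 3.1 (sketch only):
    (i) LASSO on beta with Z_i = X_i W_i^T held fixed, then
    (ii) a least-squares refit (Eqn. 4) on the surviving channels.
    X: (N, c, kh*kw), W: (n, c, kh*kw), Y: (N, n).  Choose lam small enough
    that at least one channel survives."""
    N, c, q = X.shape
    n_out = W.shape[0]
    # (i) channel selection: each column of the design matrix is vec(Z_i).
    Z = np.stack([(X[:, i, :] @ W[:, i, :].T).ravel() for i in range(c)], axis=1)
    beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z, Y.ravel()).coef_
    keep = np.flatnonzero(beta)                    # channels with beta_i != 0
    # (ii) reconstruction: least squares on the kept channels only.
    X_kept = np.concatenate([beta[i] * X[:, i, :] for i in keep], axis=1)
    W_new, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)
    return keep, W_new.T.reshape(n_out, len(keep), q)
```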
3.2. Whole Model Pruning

Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This can be formalized as:

    arg min_{β,W}  (1/(2N)) || Y' − Σ_{i=1}^{c} β_i X_i W_i^T ||_F^2
    subject to  ||β||_0 ≤ c'                                              (5)

Different from Eqn. 1, Y is replaced by Y', which is from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.

3.3. Pruning Multi-Branch Networks

The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1×1, 3×3, 1×1, Fig. 3, left). Layers other than the first and last layer can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows.

Figure 3. Illustration of multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement; c_x denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width could be reduced. As for the last layer, rather than approximating Y2, we try to approximate Y1 + Y2 directly (Sec. 3.3, Last layer of residual branch).

... which need special library implementation support. We do not adopt it in the following experiments.

4. Experiment

We evaluate our approach for the popular VGG Nets [43], ResNet [18], Xception [7] on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].
For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solvers implementation. For channel pruning, we found that it is enough to extract 5000 images, and 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224×224 pixels. We could gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e−5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224×224 and mirror.

4.1. Experiments with VGG-16

VGG-16 [43] is a 16-layer single-path convolutional neural network, with 13 convolutional layers. It is widely used in recognition, detection and segmentation, etc. Single view top-5 accuracy for VGG-16 is 89.9%.

Last layer of residual branch: Shown in Fig. 3, the output layer of a residual block consists of two inputs: feature maps Y1 and Y2 from the shortcut and the residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1, Y2 are the original feature maps before pruning. Y2 could be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 could not be recovered directly. To compensate this error, the optimization goal of the last layer is changed from Y2 to Y1 − Y1' + Y2, which does not change our optimization.
Here,Y′ is the current feature map after 1 previous layers pruned. When pruning, volumes should be 4.1.1 Single Layer Pruning + sampled correspondingly from these two branches. In this subsection, we evaluate single layer acceleration per-First layer of residual branch: Illustrated in formance using our algorithm in Sec.3.1. For better under-Fig.3(left), the input feature map of the residual block standing, we compare our algorithm with two naive chan-could not be pruned, since it is also shared with the short- nel selection strategies.first kselects the firstkchannels.cut branch. In this condition, we could performfeature max responseselects channels based on corresponding fil-map samplingbefore the first convolution to save compu- ters that have high absolute weights sum [31]. For fair com-tation. We still apply our algorithm as Eqn.1. Differently, parison, we obtain the feature map indexes selected by eachwe sample the selected channels on the shared feature maps of them, then perform reconstruction (Sec. 3.1(ii)). We to construct a new input for the later convolution, shown hope that this could demonstrate the importance of channelin Fig.3(right). Computational cost for this operation could selection. Performance is measured by increase of error af-be ignored. More importantly, after introducingfeature map ter a certain layer is pruned without fine-tuning, shown insampling, the convolution is still ”regular”. Fig.4.Filter-wise pruningis another option for the first con- As expected, error increases as speed-up ratio increases.volution on the residual branch. Since the input channels Our approach is consistently better than other approaches inof parameter-free shortcut branch could not be pruned, we different convolutional layers under different speed-up ra-apply our Eqn.1to each filter independently (each fil- tio. Unexpectedly, sometimesmax responseis even worseter chooses its own representative input channels). Under thanfirst k. We argue thatmax responseignores correla-single layer acceleration,filter-wise pruningis more accu- tions between different filters. Filters with large absoluterate than our original one. From our experiments, it im- weight may have strong correlation. Thus selection based proves 0.5% top-5 accuracy for2×ResNet-50 (applied on on filter weights is less meaningful. Correlation on featurethe first layer of each residual branch) without fine-tuning. maps is worth exploiting. We can find that channel selectionHowever, after fine-tuning, there’s no noticeable improve- + ment. In addition, it outputs ”irregular” convolutional lay- 1 http://www.vlfeat.org/matconvnet/pretrained/ + + + + 1392 conv1_1 conv2_1 conv3_1 5 + first k first k first k + max response max response max response 4 ours ours ours + + + + + + increase of error (%) 3 + + 2 + + 1 + + 0 + + conv3_2 conv4_1 conv4_2 5 + first k first k first k + max response max response max response 4 ours ours ours + + + + + + increase of error (%) 3 + + 2 + + 1 + + 01.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 + speed-up ratio speed-up ratio speed-up ratio + Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify + the importance of channel selection refered in Sec.3.1, we considered two naive baselines.first kselects the firstkfeature maps.max + responseselects channels based on absolute sum of corresponding weights filter [31]. Our approach is consistently better (smaller is + better). 
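As a point of comparison, the max response baseline can be implemented roughly as below. This is one plausible reading of the description above (scoring each input channel by the absolute sum of the filter weights that read from it); the exact scoring used in [31] may differ, and the reconstruction step of Sec. 3.1 (ii) is not repeated here.

```python
import numpy as np

def max_response_channels(W, c_keep):
    """One plausible 'max response' channel selection: score each input
    channel of W (shape n x c x kh x kw) by the absolute sum of the filter
    weights that read from it, and keep the c_keep highest-scoring ones."""
    scores = np.abs(W).sum(axis=(0, 2, 3))        # one score per input channel
    return np.sort(np.argsort(scores)[::-1][:c_keep])

W = np.random.randn(64, 32, 3, 3)                 # illustrative filter bank
print(max_response_channels(W, c_keep=16))
```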
+ + + Increase of top-5 error (1-view, baseline 89.9%) periments above, we pruning more aggressive for shal- + Solution 2× 4× 5× lower layers. Remaining channels ratios for shallow lay- + Jaderberget al. [22] ([52]’s impl.) - 9.7 29.7 ers (conv1_xtoconv3_x) and deep layers (conv4_x) + Asym. [52] 0.28 3.84 - is1 : 1.5.conv5_xare not pruned, since they only con- + Filter pruning [31] tribute 9% computation in total and are not redundant.0.8 8.6 14.6(fine-tuned, our impl.) After fine-tuning, we could reach2×speed-up without + Ours (without fine-tune) 2.7 7.9 22.0 losing accuracy. Under4×, we only suffers 1.0% drops. + Ours (fine-tuned) 0 1.0 1.7 Consistent with single layer analysis, our approach outper- + Table 1. Accelerating the VGG-16 model [43] using a speedup forms previous channel pruning approach (Liet al. [31]) by + ratio of2×,4×, or5×(smaller is better). large margin. This is because we fully exploits channel re- + dundancy within feature maps. Compared with tensor fac- + affects reconstruction error a lot. Therefore, it is important torization algorithms, our approach is better than Jaderberg + for channel pruning. et al. [22], without fine-tuning. Though worse than Asym. + Also notice that channel pruning gradually becomes [52], our combined model outperforms its combined Asym. + hard, from shallower to deeper layers. It indicates that shal- 3D (Table2). This may indicate that channel pruning is + lower layers have much more redundancy, which is consis- more challenging than tensor factorization, since removing + tent with [52]. We could prune more aggressively on shal- channels in one layer might dramatically change the input + lower layers in whole model acceleration. of the following layer. However, channel pruning keeps the + original model architecture, do not introduce additional lay- + ers, and the absolute speed-up ratio on GPU is much higher4.1.2 Whole Model Pruning (Table 3). + Shown in Table1, whole model acceleration results under Since our approach exploits a new cardinality, we further + 2×,4×,5×are demonstrated. We adopt whole model combine our channel pruning with spatial factorization [22] + pruning proposed in Sec.3.2. Guided by single layer ex- and channel factorization [52]. Demonstrated in Table2, + + + + 1393 Increase of top-5 error (1-view, 89.9%) scratch. This coincides with architecture design researches + Solution 4× 5× [20,1] that the model could be easier to train if there are + Asym. 3D [52] 0.9 2.0 more channels in shallower layers. However, channel prun- + Asym. 3D (fine-tuned) [52] 0.3 1.0 ing favors shallower layers. + Our 3C 0.7 1.3 For from scratch (uniformed), the filters in each layers + Our 3C (fine-tuned) 0.0 0.3 is reduced by half (eg. reduceconv1_1from 64 to 32). + Table 2. Performance of combined methods on the VGG-16 model We can observe that normal setting networks of the same + [43] using a speed-up ratio of4×or5×. Our 3C solution outper- complexity couldn’t reach same accuracy either. This con- + forms previous approaches (smaller is better). solidates our idea that there’s much redundancy in networks + while training. However, redundancy can be opt out at + inference-time. This maybe an advantage of inference-timeour 3 cardinalities acceleration (spatial, channel factoriza- acceleration approaches over training-based approaches.tion, and channel pruning, denoted by 3C) outperforms pre- Notice that there’s a 0.6% gap between the from scratchvious state-of-the-arts. Asym. 
3D [52] (spatial and chan- model and uniformed one, which indicates that there’s roomnel factorization), factorizes a convolutional layer to three for model exploration. Adopting our approach is muchparts:1×3,3×1,1×1. faster than training a model from scratch, even for a thin-We apply spatial factorization, channel factorization, and ner one. Further researches could alleviate our approach to our channel pruning together sequentially layer-by-layer. do thin model exploring.We fine-tune the accelerated models for 20 epoches, since + they are 3 times deeper than the original ones. After fine- + tuning, our4×model suffers no degradation. Clearly, a 4.1.5 Acceleration for Detection + combination of different acceleration techniques is better VGG-16 is popular among object detection tasks [42,41,than any single one. This indicates that a model is redun- 33]. We evaluate transfer learning ability of our2×/4×dant in each cardinality. pruned VGG-16, for Faster R-CNN [42] object detections. + PASCAL VOC 2007 object detection benchmark [11] con- + 4.1.3 Comparisons of Absolute Performance tains 5k trainval images and 5k test images. The per- + formance is evaluated by mean Average Precision (mAP).We further evaluate absolute performance of acceleration In our experiments, we first perform channel pruning foron GPU. Results in Table3are obtained under Caffe [23], VGG-16 on the ImageNet. Then we use the pruned modelCUDA8 [37] and cuDNN5 [6], with a mini-batch of 32 as the pre-trained model for Faster R-CNN.on a GPU (GeForce GTX TITAN X). Results are averaged The actual running time of Faster R-CNN is 220ms / im-from 50 runs. Tensor factorization approaches decompose age. The convolutional layers contributes about 64%. Weweights into too many pieces, which heavily increase over- got actual time of 94ms for4×acceleration. From Table5,head. They could not gain much absolute speed-up. Though we observe 0.4% mAP drops of our2×model, which is notour approach also encountered performance decadence, it harmful for practice consideration.generalizes better on GPU than other approaches. Our re- + sults for tensor factorization differ from previous research 4.2. Experiments with Residual Architecture Nets + [52,22], maybe because current library and hardware pre- For Multi-path networks [45,18,7], we further explorefer single large convolution instead of several small ones. the popular ResNet [18] and latest Xception [7], on Ima- + geNet and CIFAR-10. Pruning residual architecture nets is + 4.1.4 Comparisons with Training from Scratch more challenging. These networks are designed for both ef- + ficiency and high accuracy. Tensor factorization algorithmsThough training a compact model from scratch is time- [52,22] have difficult to accelerate these model. Spatially,consuming (usually 120 epoches), it worths comparing our 1×1convolution is favored, which could hardly be factor-approach and from scratch counterparts. To be fair, we eval- ized.uated both from scratch counterpart, and normal setting net- + work that has the same computational complexity and same 4.2.1 ResNet Pruningarchitecture. + Shown in Table4, we observed that it’s difficult for ResNet complexity uniformly drops on each residual block. + from scratch counterparts to reach competitive accuracy. Guided by single layer experiments (Sec. 4.1.1), we still + our model outperforms from scratch one. Our approach prefer reducing shallower layers heavier than deeper ones. 
+ successfully picks out informative channels and constructs Following similar setting as Filter pruning [31], we + highly compact models. We can safely draw the conclu- keep 70% channels for sensitive residual blocks (res5 + sion that the same model is difficult to be obtained from and blocks close to the position where spatial size + + + + 1394 Model Solution Increased err. GPU time/ms + VGG-16 - 0 8.144 + Jaderberget al. [22] ([52]’s impl.) 9.7 8.051(1.01×) + Asym. [52] 3.8 5.244(1.55×) + VGG-16 (4×) Asym. 3D [52] 0.9 8.503(0.96×) + Asym. 3D (fine-tuned) [52] 0.3 8.503(0.96×) + Ours (fine-tuned) 1.0 3.264 (2.50×) + Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is + better). + + + Original (acc. 89.9%) Top-5 err. Increased err. Solution Increased err. + From scratch 11.9 1.8 Filter pruning [31] (our impl.) 92.8 + From scratch (uniformed) 12.5 2.4 Filter pruning [31] 4.3Ours 18.0 7.9 (fine-tuned, our impl.) + Ours (fine-tuned) 11.1 1.0 Ours 2.9 + Table 4. Comparisons with training from scratch, under4×accel- Ours (fine-tuned) 1.0 + eration. Our fine-tuned model outperforms scratch trained coun- Table 7. Comparisons for Xception-50, under2×acceleration ra- + terparts (smaller is better). tio. The baseline network’s top-5 accuracy is 92.8%. Our ap- + proach outperforms previous approaches. Most structured sim- + plification methods are not effective on Xception architecture + Speedup mAP ∆mAP (smaller is better). + Baseline 68.7 - + 2× 68.3 0.4 + 4× 66.9 1.8 4.2.2 Xception Pruning + Table 5.2×,4×acceleration for Faster R-CNN detection. + Since computational complexity becomes important in + model design, separable convolution has been payed muchSolution Increased err. attention [49,7]. Xception [7] is already spatially optimizedOurs 8.0 and tensor factorization on1×1convolutional layer is de-Ours 4.0 structive. Thanks to our approach, it could still be acceler-(enhanced) ated with graceful degradation. For the ease of comparison,Ours 1.4 we adopt Xception convolution on ResNet-50, denoted by(enhanced, fine-tuned) Xception-50. Based on ResNet-50, we swap all convolu- Table 6.2×acceleration for ResNet-50 on ImageNet, the base- tional layers with spatial conv blocks. To keep the same line network’s top-5 accuracy is 92.2% (one view). We improve computational complexity, we increase the input channels performance with multi-branch enhancement (Sec.3.3,smaller is of allbranch2blayers by2×. The baseline Xception- better). 50 has a top-5 accuracy of 92.8% and complexity of 4450 + MFLOPs. + We apply multi-branch variants of our approach as de-change, e.g. res3a,res3d). As for other blocks, scribed in Sec.3.3, and adopt the same pruning ratio settingwe keep 30% channels. With multi-branch enhance- as ResNet in previous section. Maybe because of Xcep-ment, we prunebranch2amore aggressively within tion block is unstable, Batch Normalization layers must beeach residual block. The remaining channels ratios for maintained during pruning. Otherwise it becomes nontrivialbranch2a,branch2b,branch2cis2 : 4 : 3(e.g., to fine-tune the pruned model.Given 30%, we keep 40%, 80%, 60% respectively). Shown in Table7, after fine-tuning, we only suffer1.0% + We evaluate performance of multi-branch variants of our increase of error under2×. Filter pruning [31] could also + approach (Sec. 3.3). From Table6, we improve 4.0% apply on Xception, though it is designed for small speed- + with our multi-branch enhancement. This is because we up ratio. 
Without fine-tuning, top-5 error is 100%. After + accounted the accumulated error from shortcut connection training 20 epochs which is like training from scratch, in- + which could broadcast to every layer after it. And the large creased error reach 4.3%. Our results for Xception-50 are + input feature map width at the entry of each residual block not as graceful as results for VGG-16, since modern net- + is well reduced by ourfeature map sampling. works tend to have less redundancy by design. + + + + 1395 Solution Increased err. [4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: + Filter pruning [31] Lookup-based convolutional neural network.arXiv preprint 1.3(fine-tuned, our impl.) arXiv:1611.06473, 2016.2 + From scratch 1.9 [5] L. Breiman. Better subset regression using the nonnegative + Ours 2.0 garrote.Technometrics, 37(4):373–384, 1995.3 + Ours (fine-tuned) 1.0 [6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, + Table 8.2×speed-up comparisons for ResNet-56 on CIFAR-10, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives + the baseline accuracy is 92.8% (one view). We outperforms previ- for deep learning.CoRR, abs/1410.0759, 2014.6 + ous approaches and scratch trained counterpart (smaller is better). [7] F. Chollet. Xception: Deep learning with depthwise separa- + ble convolutions.arXiv preprint arXiv:1610.02357, 2016. 1, + 2,3,4,6,7 + 4.2.3 Experiments on CIFAR-10 [8] M. Courbariaux and Y. Bengio. Binarynet: Training deep + neural networks with weights and activations constrained to+ + Even though our approach is designed for large datasets, it 1 or-1.arXiv preprint arXiv:1602.02830, 2016.1,2 + could generalize well on small datasets. We perform ex- [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- + periments on CIFAR-10 dataset [25], which is favored by Fei. Imagenet: A large-scale hierarchical image database. + many acceleration researches. It consists of 50k images for InComputer Vision and Pattern Recognition, 2009. CVPR + training and 10k for testing in 10 classes. 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4 + We reproduce ResNet-56, which has accuracy of 92.8% [10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer- + (Serve as a reference, the official ResNet-56 [18] has ac- gus. Exploiting linear structure within convolutional net- + curacy of 93.0%). For2×acceleration, we follow similar works for efficient evaluation. InAdvances in Neural In- + formation Processing Systems, pages 1269–1277, 2014.2 setting as Sec.4.2.1(keep the final stage unchanged, where + the spatial size is8×8). Shown in Table8, our approach [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, + and A. Zisserman. The PASCAL Visual Object Classes is competitive with scratch trained one, without fine-tuning, Challenge 2007 (VOC2007) Results. http://www.pascal- under2×speed-up. After fine-tuning, our result is signif- network.org/challenges/VOC/voc2007/workshop/index.html. icantly better than Filter pruning [31] and scratch trained 4,6 + one. [12] R. Girshick. Fast r-cnn. InProceedings of the IEEE Inter- + national Conference on Computer Vision, pages 1440–1448, + 5. Conclusion 2015.2 + [13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compress- + To conclude, current deep CNNs are accurate with high ing deep convolutional networks using vector quantization. + inference costs. In this paper, we have presented an arXiv preprint arXiv:1412.6115, 2014.2 + inference-time channel pruning method for very deep net- [14] Y. Guo, A. Yao, and Y. Chen. 
Dynamic network surgery for + works. The reduced CNNs are inference efficient networks efficient dnns. InAdvances In Neural Information Process- + while maintaining accuracy, and only require off-the-shelf ing Systems, pages 1379–1387, 2016.2 + libraries. Compelling speed-ups and accuracy are demon- [15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, + strated for both VGG Net and ResNet-like networks on Im- and W. J. Dally. Eie: efficient inference engine on com- + ageNet, CIFAR-10 and PASCAL VOC. pressed deep neural network. InProceedings of the 43rd + International Symposium on Computer Architecture, pages In the future, we plan to involve our approaches into 243–254. IEEE Press, 2016. 2 training time, instead of inference time only, which may [16] S. Han, H. Mao, and W. J. Dally. Deep compression: Com- also accelerate training procedure. pressing deep neural network with pruning, trained quantiza- + tion and huffman coding.CoRR, abs/1510.00149, 2, 2015. + References 2 + [17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights + [1] J. M. Alvarez and M. Salzmann. Learning the number of and connections for efficient neural network. InAdvances in + neurons in deep networks. InAdvances in Neural Informa- Neural Information Processing Systems, pages 1135–1143, + tion Processing Systems, pages 2262–2270, 2016. 1,2,3, 2015.1,2,3 + 6 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- + [2] S. Anwar, K. Hwang, and W. Sung. Structured prun- ing for image recognition.arXiv preprint arXiv:1512.03385, + ing of deep convolutional neural networks. arXiv preprint 2015. 1,2,3,4,6,8 + arXiv:1512.08571, 2015.2 [19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trim- + [3] S. Anwar and W. Sung. Compact deep convolutional ming: A data-driven neuron pruning approach towards effi- + neural networks with coarse pruning. arXiv preprint cient deep architectures. arXiv preprint arXiv:1607.03250, + arXiv:1610.09639, 2016.1,2 2016.2 + + + + + 1396 [20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, [38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, + A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, + Speed/accuracy trade-offs for modern convolutional object V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, + detectors.arXiv preprint arXiv:1611.10012, 2016. 6 M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma- + [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating chine learning in Python.Journal of Machine Learning Re- + deep network training by reducing internal covariate shift. search, 12:2825–2830, 2011.4 + arXiv preprint arXiv:1502.03167, 2015.4 [39] A. Polyak and L. Wolf. Channel-level acceleration of deep + [22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up face representations.IEEE Access, 3:2163–2175, 2015.2 + convolutional neural networks with low rank expansions. [40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor- + arXiv preprint arXiv:1405.3866, 2014.1,2,5,6,7 net: Imagenet classification using binary convolutional neu- + [23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- ral networks. InEuropean Conference on Computer Vision, + shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- pages 525–542. Springer, 2016. 2 + tional architecture for fast feature embedding.arXiv preprint [41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. arXiv:1408.5093, 2014. 
4,6 You only look once: Unified, real-time object detection. + [24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. CoRR, abs/1506.02640, 2015. 6 + Compression of deep convolutional neural networks for [42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN:fast and low power mobile applications. arXiv preprint towards real-time object detection with region proposal net-arXiv:1511.06530, 2015.2 works.CoRR, abs/1506.01497, 2015.6 [25] A. Krizhevsky and G. Hinton. Learning multiple layers of [43] K. Simonyan and A. Zisserman. Very deep convolutionalfeatures from tiny images. 2009.4,8 networks for large-scale image recognition. arXiv preprint[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet arXiv:1409.1556, 2014.3,4,5,6classification with deep convolutional neural networks. In [44] S. Srinivas and R. V. Babu. Data-free parameter pruningAdvances in neural information processing systems, pages for deep neural networks.arXiv preprint arXiv:1507.06149,1097–1105, 2012.2,3 2015.2[27] A. Lavin. Fast algorithms for convolutional neural networks. [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,arXiv preprint arXiv:1509.09308, 2015.2 D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and Going deeper with convolutions. InProceedings of the IEEEV. Lempitsky. Speeding-up convolutional neural net- Conference on Computer Vision and Pattern Recognition,works using fine-tuned cp-decomposition. arXiv preprint pages 1–9, 2015.1,3,6arXiv:1412.6553, 2014.2 [46] R. Tibshirani. Regression shrinkage and selection via the[29] V. Lebedev and V. Lempitsky. Fast convnets using group- lasso. Journal of the Royal Statistical Society. Series Bwise brain damage.arXiv preprint arXiv:1506.02515, 2015. (Methodological), pages 267–288, 1996.32 [47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Pi-[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient- antino, and Y. LeCun. Fast convolutional nets withbased learning applied to document recognition. Proceed- fbfft: A gpu performance evaluation. arXiv preprintings of the IEEE, 86(11):2278–2324, 1998.2,3 arXiv:1412.7580, 2014.1,2[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. + Graf. Pruning filters for efficient convnets. arXiv preprint [48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning + arXiv:1608.08710, 2016.1,2,4,5,6,7,8 structured sparsity in deep neural networks. InAdvances In + [32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Neural Information Processing Systems, pages 2074–2082, + Sparse convolutional neural networks. InProceedings of the 2016.1,2,3 + IEEE Conference on Computer Vision and Pattern Recogni- [49] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated´ + tion, pages 806–814, 2015.2 residual transformations for deep neural networks. arXiv + [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, preprint arXiv:1611.05431, 2016.7 + C. Fu, and A. C. Berg. SSD: single shot multibox detector. [50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural + CoRR, abs/1512.02325, 2015.6 network acoustic models with singular value decomposition. + [34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint InINTERSPEECH, pages 2365–2369, 2013.2 + arXiv:1511.05077, 2015.2 [51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy- + [35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training efficient convolutional neural networks using energy-aware + of convolutional networks through ffts. 
\ No newline at end of file
diff --git a/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt b/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt
new file mode 100644
index 0000000..a4ec71b
Binary files /dev/null and b/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt differ
diff --git a/Corpus/DEEP DOUBLE DESCENT - Preetum Nakkiran.txt b/Corpus/DEEP DOUBLE DESCENT - Preetum Nakkiran.txt
new file mode 100644
index 0000000..282e671
Binary files /dev/null and b/Corpus/DEEP DOUBLE DESCENT - Preetum Nakkiran.txt differ
diff --git a/Corpus/Deep Residual Learning for Image Recognition.txt b/Corpus/Deep Residual Learning for Image Recognition.txt
new file mode 100644
index 0000000..6cb144d
Binary files /dev/null and b/Corpus/Deep Residual Learning for Image Recognition.txt differ
diff --git a/Corpus/Direct Feedback Alignment Scales toModern Deep Learning Tasks and Architectures.txt b/Corpus/Direct Feedback Alignment Scales toModern Deep Learning Tasks and Architectures.txt
new file mode 100644
index 0000000..8b3ad5c
--- /dev/null
+++ b/Corpus/Direct Feedback Alignment Scales toModern Deep Learning Tasks and Architectures.txt
@@ -0,0 +1,1161 @@

 Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures

 Julien Launay 1,2   Iacopo Poli 1   François Boniface 1   Florent Krzakala 1,2

 1 LightOn   2 École Normale Supérieure

 arXiv:2006.12878v1 [stat.ML] 23 Jun 2020
 {julien, iacopo, francois, florent}@lighton.ai

 Abstract

 Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.

 1 Introduction

 While the backpropagation algorithm (BP) [1,2] is at the heart of modern deep learning achievements, it is not without pitfalls.
For one, its weight updates are non-local and rely on upstream layers. Thus, + they cannot be easily parallelized [3], incurring important memory and compute costs. Moreover, + its biological implementation is problematic [4,5]. For instance, BP relies on the transpose of the + weights to evaluate updates. Hence, synaptic symmetry is required between the forward and backward + path: this is implausible in biological brains, and known as the weight transport problem [6]. + Consequently, alternative training algorithms have been developed. Some of these algorithms are + explicitly biologically inspired [7–13], while others focus on making better use of available compute + resources [3,14–19]. Despite these enticing characteristics, none has been widely adopted, as they + are often demonstrated on a limited set of tasks. Moreover, as assessed in [20], their performance on + challenging datasets under the constraint of synaptic asymmetry is disappointing. + We seek to broaden this perspective, and demonstrate the applicability of Direct Feedback Alignment + (DFA) [19] in state-of-the-art settings: from applications of fully connected networks such as neural + view synthesis and recommender systems, to geometric learning with graph convolutions, and natural + language processing with Transformers. Our results define new standards for learning without weight + transport and show that challenging tasks can indeed be tackled under synaptic asymmetry. + All code needed to reproduce our experiments is available athttps://github.com/lightonai/ + dfa-scales-to-modern-deep-learning. + + + + + + + Preprint. Under review. 1.1 Related work + + Training a neural network is a credit assignment problem: an update is derived for each parameter + from its contribution to a cost function. To solve this problem, a spectrum of algorithms exists [21]. + + Biologically motivated methods Finding a training method applicable under the constraints of + biological brains remains an open problem. End-to-end propagation of gradients is unlikely to occur + [22], implying local learning is required. Furthermore, the weight transport problem enforces synaptic + asymmetry [6]. Inspired by auto-encoders, target propagation methods (TP) [10–12] train distinct + feedback connections to invert the feedforward ones. Feedback alignment (FA) [13] replaces the + transpose of the forward weights used in the backward pass by a random matrix. Throughout training, + the forward weights learn toalignwith the arbitrary backward weights, eventually approximating BP. + + Beyond biological considerations As deep learning models grow bigger, large-scale distributed + training is increasingly desirable. Greedy layer-wise training [14] allows networks to be built layer + by layer, limiting the depth of backpropagation. To enable parallelization of the backward pass, + updates must only depend on local quantities. Unsupervised learning is naturally suited for this, + as it relies on local losses such as Deep InfoMax [17] and Greedy InfoMax [18]. More broadly, + synthetic gradient methods, like decoupled neural interfaces [3,15] and local error signals (LES) + [16], approximate gradients using layer-wise trainable feedback networks. DFA [19] expands on FA + and directly projects a global error to each layer. A shared feedback path is still needed, but it only + depends on a simple random projection operation. + + Performance of alternative methods Local training methods are successful in unsupervised learn- + ing [18]. 
Even in a supervised setting, they scale to challenging datasets like CIFAR-100 or ImageNet [14,16]. Thus, locality is not too penalizing. However, TP, FA, and DFA are unable to scale to these tasks [20]. In fact, DFA is unable to train convolutional layers [23]. To enable feedback alignment techniques to perform well on challenging datasets, some form of weight transport is necessary: either by explicitly sharing sign information [24–26], or by introducing dedicated phases of alignment for the forward and backward weights where some information is shared [27]. To the best of our knowledge, no method compatible with the weight transport problem has ever been demonstrated on challenging tasks.

 1.2 Motivations and contributions

 We focus on DFA, a compromise between biological and computational considerations. Notably, DFA is compatible with synaptic asymmetry: this asymmetry raises important challenges, seemingly preventing learning in demanding settings. Moreover, it allows for asynchronous weight updates, and puts a single operation at the center of the training stage. This enables new classes of training co-processors [28, 29], leveraging dedicated hardware to perform the random projection.

 Extensive survey  We apply DFA in a large variety of settings matching current trends in machine learning. Previous works have found that DFA is unsuitable for computer vision tasks [20,23]; but computer vision alone cannot be the litmus test of a training method. Instead, we consider four vastly different domains, across eight tasks, and with eleven different architectures. This constitutes a survey of unprecedented scale for an alternative training method, and makes a strong case for the possibility of learning without weight transport in demanding scenarios.

 Challenging settings  We demonstrate the ability of DFA to tackle challenging tasks. We successfully learn and render real-world 3D scenes (section 3.1.1); we perform recommendation at scale (section 3.1.2); we explore graph-based citation networks (section 3.2); and we consider language modelling with a Transformer (section 3.3). We study tasks at the state-of-the-art level, which have only recently been successfully tackled with deep learning.

 Modern architectures  We prove that the previously established failure of DFA to train convolutions does not generalize. By evaluating performance metrics, comparing against a shallow baseline, measuring alignment, and visualizing t-SNE embeddings, we show that learning indeed occurs in layers involving graph convolutions and attention. This significantly broadens the applicability of DFA, previously thought to be limited to simple problems like MNIST and CIFAR-10.

 2 Methods

 Forward pass  In a fully connected network, at layer i out of N, neglecting its biases, with W_i its weight matrix, f_i its non-linearity, and h_i its activations, the forward pass is:

 \forall i \in [1, \dots, N]: \quad a_i = W_i h_{i-1}, \quad h_i = f_i(a_i). \quad (1)

 h_0 = X is the input data, and h_N = f(a_N) = ŷ are the predictions. A task-specific cost function L(ŷ, y) is computed to quantify the quality of the predictions with respect to the targets y.

 Backward pass with BP  The weight updates are computed by backpropagation of the error vector. Using the chain-rule of derivatives, each neuron is updated based on its contribution to the cost function. Leaving aside the specifics of the optimizer used, the equation for the weight updates is:

 \delta W_i = -\frac{\partial \mathcal{L}}{\partial W_i} = -\left[ \left( W_{i+1}^{T} \, \delta a_{i+1} \right) \odot f_i'(a_i) \right] h_{i-1}^{T}, \qquad \delta a_i = \frac{\partial \mathcal{L}}{\partial a_i}. \quad (2)

 Backward pass with DFA  The gradient signal W_{i+1}^T δa_{i+1} of the (i+1)-th layer violates synaptic asymmetry. DFA replaces it with a random projection of the topmost derivative of the loss, δa_y. For common classification and regression losses such as the mean squared error or the negative log likelihood, this corresponds to a random projection of the global error e = ŷ − y. With B_i a fixed random matrix of appropriate shape drawn at initialization for each layer:

 \delta W_i = -\left[ \left( B_i \, \delta a_y \right) \odot f_i'(a_i) \right] h_{i-1}^{T}, \qquad \delta a_y = \frac{\partial \mathcal{L}}{\partial a_y}. \quad (3)
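 As a concrete illustration of Eqs. (1) to (3), the following is a minimal NumPy sketch of DFA training for a toy fully connected network with tanh hidden units and a mean squared error loss. The layer widths, learning rate, and variable names are purely illustrative and are not taken from the actual implementation, which is the one released in the repository linked in the introduction.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [8, 16, 16, 4]                                   # illustrative layer widths
    # Forward weights W_i and fixed random feedback matrices B_i (Eq. 3), drawn once.
    W = [rng.normal(0.0, 0.1, size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
    B = [rng.normal(0.0, 0.1, size=(n, sizes[-1])) for n in sizes[1:-1]]

    def dfa_step(x, y, lr=0.01):
        # Forward pass (Eq. 1): a_i = W_i h_{i-1}, h_i = f_i(a_i); linear output layer.
        h, a = [x], []
        for i, Wi in enumerate(W):
            a.append(Wi @ h[-1])
            h.append(a[-1] if i == len(W) - 1 else np.tanh(a[-1]))
        e = h[-1] - y                                        # global error, i.e. delta a_y for the MSE loss
        # DFA updates (Eq. 3): each hidden layer gets its own random projection of e.
        for i in range(len(W) - 1):
            delta_a = (B[i] @ e) * (1.0 - np.tanh(a[i]) ** 2)  # (B_i delta a_y) * f_i'(a_i)
            W[i] -= lr * np.outer(delta_a, h[i])
        # The output layer receives the true error directly; no weight transport is needed there.
        W[-1] -= lr * np.outer(e, h[-2])
        return 0.5 * float(e @ e)

    x, y = rng.normal(size=8), rng.normal(size=4)
    for _ in range(200):
        loss = dfa_step(x, y)                                # the loss shrinks on this toy fit

 Note that, contrary to BP, the per-layer updates above depend only on the local activations and on the shared global error, which is what allows them to be computed in parallel.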
3 Experiments

 We study the applicability of DFA to a diverse set of applications requiring state-of-the-art architectures. We start with fully connected networks, where DFA has already been demonstrated, and address new challenging settings. We then investigate geometric learning: we apply DFA to graph neural networks in classification tasks on citation networks, as well as graph autoencoders. These architectures feature graph convolutions and attention layers. Finally, we use DFA to train a transformer-based Natural Language Processing (NLP) model on a dataset of more than 100 million tokens.

 3.1 Fully connected architectures

 DFA has been successful at training fully connected architectures, with performance on-par with backpropagation [19,20]. However, only computer vision tasks have been considered, where fully connected networks considerably underperform their convolutional counterpart. Here, we focus on tasks where fully connected architectures are state-of-the-art. Moreover, the architectures considered are deeper and more complex than those necessary to solve a simple task like MNIST.

 3.1.1 Neural view synthesis with Neural Radiance Fields

 The most recent state-of-the-art neural view synthesis methods are based on large fully connected networks: this is an ideal setting for a first evaluation of DFA on a challenging task.

 Background  There has been growing interest in methods capable of synthesising novel renders of a 3D scene using a dataset of past renders. The network is trained to learn an inner representation of the scene, and a classical rendering system can then query the model to generate novel views. With robust enough methods, real-world scenes can also be learned from a set of pictures.

 Until recently, most successful neural view synthesis methods were based on sampled volumetric representations [30–32]. In this context, Convolutional Neural Networks (CNNs) can be used to smooth out the discrete sampling of 3D space [33,34]. However, these methods scale poorly to higher resolutions, as they still require finer and finer sampling. Conversely, alternative schemes based on a continuous volume representation have succeeded in generating high-quality renders [35], even featuring complex phenomena such as view-dependent scattering [36]. These schemes make point-wise predictions, and use fully connected neural networks to encode the scene.

 Figure 1: Comparisons of NeRF-DFA with state-of-the-art methods trained with BP on the most challenging synthetic and real-world scenes. While NeRF-DFA generates renders of lower quality, they maintain multi-view consistency and exhibit no geometric artefacts. BP results from [36].

 Setting  We employ Neural Radiance Fields (NeRF) [36], the state-of-the-art for neural view synthesis.
NeRF represents scenes as a continuous 5D function of space–three spatial coordinates, + two viewing angles–and outputs a point-wise RGB radiance and opacity. A ray-casting renderer can + then query the network to generate arbitrary views of the scene. The network modeling the continuous + function is 10 layers deep. Two identical networks are trained: thecoarsenetwork predictions inform + the renderer about the spatial coordinates that thefinenetwork should preferentially evaluate to avoid + empty space and occluded regions. + + Results We report quantitative results of training NeRF with DFA in Table 1. Neural view synthesis + methods are often better evaluated qualitatively: we showcase some renders in Figure 1. + On a dataset of renders featuring complex scenes with non-Lambertian materials (NeRF-Synthetic + [36]), NeRF-DFA outperforms two previous fine-tuned state-of-the-art methods–Scene Representation + Networks (SRN) [35] and Local Light Field Fusion (LLFF) [32]–and nearly matches the performance + of Neural Volumes (NV) [34]. While DFA underperforms alternative methods trained with BP on + the real world view dataset (LLFF-Real [32]), its performance remains significant: real world view + synthesis is a challenging tasks, and this level of PSNR indicates that learning is indeed happening. + In particular, we find that NeRF-DFA retains the key characteristics of NeRF-BP: it can render view- + dependant effects, and is multi-view consistent. The last point is an especially important achievement, + and most visible in videos, as it is a challenge for most algorithms [30–32,35]. The main drawback + of NeRF-DFA appears to be a seemingly lower render definition. The NeRF architecture has not + + + Table 1: Peak Signal to Noise Ratio (PSNR, higher is better) of neural view synthesis methods + trained with backpropagation against NeRF trained with DFA. Even when trained with DFA, NeRF + outperforms two state-of-the-art methods on a synthetic dataset (NeRF-Synthetic), and achieves fair + performance on a challenging real world views datasets (LLFF-Real). BP results from [36]. + + NV SRN LLFF NeRF + BP BP BP BP DFA + NeRF-Synthetic 26.05 22.26 24.88 31.01 25.41 + LLFF-Real / 22.84 24.13 26.50 20.77 + + + 4 been fine-tuned to achieve these results: DFA works out-of-the-box on this advanced method. Future + research focusing on architectural changes to NeRF could improve performance with DFA; some + preliminary results are included in the supplementary material. + + 3.1.2 Click-through rate prediction with recommender systems + We have demonstrated that DFA can train large fully connected networks on the difficult task of neural + view synthesis. We now seek to use DFA in more complex heterogeneous architectures, combining + the use of fully connected networks with other machine learning methods.Recommender systemsare + an ideal application for such considerations. + + Background Recommender systems are used to model the behavior of users and predict future + interactions. In particular, in the context of click-through rate (CTR) prediction, these systems model + the probability of a user clicking on a given item. Building recommender systems is hard [37]: their + input is high-dimensional and sparse, and the model must learn to extract high-order combinatorial + features from the data. Moreover, they need to do so efficiently, as they are used to make millions of + predictions and the training data may contain billions of examples. 
+ Factorization Machines (FM) [38] use inner-products of latent vectors between features to extract + pairwise feature interactions. They constitute an excellent baseline for shallow recommender systems, + but fail to efficiently transcribe higher-level features. To avoid extensive feature engineering, it has + been suggested that deep learning can be used in conjunction with wide shallow models to extract + these higher-level features [39]. In production, these systems are regularly retrained on massive + datasets: the speedup allowed by backward unlocking in DFA is thus of particular interest. + + Setting Deep Factorization Machines (DeepFM) [40] combine FM and a deep fully connected + neural network, which we train with DFA. The input embedding is still trained directly via gradient + descent, as weight transport is not necessary to backpropagate through the FM. Deep & Cross + Networks (DCN) [41] replace the FM with a Cross Network, a deep architecture without non- + linearities capable of extracting high-degree interactions across features. We train the fully connected + network, the deep cross network, and the embeddings with DFA. Finally, Adaptative Factorization + Network (AFN) [42] uses Logarithmic Neural Networks [43] to enhance the representational power + of its deep component. We evaluate these methods on the Criteo dataset [44], which features nearly + 46 million samples of one million sparse features. This is a difficult task, where performance + improvements of the AUC on the0.001-levelcan enhance CTR significantly [39]. + + Results Performance metrics are reported in Table 2. To obtain these results, a simple hyperpa- + rameter grid search over optimization and regularization parameters was performed for BP and DFA + independently. DFA successfully trains all methods above the FM baseline, and in fact matches BP + performance in both DeepFM and AFN. Because of their complexity, recommender systems require + intensive tuning and feature engineering to perform at the state-of-the-art level–and reproducing + existing results can be challenging [45]. Hence, it is not surprising that a performance gap exists with + Deep&Cross–further fine-tuning may be necessary for DFA to reach BP performance. + Alignment measurements corroborate that learning is indeed occurring in the special layers of + Deep&Cross and AFN–see supplementary for details. Our results on recommender systems support + that DFA can learn in a large variety of settings, and that weight transport is not necessary to solve a + difficult recommendation task. + + + Table 2: AUC (higher is better) and log loss (lower is better) of recommender systems trained on the + Criteo dataset [44]. Even in complex heterogeneous architectures, DFA performance is in line with + BP. Values inboldindicate DFA AUC within 0.001 from the BP AUC or better. + + FM DeepFM Deep&Cross AFN + BP DFA BP DFA BP DFA + AUC 0.7915 0.7954 0.7956 0.8104 0.8009 0.7933 0.7924 + Loss 0.4687 0.4610 0.4624 0.4414 0.4502 0.4630 0.4621 + + + 5 3.2 Geometric Learning with Graph Convolutional Networks + + The use of sophisticated architectures beyond fully connected layers is necessary for certain tasks, + such asgeometric learning[46], where information lies in a complex structured domain. To address + geometric learning tasks, methods capable of handling graph-based data are commonly needed. + Graph convolutional neural networks (GCNNs) [47–50] have demonstrated the ability to process + large-scale graph data efficiently. 
We study the applicability of DFA to these methods, including recent architectures based on an attention mechanism. Overall, this is an especially interesting setting, as DFA fails to train more classic 2D image convolutional layers [23].

 Background  Complex data like social networks or brain connectomes lie on irregular or non-Euclidean domains. They can be represented as graphs, and efficient processing in the spectral domain is possible. Non-spectral techniques to apply neural networks to graphs have also been developed [51–53], but they exhibit unfavorable scaling properties. The success of CNNs in deep learning can be attributed to their ability to efficiently process structured high-dimensional data by sharing local filters. Thus, a generalization of the convolution operator to the graph domain is desirable: [47] first proposed a spectral convolution operation for graphs, and [48] introduced a form of regularization to enforce spatial locality of the filters. We use DFA to train different such GCNN implementations. We study both spectral and non-spectral convolutions, as well as methods inspired by the attention mechanism. We consider the task of semi-supervised node classification: nodes from a graph are classified using their relationship to other nodes as well as node-wise features.

 Setting  Fast Localized Convolutions (ChebConv) [49] approximate the graph convolution kernel with Chebyshev polynomials, and are one of the first scalable convolution methods on graphs. Graph Convolutions (GraphConv) [50] remove the need for an explicit parametrization of the kernel by enforcing linearity of the convolution operation on the graph Laplacian spectrum. It is often considered as the canonical graph convolution. More recent methods do not operate in the spectral domain. Spline Convolutions (SplineConv) [54] use a spline-based kernel, enabling the inclusion of information about the relative positioning of nodes, enhancing their representational power–for instance in the context of 3D meshes. Graph Attention Networks (GATConv) [55] use self-attention [56] layers to enable predictions at a given node to attend more specifically to certain parts of its neighborhood. Finally, building upon Jumping Knowledge Network [57], Just Jump (DNAConv) [58] uses multi-head attention [59] to enhance the aggregation process in graph convolutions and enable deeper architectures. We use PyTorch Geometric [60] for the reference implementations of all of these methods. We evaluate performance on three citation network datasets: Cora, CiteSeer, and PubMed [61].
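 As context for the results below, here is a minimal NumPy sketch of the propagation rule behind the canonical GraphConv of [50] described above; the toy graph, feature sizes, and names are illustrative only. In the experiments, such layers are updated with the random projection of the global error from Eq. (3) rather than with backpropagated gradients.

    import numpy as np

    def graph_conv(adj, feats, weights):
        # One GraphConv propagation step (Kipf & Welling): H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
        a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
        d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # symmetric degree normalization
        return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weights, 0.0)

    # Toy graph: 4 nodes, 3 input features per node, 2 output channels.
    rng = np.random.default_rng(0)
    adj = np.array([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 1.],
                    [0., 1., 1., 0.]])
    hidden = graph_conv(adj, rng.normal(size=(4, 3)), rng.normal(size=(3, 2)))  # shape (4, 2)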
Results  We report classification accuracy in Table 3. BP and DFA regularization and optimization hyperparameters are fine-tuned separately on the Cora dataset. In general, we find that less regularization and lower learning rates are needed with DFA. DFA successfully trains all graph methods, independent of whether they use the spectral domain or not, and even if they use attention. Furthermore, for GraphConv, SplineConv, and GATConv, DFA performance nearly matches BP. As GCNNs struggle with learning meaningful representations when stacking many layers [62], all architectures but DNAConv are quite shallow (two layers). However, DFA performance is still significantly higher than that of a shallow training method–see supplementary for details. The lower performance on DNAConv is not a failure to learn: alignment measurements show that learning is indeed occurring. It may be explained instead by a need for more in-depth fine-tuning, as this is a deep architecture with 5 successive attention layers.

 Table 3: Classification accuracy (%, higher is better) of graph convolution methods trained with BP and DFA, on citation networks [61]. But for ChebConv and DNAConv, DFA performance nearly matches BP performance. Values in bold when DFA is within 2.5% of BP.

              ChebConv     GraphConv    SplineConv   GATConv      DNAConv
              BP    DFA    BP    DFA    BP    DFA    BP    DFA    BP    DFA
    Cora      79.2  75.4   80.1  79.9   81.0  77.7   82.6  80.6   84.6  82.9
    CiteSeer  69.5  67.6   71.6  69.4   70.0  69.8   72.0  71.2   73.4  70.8
    PubMed    79.5  75.7   78.8  77.8   77.5  77.2   77.7  77.1   87.2  79.9

 Table 4: AUC and Average Precision (AP, higher is better) for a GraphConv GAE trained with BP or DFA on citation networks. DFA reproduces BP performance.

                      GAE
                      BP      DFA
    Cora      AUC     0.918   0.900
              AP      0.918   0.900
    CiteSeer  AUC     0.886   0.879
              AP      0.895   0.889
    PubMed    AUC     0.967   0.945
              AP      0.966   0.945

 Figure 2: t-SNE visualization of the hidden layer activations of a two-layer GraphConv trained on Cora with DFA. Classes form clear clusters, indicating that a useful intermediary representation is learned. Colors represent different classes.

 We further demonstrate that DFA helps graph convolutions learn meaningful representations by applying t-SNE [63,64] to the hidden layer activations in GraphConv (Figure 2). Clusters of classes are well-separated, indicating that a useful intermediary representation is derived by the first layer.

 Graph autoencoders  We consider one last application of graph convolutions, in the context of graph autoencoders (GAE). We train a non-probabilistic GAE [65] based on GraphConv with DFA, and report results in Table 4. DFA performance is always in line with BP.

 3.3 Natural Language Processing with Transformers

 We complete our study by training a Transformer [59] on a language modelling task. Transformers have proved successful in text, image, music generation, machine translation, and many supervised NLP tasks [59,66–69]. Here, we demonstrate that DFA can train them, and we show the influence of tuning the optimizer hyperparameters in narrowing the gap with BP.

 Background  NLP has largely benefited from advances in deep learning. Recurrent Neural Networks were responsible for early breakthroughs, but their sequential nature prevented efficient parallelization of data processing. Transformers are attention-based models that do not rely on recurrence or convolution. Their ability to scale massively has allowed the training of models with several billion parameters [70,71], obtaining state-of-the-art results on all NLP tasks: Transformers now top the prominent SQuAD 2.0 [72,73] and SuperGLUE [74] benchmarks. In parallel, transfer learning in NLP has leaped forward thanks to language modelling, the unsupervised task of predicting the next word. It can leverage virtually unlimited data from web scraping [75]. This enabled the training of universal language models [76] on extremely large and diversified text corpora. These models are useful across a wide range of domains, and can solve most NLP tasks after fine-tuning.

 Setting  The prominence of both language modelling and Transformers gives us the ideal candidate for our NLP experiments: we train a Transformer to predict the next word on the WikiText-103 dataset [77], a large collection of good and featured Wikipedia articles. We use byte-pair-encoding [78] with 32,000 tokens. Our setup is similar to GPT [66]: we adapt the Transformer, originally an encoder-decoder model designed for machine translation, to language modelling. We keep only the encoder and mask the tokens to predict. Our architecture consists of 6 layers, 8 attention heads, a model dimension of 512, and a hidden size of 2048 in the feed-forward blocks. The text is sliced in chunks of 128 tokens and batches of 64 such chunks, resulting in 8192 tokens per batch. Our baseline is trained with BP using the optimization setup of [59]. We found perplexity after 20 epochs to be an excellent indicator of perplexity at convergence; to maximize the number of experiments we could perform, we report the best validation perplexity after 20 epochs. We study two ways of implementing DFA: applying the feedback after every encoder block (macro) or after every layer in those blocks (micro). The input embedding layer receives gradients from the next feedback point through BP. This leaves some amount of weight transport even in the micro case.
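 To make the macro and micro schemes concrete, the short Python sketch below lists where the fixed random feedback matrices are attached in each case; the module names are illustrative only and do not reflect the actual implementation.

    def dfa_feedback_points(n_blocks, mode):
        # List the points of the encoder stack at which the global error is projected
        # by a fixed random matrix. "macro": one projection per encoder block;
        # "micro": one projection after each sub-layer (attention and feed-forward).
        # Modules between two feedback points (e.g. the input embedding) receive
        # gradients backpropagated locally from the nearest downstream point.
        points = []
        for b in range(n_blocks):
            if mode == "macro":
                points.append(f"block_{b}.output")
            elif mode == "micro":
                points.append(f"block_{b}.attention_output")
                points.append(f"block_{b}.feedforward_output")
            else:
                raise ValueError("mode must be 'macro' or 'micro'")
        return points

    print(dfa_feedback_points(6, "macro"))   # 6 feedback points for the 6-layer model above
    print(dfa_feedback_points(6, "micro"))   # 12 feedback points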
Table 5: Best validation perplexity after 20 epochs of a Transformer trained on WikiText-103 (lower is better). The BP and DFA baselines share all hyper-parameters. In Macro the feedback is applied after every transformer layer, while in Micro the feedback is applied after every sub-layer. The learning rate of Adam without the learning rate scheduler is 5·10^-5. With the scheduler, the initial learning rate is 1·10^-4 and it is multiplied by 0.2 when performance plateaus, with a patience of 1.
 * score after 22 epochs to let the learning rate scheduler take effect

              DFA                                                BP
              Baseline  + Adam  + β2 = 0.999  + LR schedule      Baseline  + β2 = 0.999
    Macro     95.0      77.1    55.0          52.0               34.4      29.8
    Micro     182       166     99.9          93.3*

 Results  Our results are summarized in Table 5. Hyper-parameters fine-tuned for BP did not fare well with DFA, but changes in the optimizer narrowed the gap between BP and DFA considerably. The learning rate schedule used on top of Adam [79] in [59] proved detrimental. Using Adam alone required reducing the learning rate between BP and DFA. Increasing β2 from 0.98 [59] to 0.999 improved performance significantly. Finally, a simple scheduler that reduces the learning rate when the validation perplexity plateaus helped reduce it further. Considering that the perplexity of the shallow baseline is over 400, DFA is clearly able to train Transformers. However, our results are not on par with BP, especially in the micro setting. A substantial amount of work remains to make DFA competitive with BP, even more so in a minimal weight transport scenario. The large performance improvements brought by small changes in the optimizer indicate that intensive fine-tuning, common in publications introducing state-of-the-art results, could close the gap between BP and DFA.

 4 Conclusion and outlooks

 We conducted an extensive study demonstrating the ability of DFA to train modern architectures. We considered a broad selection of domains and tasks, with complex models featuring graph convolutions and attention. Our results on large networks like NeRF and Transformers are encouraging, suggesting that with further tuning, such leading architectures can be effectively trained with DFA. Future work on principled training with DFA–in particular regarding the influence of common practices and whether new procedures are required–will help close the gap with BP.
+ More broadly, we verified for the first time that learning under synaptic asymmetry is possible beyond + fully-connected layers, and in tasks significantly more difficult than previously considered. This + addresses a notable concern in biologically-plausible architectures. DFA still requires an implausible + global feedback pathway; however, local training has already been demonstrated at scale. The next + step towards biologically-compatible learning is a local method without weight transport. + While the tasks and architectures we have considered are not biologically inspired, they constitute + a good benchmark forbehavioural realism[20]. Any learning algorithm claiming to approximate + the brain should reproduce its ability to solve complex and unseen task. Furthermore, even though + the current implementation of mechanisms like attention is devoid of biological considerations, they + represent broader concepts applicable to human brains [80]. Understanding how our brain learns is a + gradual process, and future research could incorporate further realistic elements, like spiking neurons. + Finally, unlocking the backward pass in large architectures like Transformers is promising. More opti- + mized implementation of DFA–built at a lower-level of existing ML libraries–could unlock significant + speed-up. Leveraging the use of a single random projection as the cornerstone of training, dedicated + accelerators may employ more exotic hardware architectures. This will open new possibilities in the + asynchronous training of massive models. + + + + + + + + + + + 8 Broader Impact + + Of our survey This study is the first experimental validation of DFA as an effective training method + in a wide range of challenging tasks and neural networks architectures. This significantly broadens the + applications of DFA, and more generally brings new insight on training techniques alternative to back- + propagation. From neural rendering and recommender systems, to natural language processing or + geometric learning, each of these applications has its own potential impact. Our task selection process + was motivated by current trends in deep learning, as well as by technically appealing mechanisms + (graph convolutions, attention). A limit of our survey is that our–arguably biased–selection of tasks + cannot be exhaustive. Our experiments required substantial cloud compute resources, with state-of- + the-art GPU hardware. Nevertheless, as this study provides new perspectives for hardware accelerator + technologies, it may favor the application of neural networks in fields previously inaccessible because + of computational limits. Future research on DFA should continue to demonstrate its use in novel + contexts of interest as they are discovered. + + Of the considered applications Each of the applications considered in our study has a wide + potential impact, consider for example the impact of textual bias in pretrained word embeddings [81]. + We refer to [82] and references therein for a discussion of ethical concerns of AI applications. + + Of DFA as a training method DFA enables parallelization of the backward pass and places a + single operation at the center of the training process, opening the prospect of reducing the power + consumption of training chips by an order of magnitude [28]. Not only is more efficient training a + path to more environmentally responsible machine learning [83], but it may lower the barrier of entry, + supporting equality and sustainable development goals. 
A significant downside of moving from BP to + DFA is a far more limited understanding of how to train models and how the trained models behave. + There is a clear empirical understanding of the impact of techniques such as batch normalization + or skip connections on the performance of BP; new insights need to be obtained for DFA. BP also + enjoys decades of works on topics like adversarial attacks, interpretability, and fairness. Much of + this work has to be cross-checked for alternative training methods, something we encourage further + research to consider as the next step towards safely and responsively scaling up DFA. + + Of biologically motivated method Finally, a key motivation for this study was to demonstrate that + learning challenging tasks was possible without weight transport. Biologically motivated methods + are a more foundational research direction, and as such the possible long-term impact of our findings + is harder to estimate under this light. However, fundamental research of this kind is important to open + new pathways for ML and neuroscience. + + Acknowledgments and Disclosure of Funding + + We thank Igor Carron and Laurent Daudet for the general guidance on the subject of this investigation + and the insightful comments, as well as the larger LightOn team for their support. + + References + [1]P. J. Werbos.Beyond Regression: New Tools for Prediction and Analysis in the Behavioral + Sciences. PhD thesis, Harvard University, 1974. + [2]D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error + propagation. InParallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986. + [3]Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, + David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. + InProceedings of the 34th International Conference on Machine Learning-Volume 70, pages + 1627–1635, 2017. + [4]Francis Crick. The recent excitement about neural networks.Nature, 337(6203):129–132, 1989. + [5]Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep + learning and neuroscience.Frontiers in computational neuroscience, 10:94, 2016. + + 9 [6]Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. + Cognitive science, 11(1):23–63, 1987. + [7]Javier R Movellan. Contrastive hebbian learning in the continuous hopfield model. InConnec- + tionist models, pages 10–17. Elsevier, 1991. + [8]Randall C O’Reilly. Biologically plausible error-driven learning using local activation differ- + ences: The generalized recirculation algorithm.Neural computation, 8(5):895–938, 1996. + [9]Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. InArtificial intelligence + and statistics, pages 448–455, 2009. + [10]Yann Le Cun. Learning process in an asymmetric threshold network. InDisordered systems + and biological organization, pages 233–240. Springer, 1986. + [11]Yoshua Bengio. How auto-encoders could provide credit assignment in deep networks via target + propagation.arXiv preprint arXiv:1407.7906, 2014. + [12]Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propaga- + tion. InJoint european conference on machine learning and knowledge discovery in databases, + pages 498–515. Springer, 2015. + [13]Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. 
Random synap- + tic feedback weights support error backpropagation for deep learning.Nature communications, + 7(1):1–10, 2016. + [14]Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can + scale to imagenet. InInternational Conference on Machine Learning, pages 583–593, 2019. + [15]Wojciech M Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan + Pascanu. Sobolev training for neural networks. InAdvances in Neural Information Processing + Systems, pages 4278–4287, 2017. + [16]Arild Nøkland and Lars Hiller Eidnes. Training neural networks with local error signals. In + International Conference on Machine Learning, pages 4839–4850, 2019. + [17]R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, + Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information + estimation and maximization. InInternational Conference on Learning Representations, 2019. + URLhttps://openreview.net/forum?id=Bklr3j0cKX. + [18]Sindy Löwe, Peter O’Connor, and Bastiaan Veeling. Putting an end to end-to-end: Gradient- + isolated learning of representations. InAdvances in Neural Information Processing Systems, + pages 3033–3045, 2019. + [19] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In + Advances in neural information processing systems, pages 1037–1045, 2016. + [20]Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy + Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and + architectures. InAdvances in Neural Information Processing Systems, pages 9368–9378, 2018. + [21]Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. + Backpropagation and the brain.Nature Reviews Neuroscience, pages 1–12, 2020. + [22]Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a hebbian learning rule. + Annu. Rev. Neurosci., 31:25–46, 2008. + [23]Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with + direct feedback alignment.arXiv preprint arXiv:1906.04554, 2019. + [24]Qianli Liao, Joel Z Leibo, and Tomaso Poggio. How important is weight symmetry in back- + propagation? InThirtieth AAAI Conference on Artificial Intelligence, 2016. + + 10 [25]Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep + convolutional networks.arXiv preprint arXiv:1812.06488, 2018. + + [26]Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning + algorithms can scale to large datasets. InInternational Conference on Learning Representations, + 2019. URLhttps://openreview.net/forum?id=SygvZ209F7. + + [27]Mohamed Akrout, Collin Wilson, Peter C Humphreys, Timothy Lillicrap, and Douglas Tweed. + Using weight mirrors to improve feedback alignment.arXiv preprint arXiv:1904.05391, 2019. + + [28]Julien Launay, Iacopo Poli, Kilian Müller, Igor Carron, Laurent Daudet, Florent Krzakala, and + Sylvain Gigan. Light-in-the-loop: using a photonics co-processor for scalable training of neural + networks, 2020. + + [29]Charlotte Frenkel.Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling + Roads to Embedded Cognition. PhD thesis, UCL-Université Catholique de Louvain, 2020. + + [30]Eric Penner and Li Zhang. Soft 3d reconstruction for view synthesis.ACM Transactions on + Graphics (TOG), 36(6):1–11, 2017. 
+ + [31]John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, + Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. + InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages + 2367–2376, 2019. + + [32]Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi + Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis + with prescriptive sampling guidelines.ACM Transactions on Graphics (TOG), 38(4):1–14, + 2019. + + [33]Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael + Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. InProceedings of the IEEE + Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019. + + [34]Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and + Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images.ACM + Transactions on Graphics (TOG), 38(4):65, 2019. + + [35]Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: + Continuous 3d-structure-aware neural scene representations. InAdvances in Neural Information + Processing Systems, pages 1119–1130, 2019. + + [36]Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, + and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.arXiv + preprint arXiv:2003.08934, 2020. + + [37]H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, + Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view + from the trenches. InProceedings of the 19th ACM SIGKDD international conference on + Knowledge discovery and data mining, pages 1222–1230, 2013. + + [38]Steffen Rendle. Factorization machines. In2010 IEEE International Conference on Data + Mining, pages 995–1000. IEEE, 2010. + + [39]Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, + Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for + recommender systems. InProceedings of the 1st workshop on deep learning for recommender + systems, pages 7–10, 2016. + + [40]Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a + factorization-machine based neural network for ctr prediction.arXiv preprint arXiv:1703.04247, + 2017. + + 11 [41]Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click + predictions. InProceedings of the ADKDD’17, ADKDD’17, New York, NY, USA, 2017. + Association for Computing Machinery. ISBN 9781450351942. doi: 10.1145/3124749.3124754. + URLhttps://doi.org/10.1145/3124749.3124754. + [42]Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning + adaptive-order feature interactions. InThirty-Fourth AAAI Conference on Artificial Intelligence, + 2020. + [43]J Wesley Hines. A logarithmic neural network architecture for unbounded non-linear function + approximation. InProceedings of International Conference on Neural Networks (ICNN’96), + volume 2, pages 1245–1250. IEEE, 1996. + [44]Criteo. Kaggle contest dataset is now available for academic use!http://labs.criteo.com/ + 2014/09/kaggle-contest-dataset-now-available-academic-use/, 2014. accessed + on the 2020-05-20. + [45]Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much + progress? 
a worrying analysis of recent neural recommendation approaches. InProceedings of + the 13th ACM Conference on Recommender Systems, pages 101–109, 2019. + [46]Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. + Geometric deep learning: going beyond euclidean data.IEEE Signal Processing Magazine, 34 + (4):18–42, 2017. + [47]Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally + connected networks on graphs. InInternational Conference on Learning Representations, pages + http–openreview, 2014. + [48]Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured + data.arXiv preprint arXiv:1506.05163, 2015. + [49]Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks + on graphs with fast localized spectral filtering. InAdvances in neural information processing + systems, pages 3844–3852, 2016. + [50]Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional + networks. InInternational Conference on Learning Representations (ICLR), 2017. + [51]Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph + domains. InProceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., + volume 2, pages 729–734. IEEE, 2005. + [52]Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. + The graph neural network model.IEEE Transactions on Neural Networks, 20(1):61–80, 2008. + [53]Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural + networks. InInternational Conference on Learning Representations, 2016. + [54]Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric + deep learning with continuous b-spline kernels. InProceedings of the IEEE Conference on + Computer Vision and Pattern Recognition, pages 869–877, 2018. + [55]Petar Velickoviˇ c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua´ + Bengio. Graph attention networks. InInternational Conference on Learning Representations, + 2018. URLhttps://openreview.net/forum?id=rJXMpikCZ. + [56] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly + learning to align and translate. In3rd International Conference on Learning Representations, + ICLR 2015, 2015. + [57]Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural + networks? InInternational Conference on Machine Learning, 2018. + + 12 [58]Matthias Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. In + ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. + + [59]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information + processing systems, pages 5998–6008, 2017. + + [60]Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. + InICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. + + [61]Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi- + Rad. Collective classification in network data.AI magazine, 29(3):93–93, 2008. + + [62]Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural + networks? InInternational Conference on Learning Representations, 2019. 
URLhttps: + //openreview.net/forum?id=ryGs6iA5Km. + + [63]Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine + learning research, 9(Nov):2579–2605, 2008. + + [64]David M Chan, Roshan Rao, Forrest Huang, and John F Canny. Gpu accelerated t-distributed + stochastic neighbor embedding.Journal of Parallel and Distributed Computing, 131:1–13, + 2019. + + [65]Thomas N Kipf and Max Welling. Variational graph auto-encoders.NIPS Workshop on Bayesian + Deep Learning, 2016. + + [66]Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. Improving language + understanding with unsupervised learning.Technical report, OpenAI, 2018. + + [67]Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, + and Dustin Tran. Image transformer.ArXiv, abs/1802.05751, 2018. + + [68]Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya + Sutskever. Jukebox: A generative model for music.arXiv preprint arXiv:2005.00341, 2020. + + [69]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of + deep bidirectional transformers for language understanding. InProceedings of the 2019 Confer- + ence of the North American Chapter of the Association for Computational Linguistics: Human + Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, + Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. + URLhttps://www.aclweb.org/anthology/N19-1423. + + [70]Mohammad Shoeybi, Mostofa Ali Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and + Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model + parallelism.ArXiv, abs/1909.08053, 2019. + + [71]Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, + Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are + few-shot learners.arXiv preprint arXiv:2005.14165, 2020. + + [72]Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ + questions for machine comprehension of text. InProceedings of the 2016 Conference on + Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, Novem- + ber 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL + https://www.aclweb.org/anthology/D16-1264. + + [73]Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable + questions for SQuAD. InProceedings of the 56th Annual Meeting of the Association for + Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia, + July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL + https://www.aclweb.org/anthology/P18-2124. + + 13 [74]Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix + Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose + language understanding systems. InAdvances in Neural Information Processing Systems, pages + 3261–3275, 2019. + [75]The Common Crawl Team. Common Crawl.https://commoncrawl.org, 2020. + [76]Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classifica- + tion. InACL. Association for Computational Linguistics, 2018. URLhttp://arxiv.org/ + abs/1801.06146. + [77]Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture + models.ArXiv, abs/1609.07843, 2017. 
+ [78]Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare + words with subword units. InProceedings of the 54th Annual Meeting of the Association + for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, + August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL + https://www.aclweb.org/anthology/P16-1162. + [79]Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.International + Conference on Learning Representations, 12 2014. + [80]Grace W Lindsay. Attention in psychology, neuroscience, and machine learning.Frontiers in + Computational Neuroscience, 14:29, 2020. + [81]Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. + Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In + Advances in neural information processing systems, pages 4349–4357, 2016. + [82]Alexandra Luccioni and Yoshua Bengio. On the morality of artificial intelligence.arXiv preprint + arXiv:1912.11945, 2019. + [83]Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for + deep learning in nlp.arXiv preprint arXiv:1906.02243, 2019. + [84]Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: + Rethinking self-attention in transformer models.arXiv preprint arXiv:2005.00743, 2020. + [85]Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, + and Jiawei Han. On the variance of the adaptive learning rate and beyond.arXiv preprint + arXiv:1908.03265, 2019. + [86]Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns + in transformer-based machine translation.arXiv preprint arXiv:2002.10260, 2020. + [87]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, + Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas + Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, + Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high- + performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché- + Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32, + pages 8024–8035. Curran Associates, Inc., 2019. URLhttp://papers.neurips.cc/paper/ + 9015-pytorch-an-imperative-style-high-performance-deep-learning-library. + pdf. + + + + + + + + + + + 14 Appendix + + + We first provide additional elements to corroborate our findings: alignment measurement (Section + A), and shallow baselines (Section B). We then discuss the process of adapting the considered + architectures for DFA (Section C), and the issue of weight transport in attention layers (Section D). + We provide some supplementary results for NeRF (Section E), including details of performance on + each scene of each datatset, and a discussion on possible mitigation of DFA shortcomings. Finally, + we outline steps necessary for reproduction of this work (Section F). + + A Alignment + + Alignment measurement In feedback alignment methods, the forward weights learn toalignwith + the random backward weights, making the delivered updates useful. This alignment can be quantified + by measuring the cosine similarity between the gradient signal delivered by DFABi ay and the + gradient signal BP would have deliveredWT ai+1 i+1 . 
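 In concrete terms, this is a per-layer cosine similarity between the DFA feedback B_i δa_y and the BP gradient W_{i+1}^T δa_{i+1}; a minimal NumPy sketch is given below, with illustrative layer widths and variable names that do not come from the actual measurement code.

    import numpy as np

    def alignment(B_i, delta_a_y, W_next, delta_a_next):
        # Cosine similarity between the DFA feedback B_i @ delta_a_y and the gradient
        # signal BP would have delivered, W_{i+1}^T @ delta_a_{i+1}.
        dfa_signal = B_i @ delta_a_y
        bp_signal = W_next.T @ delta_a_next
        return float(dfa_signal @ bp_signal /
                     (np.linalg.norm(dfa_signal) * np.linalg.norm(bp_signal)))

    # Toy shapes: layer of width 16, next layer of width 8, 4 output units.
    rng = np.random.default_rng(0)
    cos = alignment(rng.normal(size=(16, 4)), rng.normal(size=4),
                    rng.normal(size=(8, 16)), rng.normal(size=8))
    # For untrained random weights this value is close to 0; it grows as the forward
    # weights align with the fixed feedback matrices during training.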
For learning to occur and DFA to work as + a training method, there must be alignment. This can be measured numerically [23]. Measuring + alignments allows to check whether or not the layers are effectively being trained by DFA, regardless + of performance metrics. We note that any alignment value superior to 0 signifies that learning is + occuring. Values closer to 1 indicate a better match with BP, but small alignment values are sufficient + to enable learning. We report values measured at the deepest DFA layer. + + Recommender systems We measure alignment on the Criteo dataset, in the two architectures + featuring non-conventional fully-connected layers: Deep & Cross and AFN. Alignment is measured + after 15 epochs of training, and averaged over a random batch of 512 samples. Results are reported in + table A.1. These alignment measurements indicate that learning is indeed occurring in the cross and + logarithmic layers. High-variance of alignment in the cross layers is unique: it may be explained by + the absence of non-linearity, and account for the difference in performance between BP and DFA on + this architecture–which is higher than on the others. + + Table A.1: Alignment cosine similarity (higher is better, standard deviation in parenthesis) of + recommender systems as measured on the Criteo dataset. Learning occurs in both architectures, and + high variance may explain the larger performance gap on Deep & Cross compared to other methods. + + Deep & Cross AFN + Alignment 0.40 (0.91) 0.49 (0.08) + + + Graph convolutions We measure alignment on the Cora dataset, after 250 epochs of training, + averaging values over every sample available–train, validation, and test split included. Results are + reported in Table A.2. We observe high alignment values in all architectures, indicative that learning + is indeed occuring. Slightly lower values in SplineConv and GATConv may be explained by the use + of the Exponential Linear Unit (ELU) instead of the Rectified Linear Unit (ReLU) used as activation + in other architectures. + Table A.2: Alignment cosine similarity (standard deviation in parenthesis) of various graph convolu- + tions architectures as measured on the Cora dataset. These values corroborate that DFA successfully + trains all architectures considered. + + ChebConv GraphConv SplineConv GATConv DNAConv + Alignment 0.87 (0.12) 0.77 (0.25) 0.56 (0.22) 0.63 (0.18) 0.92 (0.30) + + + B Shallow baselines + + Shallow learning We compare DFA to BP, but also to shallow learning–where only the topmost + layer is trained. While DFA may not reach the performance level of BP, it should still vastly + + 15 Figure A.1: Comparisons of Tiny-NeRF trained with BP, DFA, and a shallow approach. Shallow + training is insufficient to learn scene geometry. Lego scene from the NeRF synthetic dataset. + + + outperform shallow learning: failure to do so would mean that the weight updates delivered by DFA + are useless. On a simple task like MNIST, a shallow baseline may be as high as 90%. However, given + the difficulty of the tasks we consider, the shallow baseline is here usually much lower. + + NeRF Because NeRF models are expensive to train–up to 15 hours on a V100–we consider a + simplified setup for the shallow baseline, NeRF-Tiny. This setup operates at half the full resolution + of the training images available, runs for 5000 iterations only, and does away with view-dependant + characteristics. 
Furthermore, the network is cut down to 3 layers of half the width of NeRF, and + no coarse network is used to inform the sampling. We train this network on the Lego scene of the + NeRF-Synthetic dataset, and compare results. + Figure A.1 presents renders generated by NeRF-Tiny trained with BP, DFA, and a shallow approach. + While BP and DFA delivers similar renders, shallow training fails to reproduce even basic scene + geometry, instead outputting a diffuse cloud of colors. This highlights that while DFA may not reach + a level of performance on-par with BP on NeRF, it nonetheless delivers meaningful updates enabling + the learning of complex features. + + Recommender systems Because recommender systems require fine-tuning, we perform the same + hyperparameter search for shallow learning than for DFA and BP. Results are detailed in Table A.3. + Performance of shallow training is always well under BP and DFA–remember that0.001-levelmatter + in recommender systems. In particular, in Deep & Cross, where there was the biggest gap between + BP and DFA, the performance of the shallow method is extremely poor, well below the FM baseline. + Finally, it is expected to see that DeepFM recovers more or less the performance of FM even with a + shallow baseline. + + Table A.3: Shallow baseline for recommender system models on the Criteo dataset. Performance is + always well below BP and DFA, as expected. + + DeepFM Deep&Cross AFN + AUC 0.7920 0.7324 0.7859 + Loss 0.4682 0.5010 0.4685 + + + Graph convolutions We use the same hyperparameters as for DFA to produce the shallow baseline + on graph datasets. Results are reported in Table A.4. Performance is always much worse than BP + and DFA. GATConv recovers the best performance: random attention layers may still deliver useful + features [84], as do random convolutions. + + Transformers In the baseline setting (optimizer and hyper-parameters of [59]), a Transformer + trained in the shallow regime yields a perplexity of 428 on WikiText-103. We do not consider + + 16 Table A.4: Shallow baseline for GCNNs on Cora, CiteSeer, and PubMed [61]. Performance is always + well below BP and DFA. + + ChebConv GraphConv SplineConv GATConv DNAConv + Cora 23.3 37.0 39.6 59.4 30.2 + CiteSeer 27.4 33.8 30.1 49.8 24.0 + PubMed 37.6 44.8 44.2 67.8 42.2 + + + + other settings, as the cost of training a Transformer is high and we do not expect any meaningful + improvements–as with NeRF above. + + + C Adapting architectures to DFA + + NeRF We use an architecture identical to the one used in [36], but based on the effective code + implementation rather than the description in the paper 1 . During our tests, we have found that + lowering the learning rate to1:10 4 rather than5:10 4 works best with DFA. + + + Recommender systems For all training methods (BP, DFA, and shallow), we have conducted + independent hyperparameter searches. We performed a grid search over the learning rate, from + 1:10 4 to1:10 3 in1:10 4 steps, as well as over the dropout probability, from0:1to0:5in0:1steps + (where applicable). On DeepFM, this search leads to reduce the learning rate from3:10 4 with BP + to5:10 5 with DFA, but to keep the 0.5 dropout rate. On Deep & Cross, we reduce learning rate + from2:10 4 to5:10 5 , with no dropout in both cases. In AFN, we reduce dropout from4:10 4 to + 3:10 4 and dropout from 0.3 to 0. + + + Graph convolutions We manually test for a few hyperparameters configuration on the Cora dataset, + focusing on learning rate, weight decay, and dropout. 
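+
+ For concreteness, a minimal sketch of the kind of hyperparameter search used here is given
+ below, with illustrative ranges following the recommender-systems grid described above; the
+ train-and-evaluate routine is a hypothetical placeholder.
+
+     import itertools
+
+     # Learning rate 1e-4 .. 1e-3 in 1e-4 steps, dropout 0.1 .. 0.5 in 0.1 steps.
+     learning_rates = [i * 1e-4 for i in range(1, 11)]
+     dropouts = [i * 0.1 for i in range(1, 6)]
+
+     def evaluate(lr, dropout):
+         """Placeholder: train with these settings and return the validation metric."""
+         raise NotImplementedError
+
+     best = None
+     for lr, p in itertools.product(learning_rates, dropouts):
+         score = evaluate(lr, p)
+         if best is None or score > best[0]:
+             best = (score, lr, p)
+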
We do not consider architectural changes, such + as changing the number of filters or of attention heads. For ChebConv and GraphConv, we reduce + weight decay to1:10 4 instead of5:10 4 , and set the dropout rate to0and0:1respectively, instead + of0:5with BP. For SplineConv, we find that no change in the hyperparameters are necessary. For + GATConv, we reduce weight decay to1:10 4 instead of5:10 4 and reduce dedicated dropout layer + to0:1instead of0:6but keep the0:6dropout rate within the GAT layer. Finally, on DNAConv we + disable weight decay entirely, instead of an original value of5:10 4 , double the learning rate from + 5:10 3 to1:10 2 , and disable dropout entirely. In all cases, we share the backward random matrix + across all nodes in a graph. + + + Transformers The model hyper-parameters were fixed across all of our experiments, except for + the number of attention heads in one case, that we will precise below, and dropout. We tested several + values of dropout probability between 0 and 0.5, but found the original value of 0.1 to perform + best. We manually tested a number of optimizers, optimizer parameters and attention mechanisms. + We tested four combinations of optimizers and schedulers : Adam with the scheduler used in [59], + Adam alone, RAdam [85] alone, and Adam with a scheduler that reduces the learning rate when + the validation perplexity plateaus. We found it necessary to reduce the initial learning rate of Adam + from1:10 4 to5:10 5 , although it could be set back to1:10 4 with a scheduler. We tried two values + of2 : 0.98 and 0.999. We also tried to change1 and observed some small differences that were + not significant enough for the main text. Finally, we tried three attention mechanisms in addition to + the standard multihead scaled dot-product attention: the dense and random (learnable) Synthesizers + of [84], as well as the fixed attention patterns of [86]. The latter needed to be adapted to language + modelling to prevent attending to future tokens, which led us to reduced the number of attention + heads to 4. The backward random matrix is always shared across all tokens and batches. + + + 1 https://github.com/bmild/nerf/issues/11 + + 17 D Weight transport and attention + + We consider an attention layer operating on inputx. The queries, keys, and values are respectively + q=xW Q ;k=xW K ;v=xW V , anddk is the dimension of the queries and keys. The layer + performs: qk T + Attention(q;k;v) =softmax p v (4)dk + + When using DFA on attention, we deliver the random feedback to the top of the layer. Accordingly, + to obtain updates toWQ ;WK ;andWV we still to have to backpropagate through the attention + mechanism itself. This involves weight transport onWV , sacrificing some biological realism for + simplicity. Overall weight transport between layers still does not occur, and updating the layers in + parallel remains possible. + Beside using FA or DFA within the attention layer, alternative mechanisms like the synthesizer + [84]–which uses random attention in place of the query and key system–or fixed attention [86] can + remove the need for weight transport. Implementing these mechanisms in DFA-trained Transformers, + or other attention-powered architectures, will require further research. + + + E Supplementary NeRF results + + Quantitative results We report per-scene scores for each dataset in Table A.5. BP values are taken + from [36]. On three scenes of the synthetic datasets, NeRF-DFA even outperforms past state-of-the-art + methods trained with BP. 
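+
+ The per-scene comparisons in Table A.5 are reported as PSNR; for reference, a minimal
+ NumPy sketch of how PSNR between a render and its ground truth is typically computed
+ (assuming pixel values in [0, 1]):
+
+     import numpy as np
+
+     def psnr(render, ground_truth, max_val=1.0):
+         """Peak signal-to-noise ratio (dB) between two images of equal shape."""
+         diff = render.astype(np.float64) - ground_truth.astype(np.float64)
+         mse = np.mean(diff ** 2)
+         return 10.0 * np.log10(max_val ** 2 / mse)
+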
Note that Neural Volumes (NV) is not applicable to forward-facing view + synthesis–as is required in LLFF-Real–and thus no results are reported. + + Qualitative results We report sample renders from the NeRF-Synthetic dataset (Figure A.2) and + the LLFF-Real dataset (Figure A.2), for every scene available. However, we recommend readers to + consult the supplementary video to make better sense of characteristics like multi-view consistency + and view-dependent effects (most visible on the LLFF-Real Room scene). + + + Table A.5: Per-scene PSNR for NeRF DFA and BP against other state-of-the-art methods on the + Nerf-Synthetic and LLFF-Real. DFA performance is fairly homogeneous across each dataset and in + line with the differences in other methods. + + NV SRN LLFF NeRF + BP BP BP BP DFA + NeRF-Synthetic 26.05 22.26 24.88 31.01 25.41 + Chair 28.33 26.96 28.72 33.00 28.74 + Drums 22.58 17.18 21.13 25.01 22.15 + Ficus 24.79 20.73 21.79 30.13 25.61 + Hotdog 30.71 26.81 31.41 36.18 28.03 + Lego 26.08 20.85 24.54 32.54 24.93 + Materials 24.22 18.09 20.72 29.62 25.15 + Mic 27.78 26.85 27.48 32.91 25.43 + Ship 23.93 20.60 23.22 28.65 23.25 + LLFF-Real 22.84 24.13 26.50 20.77 + Room 27.29 28.42 32.70 24.20 + Fern 21.37 22.95 25.17 21.82 + Leaves 18.24 19.52 20.92 16.50 + Fortress 26.63 29.40 31.16 25.16 + Orchids 17.37 18.52 20.36 16.73 + Flower 26.63 25.46 27.40 21.55 + T-Rex 22.87 24.15 26.80 19.43 + Horns 24.33 24.70 27.45 20.75 + + + 18 Possible future directions Despite retranscribing scene geometry in a multi-view consistent way, + NeRF produces renders of a lower quality when trained with DFA instead of BP. In particular, it + struggles to transcribe small-scale details, resulting in "blurry" renders. Moreover, it displays high- + frequency artefacts: not in the scene geometry, but in individual pixels taking values very distant from + their neighborhood. Interestingly, this noise phenomenon is unique to NeRF-DFA: it is not observed + on NeRF-BP with similar PSNR values (achieved during training) or on other methods with similar + or lower PSNR. This leads us to hypothesize this is an aspect unique to DFA, possibly due to the + alignment process. Indeed, DFA creates a bias on the weights, by encouraging them to be "aligned" + with an arbitrary values dependant on the random matrix used. It is possible this could introduce + random noise in the final renders–though we leave a more principled experiment to future research. + To attempt to alleviate this issue, we first consider NeRF-Dual. In NeRF-Dual, we average the + pixel-wise prediction between the fine and coarse network, to attempt to remove some of the noise. + To do so, we first still use the coarse network to create a probability distribution for the hierarchical + sampling. Then, we evaluate again both the coarse and fine networks at the locations informed by + this probability distribution. Compared to vanilla NeRF, this requires an extra batch of evaluation of + the coarse network for all rays–rougly speaking, this increases inference time by 30-50% depending + on the coarse network architecture considered. We note that this is not applied during training, so that + training times remain identical. + Figure A.2 and Figure A.3 showcase comparisons between NeRF and NeRF-Dual trained with DFA + on all scenes. When viewed at high resolution–such as in our supplementary video–the NeRF-Dual + renders are more pleasing, especially for the full scenes. They remove most of the high-frequency + noise, leading to smoother renders. 
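+
+ A minimal sketch of the NeRF-Dual inference step described above is given below; all
+ callables are hypothetical stand-ins for the usual NeRF components (uniform sampling,
+ hierarchical sampling, network evaluation and compositing), not the actual implementation.
+
+     def render_ray_dual(ray, coarse_model, fine_model,
+                         sample_uniform, sample_from_pdf, composite):
+         """One ray of NeRF-Dual; composite() is assumed to return (weights, rgb)."""
+         # Coarse pass, as in vanilla NeRF, to build the sampling distribution.
+         coarse_pts = sample_uniform(ray)
+         weights, _ = composite(coarse_model(coarse_pts))
+         # Hierarchical sampling informed by the coarse weights.
+         fine_pts = sample_from_pdf(ray, weights)
+         # Evaluate BOTH networks at the fine locations (the extra evaluation
+         # mentioned above) and average their pixel-wise predictions.
+         _, rgb_fine = composite(fine_model(fine_pts))
+         _, rgb_coarse = composite(coarse_model(fine_pts))
+         return 0.5 * (rgb_fine + rgb_coarse)
+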
However, this averaging process further blurs small-scale details in + the render. This is especially visible in the NeRF-Synthetic dataset, on scenes like Ficus. Furthermore, + NeRF-Dual introduces novel artefacts in the Mic and Ship scenes, with areas improperly colored + with a violet tint. The cause for these artefacts is unknown, but they show that NeRF-Dual is far from + a silver bullet. The PSNR is also minimally increased, by less than 0.5 per scene. Nevertheless, this + shows some promise in possibilities to allievate the shortcomings of NeRF-DFA. It is possible that + changes to the overall rendering process, or the use of classic image processing techniques, may help + enhance the NeRF-DFA images. + Finally, we also experimented with increasing the capacity of the fine network, by widening its layers + to 512 neurons. We call this architecture NeRF-XL. However, we have not succeeded in getting + PSNR values higher than with vanilla NeRF on DFA. In particular, the training process becomes + much more cumbersome, as multi-GPU parallelism is needed to fit the model. It is possible that + higher network capacity may help learning both the task at hand and to align simultaneously, but + further work is required. + + + F Reproducibility + + Hardware used All main experiments require at most a single NVIDIA V100 GPU with 16GB + of memory to reproduce. Alignment measurement on large architectures (NeRF and Transformers) + require a second identical GPU to keep a copy of the network to evaluate BP gradients. + We estimate that a total of around 10,000 GPU-hours on V100s were necessary for this paper. + Accordingly, we estimate the cloud-computing carbon impact of this paper to be of 1700 kgCO 2 eq 2 . + However, without hyperparameter searches, our results can be reproduced with less than 500 GPU- + hours on V100s, with most of that budget going to NeRF and Transformers. + + Implementation We use the shared random matrix trick from [23] to reduce memory use in DFA + and enable its scaling to large networks. We use PyTorch [87] for all experiments. For reference + implementation of the methods considered, we relied on various sources. Our NeRF implementation + is based on the PyTorch implementation by Krishna Murthy 3 , with modifications to allow for proper + test and validation, as well as DFA and multi-GPU support. For recommender systems, we use + + 2 https://mlco2.github.io/impact#compute + 3 https://github.com/krrish94/nerf-pytorch + + 19 thetorchfmpackage 4 . Finally, we use PyTorch Geometric [60] for all graph operations. Our + Transformer implementation is our own. Our code is available as supplementary material. + + NeRF We provide training, testing, and rendering code along with the configurations used to obtain + our results. An example to reproduce our results is given in the supplementary code repository. Given + the computing cost associated with training a NeRF, we also provide our trained models. + + Recommender systems We provide bash scripts to reproduce the results in Table 2 and A.3, with + the results of our hyperparameter search. We provide code to reproduce the results in Table A.1. + + Graph convolutions We provide the code to reproduce all of our results. Note that the t-SNE + results are not exactly reproducible, as the CUDA implementation used is non-deterministic. + + Transformers We provide bash scripts to reproduce Table 5 and the shallow results. 
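+
+ To illustrate the shared random matrix trick from [23] mentioned above, a minimal NumPy
+ sketch follows; the layer widths and output dimension are hypothetical.
+
+     import numpy as np
+
+     rng = np.random.default_rng(0)
+     layer_widths = [1024, 512, 256]      # hypothetical hidden-layer widths
+     output_dim = 10
+
+     # One random matrix is drawn once; each layer's feedback matrix B_i is a
+     # slice of it, so memory no longer grows with the number of layers.
+     B_shared = rng.standard_normal((max(layer_widths), output_dim))
+
+     def dfa_feedback(layer_index, error):
+         """Random feedback delivered to a layer for the output error vector."""
+         width = layer_widths[layer_index]
+         return B_shared[:width] @ error  # shape: (width,)
+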
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 4 https://github.com/rixwew/pytorch-fm + + 20 Figure A.2: Sample renders for every scene of the NeRF-Synthetic dataset, for NeRF and NeRF-Dual + trained with DFA. + + + + + + + + + + + 21 Figure A.3: Sample renders for every scene of the LLFF-Real dataset, for NeRF and NeRF-Dual + trained with DFA. + + + + + + + + + + + + 22 \ No newline at end of file diff --git a/Corpus/Efficient Behavior of Small-World Networks.txt b/Corpus/Efficient Behavior of Small-World Networks.txt new file mode 100644 index 0000000..18b01f0 Binary files /dev/null and b/Corpus/Efficient Behavior of Small-World Networks.txt differ diff --git a/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt b/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt new file mode 100644 index 0000000..319bda1 Binary files /dev/null and b/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt differ diff --git a/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt b/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt new file mode 100644 index 0000000..64f926a Binary files /dev/null and b/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt differ diff --git a/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt b/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt new file mode 100644 index 0000000..2c16ab6 --- /dev/null +++ b/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt @@ -0,0 +1,261 @@ + Energy and Policy Considerations for Deep Learning in NLP + + + Emma Strubell Ananya Ganesh Andrew McCallum + College of Information and Computer Sciences + University of Massachusetts Amherst + {strubell, aganesh, mccallum}@cs.umass.edu + + + + + + Abstract Consumption CO 2 e (lbs) + Air travel, 1 passenger, NY↔SF 1984 Recent progress in hardware and methodol- + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + arXiv:1906.02243v1 [cs.CL] 5 Jun 2019 Human life, avg, 1 year 11,023 ogy for training neural networks has ushered + in a new generation of large networks trained American life, avg, 1 year 36,156 + on abundant data. These models have ob- Car, avg incl. fuel, 1 lifetime 126,000 + tained notable gains in accuracy across many + NLP tasks. However, these accuracy improve- Training one model (GPU) + ments depend on the availability of exception- NLP pipeline (parsing, SRL) 39 ally large computational resources that neces- w/ tuning & experimentation 78,468 sitate similarly substantial energy consump- Transformer (big) 192 tion. As a result these models are costly to + train and develop, both financially, due to the w/ neural architecture search 626,155 + cost of hardware and electricity or cloud com- Table 1: Estimated COpute time, and environmentally,due to the car- 2 emissions from training com- + mon NLP models, compared to familiar consumption. 1 bon footprint required to fuel modern tensor + processing hardware. In this paper we bring + this issue to the attention of NLP researchers NLP models could be trained and developed on by quantifying the approximate financial and a commodity laptop or server, many now require environmental costs of training a variety of re- + cently successful neural network models for multiple instances of specialized hardware such as + NLP. 
Based on these findings, we propose ac- GPUs or TPUs, therefore limiting access to these + tionable recommendations to reduce costs and highly accurate models on the basis of finances. + improve equity in NLP research and practice. Even when these expensive computational re- + 1 Introduction sources are available, model training also incurs a + substantial cost to the environment due to the en- + Advances in techniques and hardware for train- ergy required to power this hardware for weeks or + ing deep neural networks have recently en- months at a time. Though some of this energy may + abled impressive accuracy improvements across come from renewable or carbon credit-offset re- + many fundamental NLP tasks ( Bahdanau et al., sources, the high energy demands of these models + 2015; Luong et al., 2015; Dozat and Man- are still a concern since (1) energy is not currently + ning, 2017; Vaswani et al., 2017), with the derived from carbon-neural sources in many loca- + most computationally-hungry models obtaining tions, and (2) when renewable energy is available, + the highest scores (Peters et al.,2018;Devlin et al., it is still limited to the equipment we have to pro- + 2019;Radford et al.,2019;So et al.,2019). As duce and store it, and energy spent training a neu- + a result, training a state-of-the-art model now re- ral network might better be allocated to heating a + quires substantial computational resources which family’s home. It is estimated that we must cut + demand considerable energy, along with the as- carbon emissions by half over the next decade to + sociated financial and environmental costs. Re- deter escalating rates of natural disaster, and based + search and development of new models multiplies on the estimated CO 2 emissions listed in Table 1, + these costs by thousands of times by requiring re- + training to experiment with model architectures 1 Sources: (1) Air travel and per-capita consump- + tion: https://bit.ly/2Hw0xWc; (2) car lifetime: and hyperparameters. Whereas a decade ago most https://bit.ly/2Qbr0w1. model training and development likely make up Consumer Renew. Gas Coal Nuc. + a substantial portion of the greenhouse gas emis- China 22% 3% 65% 4% + sions attributed to many NLP researchers. Germany 40% 7% 38% 13% + To heighten the awareness of the NLP commu- United States 17% 35% 27% 19% + nity to this issue and promote mindful practice and Amazon-AWS 17% 24% 30% 26% + policy, we characterize the dollar cost and carbon Google 56% 14% 15% 10% + emissions that result from training the neural net- Microsoft 32% 23% 31% 10% + works at the core of many state-of-the-art NLP + models. We do this by estimating the kilowatts Table 2: Percent energy sourced from: Renewable (e.g. + of energy required to train a variety of popular hydro, solar, wind), natural gas, coal and nuclear for + off-the-shelf NLP models, which can be converted the top 3 cloud compute providers (Cook et al.,2017), + to approximate carbon emissions and electricity compared to the United States, 4 China 5 and Germany + costs. To estimate the even greater resources re- (Burger,2019). + quired to transfer an existing model to a new task + or develop new models, we perform a case study We estimate the total time expected for mod- + of the full computational resources required for the els to train to completion using training times and + development and tuning of a recent state-of-the-art hardware reported in the original papers. We then + NLP pipeline (Strubell et al.,2018). 
We conclude calculate the power consumption in kilowatt-hours + with recommendations to the community based on (kWh) as follows. Letpc be the average power + our findings, namely: (1) Time to retrain and sen- draw (in watts) from all CPU sockets during train- + sitivity to hyperparameters should be reported for ing, letpr be the average power draw from all + NLP machine learning models; (2) academic re- DRAM (main memory) sockets, letpg be the aver- + searchers need equitable access to computational age power draw of a GPU during training, and let + resources; and (3) researchers should prioritize de- gbe the number of GPUs used to train. We esti- + veloping efficient models and hardware. mate total power consumption as combined GPU, + CPU and DRAM consumption, then multiply this + 2 Methods by Power Usage Effectiveness (PUE), which ac- + counts for the additional energy required to sup-To quantify the computational and environmen- port the compute infrastructure (mainly cooling).tal cost of training deep neural network mod- We use a PUE coefficient of 1.58, the 2018 globalels for NLP, we perform an analysis of the en- average for data centers (Ascierto,2018). Then theergy required to train a variety of popular off- total powerpthe-shelf NLP models, as well as a case study of t required at a given instance during + training is given by:the complete sum of resources required to develop + LISA (Strubell et al.,2018), a state-of-the-art NLP 1.58t(pp c +pr +gp g ) + model from EMNLP 2018, including all tuning t = (1)1000 + and experimentation. The U.S. Environmental Protection Agency (EPA)We measure energy use as follows. We train the provides average COmodels described in§2.1using the default settings 2 produced (in pounds per + kilowatt-hour) for power consumed in the U.S.provided, and sample GPU and CPU power con- (EPA,2018), which we use to convert power tosumption during training. Each model was trained estimated COfor a maximum of 1 day. We train all models on 2 emissions: + + a single NVIDIA Titan X GPU, with the excep- CO 2 e = 0.954pt (2) + tion of ELMo which was trained on 3 NVIDIA This conversion takes into account the relative pro-GTX 1080 Ti GPUs. While training, we repeat- portions of different energy sources (primarily nat-edly query the NVIDIA System Management In- ural gas, coal, nuclear and renewable) consumedterface 2 to sample the GPU power consumption to produce energy in the United States. Table2and report the average over all samples. To sample lists the relative energy sources for China, Ger-CPU power consumption, we use Intel’s Running many and the United States compared to the topAverage Power Limit interface. 3 + 5 U.S. Dept. of Energy:https://bit.ly/2JTbGnI + 2 nvidia-smi:https://bit.ly/30sGEbi 5 China Electricity Council; trans. China Energy Portal: + 3 RAPL power meter:https://bit.ly/2LObQhV https://bit.ly/2QHE5O3 three cloud service providers. The U.S. break- ence. Devlin et al.(2019) report that the BERT + down of energy is comparable to that of the most base model (110M parameters) was trained on 16 + popular cloud compute service, Amazon Web Ser- TPU chips for 4 days (96 hours). NVIDIA reports + vices, so we believe this conversion to provide a that they can train a BERT model in 3.3 days (79.2 + reasonable estimate of CO 2 emissions per kilowatt hours) using 4 DGX-2H servers, totaling 64 Tesla + hour of compute energy used. V100 GPUs (Forster et al.,2019). + GPT-2. 
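+
+ Putting equations (1) and (2) together, a minimal Python sketch of this estimate is shown
+ below; the power-draw values in the usage lines are purely illustrative, not measurements.
+
+     def estimate_kwh(hours, cpu_watts, dram_watts, gpu_watts, num_gpus, pue=1.58):
+         """Total energy (kWh) as in equation (1): PUE times summed average draw."""
+         return pue * hours * (cpu_watts + dram_watts + num_gpus * gpu_watts) / 1000.0
+
+     def estimate_co2e_lbs(kwh, lbs_per_kwh=0.954):
+         """CO2 emissions (lbs) as in equation (2), using the EPA U.S. average."""
+         return lbs_per_kwh * kwh
+
+     # Illustrative 8-GPU job lasting 84 hours with hypothetical power draws.
+     kwh = estimate_kwh(hours=84, cpu_watts=150, dram_watts=50,
+                        gpu_watts=160, num_gpus=8)
+     co2e = estimate_co2e_lbs(kwh)
+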
This model is the latest edition of + 2.1 Models OpenAI’s GPT general-purpose token encoder, + We analyze four models, the computational re- also based on Transformer-style self-attention and + quirements of which we describe below. All mod- trained with a language modeling objective (Rad- + els have code freely available online, which we ford et al.,2019). By training a very large model + used out-of-the-box. For more details on the mod- on massive data,Radford et al.(2019) show high + els themselves, please refer to the original papers. zero-shot performance on question answering and + language modeling benchmarks. The large modelTransformer. The Transformer model (Vaswani described inRadford et al.(2019) has 1542M pa-et al.,2017) is an encoder-decoder architecture rameters and is reported to require 1 week (168primarily recognized for efficient and accurate ma- hours) of training on 32 TPUv3 chips. 6 chine translation. The encoder and decoder each + consist of 6 stacked layers of multi-head self- + attention. Vaswani et al.(2017) report that the 3 Related work + Transformerbasemodel (65M parameters) was + trained on 8 NVIDIA P100 GPUs for 12 hours, There is some precedent for work characterizing + and the Transformerbigmodel (213M parame- the computational requirements of training and in- + ters) was trained for 3.5 days (84 hours; 300k ference in modern neural network architectures in + steps). This model is also the basis for recent the computer vision community.Li et al.(2016) + work on neural architecture search (NAS) for ma- present a detailed study of the energy use required + chine translation and language modeling (So et al., for training and inference in popular convolutional + 2019), and the NLP pipeline that we study in more models for image classification in computer vi- + detail in§4.2(Strubell et al.,2018). So et al. sion, including fine-grained analysis comparing + (2019) report that their full architecture search ran different neural network layer types. Canziani + for a total of 979M training steps, and that their et al.(2016) assess image classification model ac- + base model requires 10 hours to train for 300k curacy as a function of model size and gigaflops + steps on one TPUv2 core. This equates to 32,623 required during inference. They also measure av- + hours of TPU or 274,120 hours on 8 P100 GPUs. erage power draw required during inference on + GPUs as a function of batch size. Neither work an-ELMo. The ELMo model (Peters et al.,2018) alyzes the recurrent and self-attention models thatis based on stacked LSTMs and provides rich have become commonplace in NLP, nor do theyword representations in context by pre-training on extrapolate power to estimates of carbon and dol-a large amount of data using a language model- lar cost of training.ing objective. Replacing context-independent pre- + trained word embeddings with ELMo has been Analysis of hyperparameter tuning has been + shown to increase performance on downstream performed in the context of improved algorithms + tasks such as named entity recognition, semantic for hyperparameter search (Bergstra et al.,2011; + role labeling, and coreference.Peters et al.(2018) Bergstra and Bengio,2012;Snoek et al.,2012). To + report that ELMo was trained on 3 NVIDIA GTX our knowledge there exists to date no analysis of + 1080 GPUs for 2 weeks (336 hours). 
the computation required for R&D and hyperpa- + rameter tuning of neural network models in NLP.BERT.The BERT model (Devlin et al.,2019) pro- + vides a Transformer-based architecture for build- + ing contextual representations similar to ELMo, 6 Via the authorson Reddit. + 7 GPU lower bound computed using pre-emptible but trained with a different language modeling ob- P100/V100 U.S. resources priced at $0.43–$0.74/hr, upper + jective. BERT substantially improves accuracy on bound uses on-demand U.S. resources priced at $1.46– + tasks requiring sentence-level representations such $2.48/hr. We similarly use pre-emptible ($1.46/hr–$2.40/hr) + and on-demand ($4.50/hr–$8/hr) pricing as lower and upper as question answering and natural language infer- bounds for TPU v2/3; cheaper bulk contracts are available. Model Hardware Power (W) Hours kWh·PUE CO 2 e Cloud compute cost + Transformer base P100x8 1415.78 12 27 26 $41–$140 + Transformer big P100x8 1515.43 84 201 192 $289–$981 + ELMo P100x3 517.66 336 275 262 $433–$1472 + BERT base V100x64 12,041.51 79 1507 1438 $3751–$12,571 + BERT base TPUv2x16 — 96 — — $2074–$6912 + NAS P100x8 1515.43 274,120 656,347 626,155 $942,973–$3,201,722 + NAS TPUv2x1 — 32,623 — — $44,055–$146,848 + GPT-2 TPUv3x32 — 168 — — $12,902–$43,008 + + Table 3: Estimated cost of training a model in terms of CO 2 emissions (lbs) and cloud compute cost (USD). 7 Power + and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware. + + + 4 Experimental results Estimated cost (USD) + Models Hours Cloud compute Electricity4.1 Cost of training 1 120 $52–$175 $5Table3lists CO 2 emissions and estimated cost of 24 2880 $1238–$4205 $118training the models described in§2.1. Of note is 4789 239,942 $103k–$350k $9870that TPUs are more cost-efficient than GPUs on + workloads that make sense for that hardware (e.g. Table 4: Estimated cost in terms of cloud compute and + BERT). We also see that models emit substan- electricity for training: (1) a single model (2) a single + tial carbon emissions; training BERT on GPU is tune and (3) all models trained during R&D. + roughly equivalent to a trans-American flight.So + et al.(2019) report that NAS achieves a new state- about 60 GPUs running constantly throughout theof-the-art BLEU score of 29.7 for English to Ger- 6 month duration of the project. Table4lists upperman machine translation, an increase of just 0.1 and lower bounds of the estimated cost in termsBLEU at the cost of at least $150k in on-demand of Google Cloud compute and raw electricity re-compute time and non-trivial carbon emissions. quired to develop and deploy this model. 9 We see + that while training a single model is relatively in-4.2 Cost of development: Case study expensive, the cost of tuning a model for a newTo quantify the computational requirements of dataset, which we estimate here to require 24 jobs,R&D for a new model we study the logs of or performing the full R&D required to developall training required to develop Linguistically- this model, quickly becomes extremely expensive.Informed Self-Attention (Strubell et al.,2018), a + multi-task model that performs part-of-speech tag- 5 Conclusions + ging, labeled dependency parsing, predicate detec- + tion and semantic role labeling. This model makes Authors should report training time and + for an interesting case study as a representative sensitivity to hyperparameters. + NLP pipeline and as a Best Long Paper at EMNLP. 
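+
+ For reference, the cloud compute ranges in Table 3 follow from multiplying total
+ accelerator-hours by the hourly rates in footnote 7; a small sketch of that arithmetic for
+ the Transformer (big) row:
+
+     def cloud_cost_bounds(gpu_hours, low_rate, high_rate):
+         """Lower/upper cloud cost: accelerator-hours times the cheapest
+         (pre-emptible) and most expensive (on-demand) hourly rates."""
+         return gpu_hours * low_rate, gpu_hours * high_rate
+
+     # 8 P100s for 84 hours at $0.43/hr (pre-emptible) and $1.46/hr (on-demand)
+     # gives roughly (288.96, 981.12), i.e. the $289-$981 entry in Table 3.
+     low, high = cloud_cost_bounds(8 * 84, 0.43, 1.46)
+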
Our experiments suggest that it would be benefi- + Model training associated with the project cial to directly compare different models to per- + spanned a period of 172 days (approx. 6 months). form a cost-benefit (accuracy) analysis. To ad- + During that time 123 small hyperparameter grid dress this, when proposing a model that is meant + searches were performed, resulting in 4789 jobs to be re-trained for downstream use, such as re- + in total. Jobs varied in length ranging from a min- training on a new domain or fine-tuning on a new + imum of 3 minutes, indicating a crash, to a maxi- task, authors should report training time and com- + mum of 9 days, with an average job length of 52 putational resources required, as well as model + hours. All training was done on a combination of sensitivity to hyperparameters. This will enable + NVIDIA Titan X (72%) and M40 (28%) GPUs. 8 direct comparison across models, allowing subse- + The sum GPU time required for the project quent consumers of these models to accurately as- + totaled 9998 days (27 years). This averages to sess whether the required computational resources + 8 We approximate cloud compute cost using P100 pricing. 9 Based on average U.S cost of electricity of $0.12/kWh. are compatible with their setting. More explicit half the estimated cost to use on-demand cloud + characterization of tuning time could also reveal GPUs. Unlike money spent on cloud compute, + inconsistencies in time spent tuning baseline mod- however, that invested in centralized resources + els compared to proposed contributions. Realiz- would continue to pay off as resources are shared + ing this will require: (1) a standard, hardware- across many projects. A government-funded aca- + independent measurement of training time, such demic compute cloud would provide equitable ac- + as gigaflops required to convergence, and (2) a cess to all researchers. + standard measurement of model sensitivity to data + and hyperparameters, such as variance with re- Researchers should prioritize computationally + spect to hyperparameters searched. efficient hardware and algorithms. + We recommend a concerted effort by industry and + Academic researchers need equitable access to academia to promote research of more computa- + computation resources. tionally efficient algorithms, as well as hardware + that requires less energy. An effort can also beRecent advances in available compute come at a made in terms of software. There is already ahigh price not attainable to all who desire access. precedent for NLP software packages prioritizingMost of the models studied in this paper were de- efficient models. An additional avenue throughveloped outside academia; recent improvements in which NLP and machine learning software de-state-of-the-art accuracy are possible thanks to in- velopers could aid in reducing the energy asso-dustry access to large-scale compute. ciated with model tuning is by providing easy-Limiting this style of research to industry labs to-use APIs implementing more efficient alterna-hurts the NLP research community in many ways. tives to brute-force grid search for hyperparameterFirst, it stifles creativity. Researchers with good tuning, e.g. random or Bayesian hyperparameterideas but without access to large-scale compute search techniques (Bergstra et al.,2011;Bergstrawill simply not be able to execute their ideas, and Bengio,2012;Snoek et al.,2012). Whileinstead constrained to focus on different prob- software packages implementing these techniqueslems. 
Second, it prohibits certain types of re- do exist, 10 they are rarely employed in practicesearch on the basis of access to financial resources. for tuning NLP models. This is likely becauseThis even more deeply promotes the already prob- their interoperability with popular deep learninglematic “rich get richer” cycle of research fund- frameworks such as PyTorch and TensorFlow ising, where groups that are already successful and not optimized, i.e. there are not simple exam-thus well-funded tend to receive more funding ples of how to tune TensorFlow Estimators usingdue to their existing accomplishments. Third, the Bayesian search. Integrating these tools into theprohibitive start-up cost of building in-house re- workflows with which NLP researchers and practi-sources forces resource-poor groups to rely on tioners are already familiar could have notable im-cloud compute services such as AWS, Google pact on the cost of developing and tuning in NLP.Cloud and Microsoft Azure. + While these services provide valuable, flexi- Acknowledgements + ble, and often relatively environmentally friendly We are grateful to Sherief Farouk and the anony- compute resources, it is more cost effective for mous reviewers for helpful feedback on earlieracademic researchers, who often work for non- drafts. This work was supported in part by theprofit educational institutions and whose research Centers for Data Science and Intelligent Infor-is funded by government entities, to pool resources mation Retrieval, the Chan Zuckerberg Initiativeto build shared compute centers at the level of under the Scientific Knowledge Base Construc-funding agencies, such as the U.S. National Sci- tion project, the IBM Cognitive Horizons Networkence Foundation. For example, an off-the-shelf agreement no. W1668553, and National ScienceGPU server containing 8 NVIDIA 1080 Ti GPUs Foundation grant no. IIS-1514053. Any opinions,and supporting hardware can be purchased for findings and conclusions or recommendations ex-approximately $20,000 USD. At that cost, the pressed in this material are those of the authors andhardware required to develop the model in our do not necessarily reflect those of the sponsor.case study (approximately 58 GPUs for 172 days) + would cost $145,000 USD plus electricity, about 10 For example, theHyperopt Python library. References Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt + Gardner, Christopher Clark, Kenton Lee, and LukeRhonda Ascierto. 2018.Uptime Institute Global Data Zettlemoyer. 2018. Deep contextualized word rep-Center Survey. Technical report, Uptime Institute. resentations. InNAACL. + Dzmitry Bahdanau, KyunghyunCho, and Yoshua Ben- + gio. 2015. Neural Machine Translation by Jointly Alec Radford, Jeffrey Wu, Rewon Child, David Luan, + Learning to Align and Translate. In3rd Inter- Dario Amodei, and Ilya Sutskever. 2019.Language + national Conference for Learning Representations models are unsupervised multitask learners. + (ICLR), San Diego, California, USA. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. + James Bergstra and Yoshua Bengio. 2012. Random 2012. Practical bayesian optimization of machine + search for hyper-parameter optimization.Journal of learning algorithms. InAdvances in neural informa- + Machine Learning Research, 13(Feb):281–305. tion processing systems, pages 2951–2959. + + James S Bergstra, R´emi Bardenet, Yoshua Bengio, and David R. So, Chen Liang, and Quoc V. Le. 2019. + Bal´azs K´egl. 2011. Algorithms for hyper-parameter The evolved transformer. 
InProceedings of the + optimization. InAdvances in neural information 36th InternationalConference on Machine Learning + processing systems, pages 2546–2554. (ICML). + + Bruno Burger. 2019.Net Public Electricity Generation Emma Strubell, Patrick Verga, Daniel Andor, + in Germany in 2018. Technical report, Fraunhofer David Weiss, and Andrew McCallum. 2018. + Institute for Solar Energy Systems ISE. Linguistically-Informed Self-Attention for Se- + mantic Role Labeling. InConference on Empir-Alfredo Canziani, Adam Paszke, and Eugenio Culur- ical Methods in Natural Language Processingciello. 2016. An analysis of deep neural network (EMNLP), Brussels, Belgium. models for practical applications . + Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobGary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Uszkoreit, Llion Jones, Aidan N Gomez, LukaszDeans, Brian Johnson, Elizabeth Jardim, and Brian Kaiser, and Illia Polosukhin. 2017. Attention is allJohnson. 2017. Clicking Clean: Who is winning you need. In31st Conference on Neural Informationthe race to build a green internet?Technical report, Processing Systems (NIPS).Greenpeace. + Jacob Devlin, Ming-Wei Chang, Kenton Lee, and + Kristina Toutanova. 2019. BERT: Pre-training of + Deep Bidirectional Transformers for Language Un- + derstanding. InNAACL. + Timothy Dozat and Christopher D. Manning. 2017. + Deep biaffine attention for neural dependency pars- + ing. InICLR. + EPA. 2018. Emissions & Generation Resource Inte- + grated Database (eGRID). Technical report, U.S. + Environmental Protection Agency. + Christopher Forster, Thor Johnsen, Swetha Man- + dava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie + Bernauer, Allison Gray, Sharan Chetlur, and Raul + Puri. 2019. BERT Meets GPUs. Technical report, + NVIDIA AI. + Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. + 2016. Evaluating the energy efficiency of deep con- + volutional neural networks on cpus and gpus.2016 + IEEE International Conferences on Big Data and + Cloud Computing (BDCloud), Social Computing + and Networking (SocialCom), Sustainable Comput- + ing and Communications (SustainCom) (BDCloud- + SocialCom-SustainCom), pages 477–484. + Thang Luong, Hieu Pham, and Christopher D. Man- + ning. 2015.Effective approaches to attention-based + neural machine translation. InProceedings of the + 2015 Conference on Empirical Methods in Natural + Language Processing, pages 1412–1421. Associa- + tion for Computational Linguistics. \ No newline at end of file diff --git a/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt b/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt new file mode 100644 index 0000000..e2f2323 --- /dev/null +++ b/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt @@ -0,0 +1,793 @@ + IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 1381 + Finite-Element Neural Networks for Solving + Differential Equations + Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE + + Abstract—The solution of partial differential equations (PDE) + arises in a wide variety of engineering problems. Solutions to most + practical problems use numerical analysis techniques such as fi- + nite-element or finite-difference methods. The drawbacks of these + approaches include computational costs associated with the mod- + eling of complex geometries. 
This paper proposes a finite-element + neural network (FENN) obtained by embedding a finite-element + model in a neural network architecture that enables fast and ac- + curate solution of the forward problem. Results of applying the + FENN to severalsimpleelectromagnetic forward and inverseprob- + lems are presented. Initial results indicate that the FENN perfor- + mance as a forward model is comparable to that of the conven- + tional finite-element method (FEM). The FENN can also be used + in an iterative approach to solve inverse problems associated with Fig. 1. Iterative inversion method for solving inverse problems. the PDE. Results showing the ability of the FENN to solve the in- + verse problem given the measured signal are also presented. The + parallel nature of the FENN also makes it an attractive solution resulting in the corresponding solution to the forward problem + for parallel implementation in hardware and software. . The model output is compared to the measurement , + Index Terms—Finite-element method (FEM), finite-element using a cost function .If is less than a toler- + neural network (FENN), inverse problems. ance, the estimateis used as the desired solution. If not, + is updated to minimize the cost function. + S I. I Although finite-element methods (FEMs) [3], [4] are ex- NTRODUCTION tremely popular for solving differential equations, their majorOLUTIONS of differential equations arise in a widedrawback is computational complexity. This problem becomesvariety of engineering applications in electromagnetics,more acute when three-dimensional (3-D) finite-elementsignal processing, computational fluid dynamics, etc. Thesemodels are used in an iterative algorithm for solving the inverseequations are typically solved using either analytical or numer-problem. Recently, several authors have suggested the use ofical methods. Analytical solution methods are however feasibleneural networks (MLP or RBF networks [5]) for solving differ-only for simple geometries, which limits their applicability. Inential equations [6]–[9]. In these techniques, a neural networkmost practical problems with complex boundary conditions,is trained using a large database containing the input data andnumerical analysis methods are required in order to obtain athe solution of the differential equation. The neural networkreasonable solution. An example is the solution of Maxwell’sduring generalization learns the mapping corresponding toequations in electromagnetics. Solutions to Maxwell’s equa-the PDE. Alternatively, in [10], the solution to a differentialtions are used in a variety of applications for calculating theequation is written as a constant term, and an adjustable term interaction of electromagnetic (EM) fields with different typeswith parameters that need to be determined. A neural networkof media. is used to determine the optimal values of the parameters.Very often, the solution to differential equations is necessaryThis approach is applicable only to problems with regularfor solving the corresponding inverse problems. Inverse prob-boundaries. An extension of the approach to problems withlems in general are ill-posed, lacking continuous dependence ofirregular boundaries is given in [11]. Other neural networkthe measurements on the input. 
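+
+ A minimal sketch of the iterative inversion scheme of Fig. 1 is given below; forward_model,
+ cost and update are hypothetical stand-ins for the physics model, the cost function and the
+ update rule, and are supplied by the application.
+
+     def iterative_inversion(forward_model, measurement, cost, update,
+                             initial_estimate, tolerance, max_iterations=100):
+         """Refine the estimate until the forward-model output matches the
+         measurement to within the given tolerance (cf. Fig. 1)."""
+         estimate = initial_estimate
+         for _ in range(max_iterations):
+             prediction = forward_model(estimate)       # solve the forward problem
+             if cost(prediction, measurement) < tolerance:
+                 break                                  # accept the estimate
+             estimate = update(estimate, prediction, measurement)  # reduce the cost
+         return estimate
+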
This has resulted in the devel-based differential equation solvers use multilayer perceptronopment of a variety of solution techniques ranging from simplenetworks or variations on the MLP to approximate the unknowncalibration procedures to other direct (analytical) and iterativefunction in a PDE [12]–[14]. A combination of the PDE andapproaches [1]. Iterative methods typically employ a forwardboundary conditions is used to construct an objective functionmodel that simulates the underlying physical process (Fig. 1)that is minimized during the training process.[2]. An initial estimate of the solution of the inverse problem A major limitation of these approaches is that the network ar- (represented byin Fig. 1) is applied to the forward model,chitecture is selected somewhat arbitrarily. A second drawback + is that the performance of the neural networks depends on the + Manuscript received January 17, 2004; revised April 2, 2005. data used in training and testing. As long the test data is sim- + The authors are with the Department of Electrical and Computer Engi- ilar to the training data, the network can interpolate between the neering, Michigan State University, East Lansing, MI 48824 USA (e-mail: training data points to obtain a reasonable prediction. However, rpradeep@egr.msu.edu; udpal@egr.msu.edu; udpa@egr.msu.edu). + Digital Object Identifier 10.1109/TNN.2005.857945 when the test signal is no longer similar to the training data, the + 1045-9227/$20.00 © 2005 IEEE 1382 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + network is forced to extrapolate and the performance degrades. Section V draws conclusions from the results and presents + One way around this difficulty is to ensure that the training data- ideas for future work. + base has a diverse set of signals. However, this is difficult to + ensure in practice. Alternatively, we have to design neural net- II. T HE FENN + works that are capable of extrapolation. Extrapolation methods This section briefly describes the FEM and proposes its refor-are discussed extensively in literature [15]–[18], but the design mulation into a parallel neural network structure. Details aboutof an extrapolation neural network involves several issues par- the FEM can be found in [3] and [4].ticularly for ensuring that the error in the network prediction + stays within reasonable bounds during the extrapolation proce- A. The FEMdure. Consider a typical boundary value problem with the gov-An ideal solution to this problem would be to combine the erning differential equationpower of numerical models with the computational speed of + neural networks, i.e., to embed a numerical model in a neural (1)network structure. One suchfinite-element neural network + (FENN) formulation has been reported by Takeuchi and Kosugi where is a differential operator, is the applied source or + [19]. This approach, based on error minimization, derives the forcing function, and is the unknown quantity. This differen- + neural network using the energy functional resulting from the tial equation can be solved in conjunction with boundary condi- + finite-element formulation. Other reports of FENN combina- tionson theboundary enclosingthedomain .Thevariational + tions are either similar to the Takeuchi method [20], [21] or use formulation used infinite-element analysis determines the un- + Hopfield neural networks to solve the forward problem [22], known by minimizing the functional [3], [4] + [23]. 
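+
+ To make the objective used by these MLP-based solvers concrete, a minimal PyTorch sketch
+ is shown below; the example equation u''(x) = -pi^2 sin(pi x) with u(0) = u(1) = 0, the
+ network size and the optimizer settings are illustrative choices, not those of [12]-[14].
+
+     import math
+     import torch
+
+     net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
+                               torch.nn.Linear(32, 1))
+     opt = torch.optim.Adam(net.parameters(), lr=1e-3)
+
+     for _ in range(2000):
+         x = torch.rand(128, 1, requires_grad=True)
+         u = net(x)
+         du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
+         d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
+         # Objective = PDE residual plus a penalty enforcing the boundary conditions.
+         residual = d2u + math.pi ** 2 * torch.sin(math.pi * x)
+         boundary = net(torch.zeros(1, 1)) ** 2 + net(torch.ones(1, 1)) ** 2
+         loss = residual.pow(2).mean() + boundary.sum()
+         opt.zero_grad()
+         loss.backward()
+         opt.step()
+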
Kalkkuhlet al.[24] provide a description of a FEM-based + approach to NARX modeling that may be interpreted both as (2) + a local model network, as well as a single layer feedforward + network. A slightly different approach to merging numerical with respect to the trial function . The minimization procedure + methods and neural networks is given in [25], where thefi- starts by dividing into small subdomains called elements + nite-difference time domain (FDTD) method is cast in a neural (Fig. 2) and representing in each element by means of basis + network framework for the purpose of solving electromagnetic functions defined over the element + forward problems. The related problem of mesh generation + infinite-element models has also been tackled using neural (3)networks (for instance, [26]). Generally, these networks are + designed to solve the forward problem, and must be modified + to solve inverse problems. where is the unknown solution in element , is the basis + This paper proposes a new approach that embeds afinite-ele- function associated with node in element , is the value + ment model commonly used in the solution of differential equa- of the unknown quantity at node and is the total number of + tions in a neural network. The network, called the FENN, can nodes associated with element . In general, the basis functions + solve the forward problem and can also be used in an itera- (also referred to as interpolation functions or shape functions) + tive algorithm to solve inverse problems. The primary advan- can be linear, quadratic, or of higher order. Typically,finite-el- + tage of this approach is that the FEM is represented in a parallel ement models use either linear or polynomial spline basis func- + form. Thus, it has the potential to alleviate the computational tions. + cost associated with using the FEM in an iterative algorithm The functional within an element is expressed as + for solving inverse problems. More importantly, the FENN does + not need any training, and the computation of the weights is (4) + a one-time process. The proposed approach is also different in + that the neural network architecture developed can be used to + solve the forward and inverse problems. The structure of the By substituting (3) in (4), we obtain the discrete version of the + neural network is also simpler than those reported in the litera- functional within each element + ture, making it easier to implement in parallel in both hardware (5)and software. + The rest of this paper is organized as follows. Section II where is the transpose of a matrix, is the ele-briefly describes the FEM, and derives the proposed FENN. In mental matrix with elements this paper, we focus on the problem of solving typical equa- + tions encountered in electromagnetic nondestructive evaluation (6)(NDE). However, the same concepts can be easily applied + to solve differential equations encountered in otherfields. + Sections III, IV and V present the application of the FENN and is an vector with elements + to solving forward and inverse problems, along with initial + results. A discussion of the advantages and disadvantages of (7) + the proposed FENN architecture is given in Section IV. Finally, RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1383 + + + Combining the values in (5) for each of the elements + + (8) + + where is the global matrix derived from the terms + of the elemental matrices for different elements, and is the + total number of nodes. 
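+
+ As a preview of the assembly step described next, a minimal NumPy sketch shows how element
+ matrices are added into a global matrix according to the nodes that form each element; the
+ two-element 1-D mesh and its element matrix below are hypothetical.
+
+     import numpy as np
+
+     def assemble_global(element_matrices, connectivity, num_nodes):
+         """Add each element matrix entry into the rows/columns of the global
+         nodes of that element (the combination leading to the global matrix)."""
+         K = np.zeros((num_nodes, num_nodes))
+         for K_e, nodes in zip(element_matrices, connectivity):
+             for a, i in enumerate(nodes):
+                 for b, j in enumerate(nodes):
+                     K[i, j] += K_e[a, b]
+         return K
+
+     # Two 1-D elements sharing node 1, each with the same 2x2 element matrix.
+     K_e = np.array([[1.0, -1.0], [-1.0, 1.0]])
+     K = assemble_global([K_e, K_e], [(0, 1), (1, 2)], num_nodes=3)
+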
, also called the stiffness matrix, is a + sparse, banded matrix. Equation (8) is the discrete version of + the functional and can be minimized with respect to the nodal + parameters by taking the derivative of with respect to and + setting it equal to zero, which results in the matrix equation Fig.2. (a)Schematicrepresentationofdomainandboundary. (b)SampleFEM + mesh for the domain. + (9) + + Boundary conditions for these problems are usually of two + types: natural boundary conditions and essential boundary + conditions. Essential boundary conditions (also referred to as + Dirichlet boundary conditions) impose constraints on the value + of the unknown at several nodes. Natural boundary condi- + tions (of which Neumann boundary conditions are a special + case) impose constraints on the change in across a boundary. + Dirichlet boundary conditions are imposed on the functional + minimization (9), by deleting the rows and columns of the + matrix corresponding to the nodes on the Dirichlet boundary + and modifying in (9). Fig. 3. FEM domain discretization using two elements and four nodes. + Natural boundary conditions are applied in the FEM by + adding an additional term to the functional. These boundary This process ensures that natural boundary conditions are im-conditions are then incorporated into the functional and are plicitlyandautomatically satisfiedduring theFEMsolutionpro-satisfied automatically during the solution procedure. As an cedure.example, consider the natural boundary condition represented + by the following equation [3] B. The FENN + on (10) This section describes how thefinite-element model can be + converted intoa parallel network form. Wefocus on solving typ- + where represents the Neumann boundary, is its outward ical inverse problems arising in electromagnetic NDE, but the + normal unit vector, is some constant, and , , and are basicideaisapplicabletootherareas aswell.NDEinverseprob- + known parameters associated with the boundary. Assuming that lems can be formulated as the problem offinding the material + the boundary is made up of segments, we can define properties (such as the conductivity or the permeability) within + boundary matrices and with elements the domain of the problem. Since the domain is discretized in + the FEM method by a large number of elements, the problem + can be posed as one offinding the material properties in each + of these elements. These properties are usually embedded in the + differential operator , or equivalently, in the global matrix . + Thus, in order to be able to iteratively estimate these properties + from the measurements, the material properties need to be sep- + arated out from . This separation is easier to achieve at the + element matrix level. For nodes and in element + (11) + + where are basis functions defined over segment and is + the length of the segment. The elements of are added to the + elementsof that correspond tothe nodeson the boundary . + Similarly, the elements of are added to the corresponding + elements of . The global matrix (9) is thus modified as follows + before solving for (13) + + where is the parameter representing the material property(12) in element and represents the differential operator at the 1384 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 4. FENN. + + + element level without embedded in it. Substituting (13) into neurons, corresponding to the members of the global ma- + the functional, we get trix . 
The output of each group of hidden layer neurons is the + corresponding row vector of . The weights from the input to + the hidden layer are set to the appropriate values of . Each(14) neuron in the hidden layer acts as a summation unit, (equivalent + toasummationfollowedbyalinearactivationfunction[5]).The + If we define outputs of the hidden layer neurons are the elements of the + global matrix as given in (15). + (15) Each group of hidden neurons is connected to one output + neuron (giving a total of output neurons) by a set of weights + , with each element of representing the nodal values .where Note that the set of weights between thefirst group of hidden + neurons and thefirst output neuron are the same as the set of(16)else weights between the second group of hidden neurons and the + second output neuron (as well as between successive groups + of hidden neurons and the corresponding output neuron). Each + output neuron is also a summation unit followed by a linear ac- + tivation function, and the output of each neuron is equal to : + + + (18) + (17) + + where the second part of (18) is obtained by using (15). As an + Equation (17) expresses the functional explicitly in terms of . example, the FENN architecture for a two-element, four-node + The assumption that is constant within each element is im- FEM mesh (Fig. 3) is shown in Fig. 4. In this + plicit in this expression. This assumption is usually satisfied in case, the FENN has two input neurons, 16 hidden layer neurons + problems in NDE where each element in the FEM mesh is de- and four output neurons. Thefigure illustrates the grouping of + fined within the confines of a domain, and at no time does a the hidden layer neurons, as well as the similarity inherent in + single element cross domain boundaries. Furthermore, each el- the weights that connect each group of hidden layer neurons + ement is small enough that minor variations in within an el- to the corresponding output neuron. To simplify thefigure, the + ement may be ignored. Equation (17) can be easily converted weights between the network input and hidden layer neurons + into a parallel network form. The neural network comprises an are depicted by means of vectors (for + input, output and hidden layer. In the general case with el- , 2, 3, 4 and , 2), where the individual weight values + ements and nodes in the FEM mesh, the input layer with are defined as in (16). + network inputs takes the values in each element as input. 1) Boundary Conditions in the FENN: Note that the ele- + The hidden layer has neurons 1 arranged in groups of ments of and in (11) do not depend on the material prop- + 1 erties . and need to be added appropriately to the global In this paper, we use the term“neurons”in the FENN (in the hidden and + output layers) to avoid confusion with the nodes in afinite-element mesh. matrix and the source vector as shown in (12). Equation RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1385 + + + + + + + + + + + + + + + + + + + + + Fig. 5. Geometry of mesh for 1-D FEM. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 6. Flowchart (with example) for designing the FENN for a general PDE. + + + (12) thus implies that natural boundary conditions can be ap- layer neurons. 
These weights will be referred to as the clamped + plied in the FENN as bias inputs to the hidden layer neurons weights, while the remaining weights will be referred to as the + that are a part of the boundary, and the corresponding output free weights. An example of these weights is presented later. + neurons. Dirichlet boundary conditions are applied by clamping The FENN architecture was derived without consideration of + the corresponding weights between the hidden layer and output the dimensionality of the problem at hand, and thus can be used 1386 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + for 1-, 2-, 3-, or higher dimensional problems. The number of + nodes and elements in the FEM mesh dictates the number of + neurons in the different layers. The weights between the input + and hidden layer change depending on node-element connec- + tivity information. + The major drawback of the FENN is the number of neurons + and weights necessary. However, the memory requirements can + be reduced considerably, since most of the weights between the + input and hidden layer are zero. These weights, and the corre- + sponding connections, can be discarded. Similarly, most of the Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b) + elements of the matrix are also zero ( is a banded ma- Problem description using symmetry considerations. + trix). The corresponding neurons in the hidden layer can also + be discarded, reducing memory and computation requirements The network implementation of (23) can be derived as fol- + considerably. Furthermore, the weights between each group of lows. If and values at each element are the inputs to the + hidden layer neurons and the output layer are the same . network, , , , and form the weights + Weight-sharing approaches can be used here to further reduce between the input and hidden layers. The network thus uses + the storage requirements. inputneuronsand hiddenneurons.Thevaluesof ateachof + thenodesareassigned asweightsbetweenthehidden andoutput + C. A 1-D Example layers, and the source is the desired output of this network + Consider the 1-D equation (corresponding to the output neurons). Dirichlet boundary + conditions on are applied as explained earlier. + + (19) D. General Case + Fig. 6 shows aflowchart of the general scheme for convertingboundary conditions on the boundary defined by . a differential equation into the FENN structure. An exampleand are constants depending on the material and is the in two dimensions is also provided next to theflowchart. Weapplied source. Laplace’s equation and Poisson’s equation are start with the differential equation and the boundary conditionsspecial cases of this equation. The FENN formulation for this and formulate the FEM using the variational method. This in-problem starts by discretizing the domain of interest with el- volves discretizing the domain of interest with elements andements and nodes. In one dimension, each element is defined nodes, selecting basis functions, writing the functional forby two nodes (Fig. 5). Define basis functions and over each element and obtaining the element matrices and the sourceeach element and let is the value of on node in element vector. The example presented uses the FEM mesh shown in. An example of the basis functions is shown in Fig. 5. Fig. 3, with elements, and nodes, and linearFor these basis functions, i.e., basis functions. 
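For concreteness, here is a minimal numerical sketch of the 1-D discretization just described. The exact expressions of (19)-(23) were lost in this copy, so the element matrices below use the standard linear-element form for -d/dx(alpha dphi/dx) + beta*phi = f from the FEM literature (e.g., [3]); that form, and the row/column elimination used for the Dirichlet condition, are stated here as assumptions rather than quoted from the paper, and all function names are illustrative. The final solve corresponds to the conventional matrix equation (9).

```python
import numpy as np

def element_matrices(alpha_e, beta_e, f_e, h_e):
    """Element matrix and source vector for one two-node linear element of
    length h_e, for -d/dx(alpha dphi/dx) + beta*phi = f (assumed form of
    (19)-(22); alpha, beta, f constant within the element)."""
    Ke = (alpha_e / h_e) * np.array([[1.0, -1.0], [-1.0, 1.0]]) \
         + (beta_e * h_e / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])
    be = (f_e * h_e / 2.0) * np.array([1.0, 1.0])
    return Ke, be

def assemble(alpha, beta, f, x):
    """Add each element matrix into the rows/columns of its two nodes,
    giving the sparse tridiagonal global matrix of (23) and the source vector."""
    n = len(x)
    K, b = np.zeros((n, n)), np.zeros(n)
    for e in range(n - 1):                    # element e joins nodes e and e+1
        Ke, be = element_matrices(alpha[e], beta[e], f[e], x[e + 1] - x[e])
        K[np.ix_([e, e + 1], [e, e + 1])] += Ke
        b[[e, e + 1]] += be
    return K, b

def apply_dirichlet(K, b, node, value):
    """Essential (Dirichlet) condition phi[node] = value, imposed by
    eliminating the corresponding row and column as described earlier."""
    b -= K[:, node] * value
    K[node, :], K[:, node] = 0.0, 0.0
    K[node, node], b[node] = 1.0, value
    return K, b

# 10 elements on [0, 1]; unit alpha, zero beta and source, phi(0)=0, phi(1)=1.
x = np.linspace(0.0, 1.0, 11)
K, b = assemble(np.ones(10), np.zeros(10), np.zeros(10), x)
for node, value in [(0, 0.0), (10, 1.0)]:
    K, b = apply_dirichlet(K, b, node, value)
phi = np.linalg.solve(K, b)                   # conventional FEM solve of (9)
```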
The unknown solution to the differential equa- + tion is represented by its values at each of the nodes in the(20) finite-element mesh . The element matrices are then + separated into two parts, with one part dependent on the mate-the element matrices are given by [3] rial properties and while the other is independent of them. + The FENN is then designed to have input neurons, + hidden neurons, and output neurons, where is the number + of material property parameters. In the example under consid- + eration, , since we have two material property parameters(21) ( and ). Thefirst group of input neurons takes in the + values while the second group takes in the values in each ele- + ment. The weights from the input to the hidden layer are set to + the appropriate values of . In the example, since nodes 1, 2, + (22) and 3 are part of element 1 (see Fig. 3), the weights from thefirst + input node to thefirst group of four neurons in the hidden + Here, is the length of element . The global matrix is then layer are given by + constructed by selectively adding the element matrices based + on the nodes that form an element. Specifically, is a sparse + tridiagonal matrix, and its nonzero elements are given by (24) + + The last weight is zero since node 4 is not a part of element 1. + Each group of hidden neurons is connected to one output + neuron (giving a total of output neurons) by a set of weights + , with each element of representing the nodal values . The + (23) output of each neuron in the output layer is equal to . RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1387 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 8. Forward problem solutions for shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error + between (a) and (b). Thex- andy-axes show the nodes in the FEM discretization of the domain, and thez-axis in (c) shows the error at each of these nodes in volts. + + + + III. F ORWARD AND INVERSE PROBLEM FORMULATION USING where is the output of the FENN. Then, for a gradient- + FENN based approach, the gradients of the error with respect to the + free hidden layer weights is given by + + The FENN architecture and algorithm lends itself to solving (27)both the forward and inverse problems. The forward problem + involves determining the weights given the material parame- Equation (27) can be used to solve the forward problem. Sim-ters and and the applied source while the inverse problem ilarly, to solve the inverse problem, the gradients of the errorinvolves determining and given and . Any optimization with respect to and (input of the FENN) are necessary, andapproach can be used to solve both these problems. Suppose we are given bydefine the error at the output of the FENN as + + + + + (28) + + + + + (26) (29) 1388 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + + TABLE I + SUMMARY OF PERFORMANCE OF THE FENN A LGORITHM FOR VARIOUS PDE S + + + + + + + + + + + + + + + + + + + + + + + + + + + For the forward problem, such an approach is equivalent to the Dirichlet boundary, with on the microstrip and on + iterative approaches used to solve for the unknown nodal values the outer boundary [Fig. 7(b)]. Finally, there is no source term + in the FEM [4]. in this example (the source term would correspond to a charge + distribution in the domain of interest), i.e., . In this ex- + IV. R ESULTS ample, we assume that volts and . 
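The gradient expressions in (26)-(29) did not survive extraction, so the following sketch only illustrates the idea of Section III for the 1-D case above: the input-to-hidden weights are the alpha-independent element matrices of (13), the FENN output is K(alpha)*phi as in (18), and a squared output error is reduced by plain gradient descent, over the free nodal values phi for the forward problem or over the material inputs alpha for the inverse problem, with clamped (Dirichlet) entries held fixed. The squared-error choice, step sizes, and helper names are assumptions made for illustration, not the paper's notation.

```python
import numpy as np

def input_hidden_weights(x):
    """Alpha-independent element matrices W of (13), so K(alpha) = sum_e alpha[e]*W[e].
    (1-D linear elements; the beta term is omitted for brevity.)"""
    n = len(x)
    W = np.zeros((n - 1, n, n))
    for e in range(n - 1):
        h_e = x[e + 1] - x[e]
        W[e][np.ix_([e, e + 1], [e, e + 1])] = (1.0 / h_e) * np.array([[1.0, -1.0],
                                                                       [-1.0, 1.0]])
    return W

def fenn_output(W, alpha, phi):
    """FENN forward pass: the hidden layer forms K = sum_e alpha[e]*W[e] (the global
    matrix entries), the output layer multiplies by the nodal values phi, cf. (18)."""
    K = np.tensordot(alpha, W, axes=1)
    return K @ phi, K

def solve_forward(W, alpha, b, phi0, free, lr=1e-3, iters=20000):
    """Forward problem: gradient descent on the free nodal weights phi so that the
    FENN output matches the source vector b (error E = 0.5*||K phi - b||^2)."""
    phi = phi0.copy()
    for _ in range(iters):
        out, K = fenn_output(W, alpha, phi)
        phi[free] -= lr * (K.T @ (out - b))[free]   # clamped nodes stay fixed
    return phi

def solve_inverse(W, alpha0, phi, b, free, lr=1e-3, iters=20000):
    """Inverse problem: gradient descent on the material inputs alpha for given
    nodal values phi and source b; dE/dalpha_e = (W[e] @ phi) . (out - b)."""
    alpha = alpha0.copy()
    for _ in range(iters):
        out, _ = fenn_output(W, alpha, phi)
        alpha[free] -= lr * np.tensordot(W @ phi, out - b, axes=([1], [0]))[free]
    return alpha

# Step sizes are illustrative only; the paper notes that better optimizers
# (e.g., conjugate gradients) are expected to speed up convergence.
```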
Further, we + assume that the domain of interest is .A. Forward Model Results The solution to the forward problem is presented in Fig. 8, + The FENN was tested using both 1- and 2-D versions of with the FEM solution using 11 nodes in each direction shown + Poisson’s equation in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b). + + (30) Thesefigures show contours of constant potential. The error be- + tween the FEM and FENN solutions is presented in Fig. 8(c). As + where represents the material property, and is the applied seen from thefigure, the FENN is seen to match the FEM solu- + source. For instance, in electromagnetics may represent the tion accurately, with the peak error at any node on the order of + permittivity while represents the charge density. . + As thefirst example, consider the following 2-D equation Several other examples were also used to test the FENN and + the results are summarized in Table I. Column 1 shows the + (31) PDE used to evaluate the FENN performance, while column 2 + shows the boundary conditions used. The analytic solution to + with boundary conditions the problem is indicated in Column 3. The FENN structure and + + on (32) the number of iterations for convergence using a gradient de- + scent approach are indicated in Columns 4 and 5, respectively. + and The FENN structure, as explained earlier, has inputs, + hidden neurons and output neurons, where and are the + on (33) number of elements and nodes in the FEM mesh, respectively, + and is the number of hidden neurons, and corresponds to the + This is the governing equation for the shielded microstrip trans- number of nonzero elements in the FEM global matrix . Fi- + mission line problem shown in Fig. 7. The forward problem nally, Columns 6 and 7 present the sum-squared error (SSE) and + computes the electric potential due to the shielded microstrip the maximum error in the solution, respectively, where the er- + shown in Fig. 7(a). The potentials are zero on the shielding con- rors are computed with respect to the analytical solution. These + ductor.Sincethegeometryissymmetric,wecansolvetheequiv- results indicate that the FENN is capable of accurately deter- + alent problem shown in Fig. 7(b), by applying the homogeneous mining the potential . One advantage of the FENN approach + Neumann condition on the plane of symmetry. The inner con- is that the computation of the input-hidden layer weights is a + ductor (microstrip) is held at a constant potential of volts. one-time process, as long as the differential equation does not + Finally, we also assume that the material inside the shielding change. The only changes necessary to solve the different prob- + conductor has a permittivity , where K is a constant. The lems are changes in the input and the desired output . + permittivity in this case corresponds to the material property . + Specifically, and . The homogeneous Neu- B. Inverse Model Results + mann boundary condition is equivalent to setting . TheFENNwasalsousedtosolveseveralsimpleinverseprob- + The microstrip and the shielding conductor correspond to the lems based on (30). In all cases, the objective was to determine RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1389 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 9. FENN inversion results for Poisson’s equation with initial solutions (a) = x . (b) =1+ x . + + + the value of and for given values of and . 
Thefirst ex- In order to obtain a unique solution, we need to constrain the + ample is a 1-D problem that involves determining given value of at the boundary as well. Consider the same differen- + and , for the differential equation tial equation as (34), but with and specified as follows: + + (34) and + + with boundary conditions and . The analyt- (36) + ical solution to this inverse problem is The analytical solution for this equation is .To + and (35) solve this problem, we set and clamp the value of at + As seen from (35), the problem has an infinite number of solu- and as follows: , . + tions and we expect the solution procedure to converge to one The results of the constrained inversion obtained using 11 + of these solutions depending on the initial value. nodes and 10 elements in the correspondingfinite-element mesh + Fig. 9(a) and (b) shows two solutions to this inverse problem are shown in Fig. 10. Fig. 10(a) shows the comparison between + for two different initializations (shown using triangles). In both the analytical solution (solid line with squares) and the FENN + cases, the FENN solution (in stars) is seen to match the analyt- result (solid line with stars). The initial value of is shown in + ical solution (squares). The SSE in both cases was on the order thefigure as a dashed line. Fig. 10(b) shows the comparison + of . between the actual and desired forcing function at the FENN 1390 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 10. Constrained inversion result with eleven nodes. (a) Comparison of analytic and simulation results for . (b) Comparison of actual and desired NN outputs. + + + output. This result indicates that the SSE in the forcing function, weight structure that allows both the forward and inverse prob- + as well as the SSE in the inversion result, is fairly large (0.0148 lemstobesolvedusingsimplegradient-basedalgorithms.Initial + and 0.0197, respectively). The reason for this was traced back results indicate that the proposed FENN algorithm is capable of + to the mesh discretization. Fig. 11 shows the SSE in the output accurately solving both the forward and inverse problems. In + of the FENN and the SSE in the inverse problem solution as a addition, the forward problem solution from the FENN is seen + function of FEM discretization. It is seen that increasing the dis- to exactly match the FEM solution, indicating that the FENN + cretization significantly improves the solution. Similar results represents thefinite-element model exactly in a parallel config- + were observed for other problems. uration. + The major advantage of the FENN is that it represents the + finite-element model in a parallel form, enabling parallel imple- + V. D ISCUSSION AND CONCLUSION mentation in either hardware or software. Further, computing + gradients in the FENN is very simple. This is an advantage in + The FENN is closely related to thefinite-element model used solving bothforward and inverse problems using gradient-based + to solve differential equations. The FENN architecture has a methods. The gradients can also be computed in parallel and RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1391 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Fig. 11. SSE in FENN output and inversion results as a function of discretization. 
the lack of nonlinearities in the neuron activation functions makes the computation of gradients simpler. A major advantage of this approach for solving inverse problems is that it avoids inverting the global matrix in each iteration. The FENN also does not require any training, since most of its weights can be computed in advance and stored. The weights depend on the governing differential equation and its associated boundary conditions, and as long as these two factors do not change, the weights do not change. This is especially an advantage in solving inverse problems in electromagnetic NDE. This approach also reduces the computational effort associated with the network.

Future work will concentrate on applying the FENN to 3-D electromagnetic NDE problems. The robustness of the approach will also be tested, since the ability of these approaches to invert practical noisy measurements is important. Furthermore, the use of better optimization algorithms, like conjugate gradient methods, is expected to improve the solution speed. In addition, parallel implementation of the FENN in both hardware and software is under investigation. The approach described in this paper is very general in that it can be applied to a variety of inverse problems in fields other than electromagnetic NDE. Some of these other applications will also be investigated to show the general nature of the proposed method.

REFERENCES

[1] L. Udpa and S. S. Udpa, "Application of signal processing and pattern recognition techniques to inverse problems in NDE," Int. J. Appl. Electromagn. Mechan., vol. 8, pp. 99-117, 1997.
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. Sacks, "Iterative algorithms for electromagnetic NDE signal inversion," in ENDE '97, Reggio Calabria, Italy, Sep. 14-16, 1997.
[3] J. Jin, The Finite Element Method in Electromagnetics. New York: Wiley, 1993.
[4] P. Zhou, Numerical Analysis of Electromagnetic Fields. Berlin, Germany: Springer-Verlag, 1993.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1994.
[6] C. A. Jensen et al., "Inversion of feedforward neural networks: algorithms and applications," Proc. IEEE, vol. 87, no. 9, pp. 1536-1549, 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa, "Neural network algorithm for electromagnetic NDE signal inversion," in ENDE 2000, Budapest, Hungary, Jun. 2000.
[8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr., and A. P. Ewing, "Automation of SQUID nondestructive evaluation of steel plates by neural networks," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3475-3478, 1999.
[9] W. Qing, S. Xueqin, Y. Qingxin, and Y. Weili, "Using wavelet neural networks for the optimal design of electromagnetic devices," IEEE Trans. Magn., vol. 33, no. 2, pp. 1928-1930, 1997.
[10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, "Artificial neural networks for solving ordinary and partial differential equations," IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 987-1000, 1998.
[11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, "Neural-network methods for boundary value problems with irregular boundaries," IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1041-1049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez, "Neural network differential equation and plasma equilibrium solver," Phys. Rev. Lett., vol. 75, no. 20, pp. 3594-3597, 1995.
[13] M. W. M. G. Dissanayake and N. Phan-Thien, "Neural-network-based approximations for solving partial differential equations," Commun. Numer. Meth. Eng., vol. 10, pp. 195-201, 1994.
[14] R. Masuoka, "Neural networks learning differential data," IEICE Trans. Inform. Syst., vol. E83-D, no. 6, pp. 1291-1300, 2000.
[15] D. C. Youla, "Generalized image restoration by the method of alternating orthogonal projections," IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 694-702, 1978.
[16] D. C. Youla and H. Webb, "Image restoration by the method of convex projections: part I - theory," IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 81-94, 1982.
[17] A. Lent and H. Tuy, "An iterative method for the extrapolation of band-limited functions," J. Math. Analysis and Applicat., vol. 83, pp. 554-565, 1981.
[18] W. Chen, "A new extrapolation algorithm for band-limited signals using the regularization method," IEEE Trans. Signal Process., vol. 41, no. 3, pp. 1048-1060, 1993.
[19] J. Takeuchi and Y. Kosugi, "Neural network representation of the finite element method," Neural Netw., vol. 7, no. 2, pp. 389-395, 1994.
[20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady, "Artificial neural network application for material evaluation by electromagnetic methods," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4027-4032.
[21] G. Xu, G. Littlefair, R. Penson, and R. Callan, "Application of FE-based neural networks to dynamic problems," in Proc. Int. Conf. Neural Information Processing, vol. 3, 1999, pp. 1039-1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu, "Finite element analysis-based Hopfield neural network model for solving nonlinear electromagnetic field problems," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4399-4403.
[23] H. Lee and I. S. Kang, "Neural algorithm for solving differential equations," J. Computat. Phys., vol. 91, pp. 110-131, 1990.
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, "FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control," IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885-897, 1999.
[25] R. K. Mishra and P. S. Hall, "NFDTD concept," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484-490, 2005.
[26] D. G. Triantafyllidis and D. P. Labridis, "A finite-element mesh generator based on growing neural networks," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482-1496, 2002.

Lalita Udpa (S'84-M'86-SM'96) received the Ph.D. degree in electrical engineering from Colorado State University, Fort Collins, in 1986. She is currently a Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing. She works primarily in the broad areas of nondestructive evaluation, signal processing, and biomedical applications. Her research interests include various aspects of NDE, such as development of computational models for the forward problem in NDE, signal and image processing, pattern recognition and neural networks, and development of solution techniques for inverse problems. Her current projects include finite-element modeling of electromagnetic NDE phenomena, application of neural network and signal processing algorithms to NDE data, and development of image processing techniques for the analysis of NDE and biomedical images. Dr. Udpa is a Member of Eta Kappa Nu and Sigma Xi.

Satish S. Udpa (S'82-M'82-SM'91-F'03) received the B.Tech. degree in 1975 and the Post Graduate Diploma in electrical engineering in 1977 from J.N.T. University, Hyderabad, India. He received the M.S. degree in 1980 and the Ph.D. degree in electrical engineering in 1983, both from Colorado State University, Fort Collins. He has been with Michigan State University, East Lansing, since 2001 and is currently Acting Dean for the College of Engineering and a Professor with the Electrical and Computer Engineering Department. Prior to joining Michigan State, he was a Professor with Iowa State University, Ames, from 1990 to 2001 and was associated with the Materials Assessment Research Group. Prior to joining Iowa State, he was an Associate Professor with the Department of Electrical Engineering at Colorado State University. His research interests span the broad area of materials characterization and nondestructive evaluation (NDE). Work done by him to date in the area includes an extensive repertoire of forward models for simulating physical processes underlying several inspection techniques. Coupled with careful experimental work, such forward models can be used for designing new sensors, optimizing test conditions, estimating the probability of detection, assessing designs for inspectability, and training inverse models for characterizing defects. He has also been involved in the development of system-, as well as model-based, inverse solutions for defect and material property characterization. His interests have expanded in recent years to include the development of noninvasive tools for clinical applications. Work done to date in this field includes the development of new electromagnetic-acoustic (EMAT) methods for detecting single leg separation failures in artificial heart valves and microwave imaging and ablation therapy systems. He and his research group have been engaged in the design and development of high-performance instrumentation including acoustic microscopes and single- and multifrequency eddy current NDE instruments. These systems, as well as software packages embodying algorithms developed by Udpa for defect classification and characterization, have been licensed to industry. He is a Fellow of the American Society for Nondestructive Testing (ASNT) and a Fellow of the Indian Society of Nondestructive Testing.

Pradeep Ramuhalli (S'92-M'02) received the B.Tech. degree from J.N.T. University, Hyderabad, India, in electronics and communications engineering in 1995, and the M.S. and Ph.D. degrees in electrical engineering from Iowa State University, Ames, in 1998 and 2002, respectively. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing. His research is in the general area of nondestructive evaluation and materials characterization. His research interests include the application of signal and image processing methods, pattern recognition and neural networks for nondestructive evaluation applications, development of model-based solutions for inverse problems in NDE, and the development of information fusion algorithms for multimodal data fusion. Dr. Ramuhalli is a Member of Phi Kappa Phi.
\ No newline at end of file diff --git a/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt b/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt new file mode 100644 index 0000000..2c6c299 Binary files /dev/null and b/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt differ diff --git a/Corpus/Green AI - Roy Schwartz.txt b/Corpus/Green AI - Roy Schwartz.txt new file mode 100644 index 0000000..299197d Binary files /dev/null and b/Corpus/Green AI - Roy Schwartz.txt differ diff --git a/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt b/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt new file mode 100644 index 0000000..73d70e5 Binary files /dev/null and b/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt differ diff --git a/Corpus/Identity Mappings in Deep Residual Networks.txt b/Corpus/Identity Mappings in Deep Residual Networks.txt new file mode 100644 index 0000000..85ba774 Binary files /dev/null and b/Corpus/Identity Mappings in Deep Residual Networks.txt differ diff --git a/Corpus/Language Models are Few-Shot Learners.txt b/Corpus/Language Models are Few-Shot Learners.txt new file mode 100644 index 0000000..2b3bb92 Binary files /dev/null and b/Corpus/Language Models are Few-Shot Learners.txt differ diff --git a/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt b/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt new file mode 100644 index 0000000..a98b373 --- /dev/null +++ b/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt @@ -0,0 +1,399 @@ + Learning Efficient Convolutional Networks through Network Slimming + + + Zhuang Liu 1∗ Jianguo Li 2 Zhiqiang Shen 3 Gao Huang 4 Shoumeng Yan 2 Changshui Zhang 1 + 1 CSAI, TNList, Tsinghua University 2 Intel Labs China 3 Fudan University 4 Cornell University + {liuzhuangthu, zhiqiangshen0214}@gmail.com,{jianguo.li, shoumeng.yan}@intel.com, + gh349@cornell.edu, zcs@mail.tsinghua.edu.cn + + + + Abstract However, larger CNNs, although with stronger represen- + tation power, are more resource-hungry. For instance, a + The deployment of deep convolutional neural networks 152-layer ResNet [14] has more than 60 million parame- + (CNNs) in many real world applications is largely hindered ters and requires more than 20 Giga float-point-operations + by their high computational cost. In this paper, we propose (FLOPs) when inferencing an image with resolution 224× + a novel learning scheme for CNNs to simultaneously 1) re- 224. This is unlikely to be affordable on resource con- + duce the model size; 2) decrease the run-time memory foot- strained platforms such as mobile devices, wearables or In- + print; and 3) lower the number of computing operations, ternet of Things (IoT) devices. + without compromising accuracy. This is achieved by en- The deployment of CNNs in real world applications areforcing channel-level sparsity in the network in a simple but mostly constrained by1) Model size: CNNs’ strong repre-effective way. Different from many existing approaches, the sentation power comes from their millions of trainable pa-proposed method directly applies to modern CNN architec- rameters. 
Those parameters, along with network structuretures, introduces minimum overhead to the training process, information, need to be stored on disk and loaded into mem-and requires no special software/hardware accelerators for ory during inference time. As an example, storing a typi-the resulting models. We call our approachnetwork slim- cal CNN trained on ImageNet consumes more than 300MBming, which takes wide and large networks as input mod- space, which is a big resource burden to embedded devices.els, but during training insignificant channels are automat- 2) Run-time memory: During inference time, the interme-ically identified and pruned afterwards, yielding thin and diate activations/responses of CNNs could even take morecompact models with comparable accuracy. We empirically memory space than storing the model parameters, even withdemonstrate the effectiveness of our approach with several batch size 1. This is not a problem for high-end GPUs, butstate-of-the-art CNN models, including VGGNet, ResNet unaffordable for many applications with low computationaland DenseNet, on various image classification datasets. For power.3) Number of computing operations:The convolu-VGGNet, a multi-pass version of network slimming gives a tion operations are computationally intensive on high reso-20×reduction in model size and a 5×reduction in comput- lution images. A large CNN may take several minutes toing operations. process one single image on a mobile device, making it un- + realistic to be adopted for real applications. + 1. Introduction Many works have been proposed to compress large + CNNs or directly learn more efficient CNN models for fast + In recent years, convolutional neural networks (CNNs) inference. These include low-rank approximation [7], net- + have become the dominant approach for a variety of com- work quantization [3, 12] and binarization [28, 6], weight + puter vision tasks, e.g., image classification [22], object pruning [12], dynamic inference [16], etc. However, most + detection [8], semantic segmentation [26]. Large-scale of these methods can only address one or two challenges + datasets, high-end modern GPUs and new network architec- mentioned above. Moreover, some of the techniques require + tures allow the development of unprecedented large CNN specially designed software/hardware accelerators for exe- + models. For instance, from AlexNet [22], VGGNet [31] and cution speedup [28, 6, 12]. + GoogleNet [34] to ResNets [14], the ImageNet Classifica- Another direction to reduce the resource consumption of + tion Challenge winner models have evolved from 8 layers large CNNs is to sparsify the network. Sparsity can be im- + to more than 100 layers. posed on different level of structures [2, 37, 35, 29, 25], + ∗ This work was done when Zhuang Liu and Zhiqiang Shen were interns which yields considerable model-size compression and in- + at Intel Labs China. Jianguo Li is the corresponding author. ference speedup. However, these approaches generally re- + + + + 2736 channel scaling channel scaling i-thconv-layer factors (i+1)=j-th i-thconv-layer factors (i+1)=j-th + conv-layer conv-layer Ci1 1.170 C 1.170 + C C i1 + i2 0.001 j1 Cj1 + Ci3 0.290 pruning Ci3 0.290 + C 0.003 Ci4 j2 Cj2 + … … … + … … + … + + C Cin 0.820 in 0.820 + initial network compact network + Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. 
Sparsity + regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small + scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then + fine-tuned to achieve comparable (or even higher) accuracy as normally trained full network. + + quire special software/hardware accelerators to harvest the Low-rank Decompositionapproximates weight matrix in + gain in memory or time savings, though it is easier than neural networks with low-rank matrix using techniques like + non-structured sparse weight matrix as in [12]. Singular Value Decomposition (SVD) [7]. This method + In this paper, we proposenetwork slimming, a simple works especially well on fully-connected layers, yield- + yet effective network training scheme, which addresses all ing∼3x model-size compression however without notable + the aforementioned challenges when deploying large CNNs speed acceleration, since computing operations in CNN + under limited resources. Our approach imposes L1 regular- mainly come from convolutional layers. + ization on the scaling factors in batch normalization (BN) Weight Quantization. HashNet [3] proposes to quantizelayers, thus it is easy to implement without introducing any the network weights. Before training, network weights arechange to existing CNN architectures. Pushing the val- hashed to different groups and within each group weightues of BN scaling factors towards zero with L1 regulariza- the value is shared. In this way only the shared weights andtion enables us to identify insignificant channels (or neu- hash indices need to be stored, thus a large amount of stor-rons), as each scaling factor corresponds to a specific con- age space could be saved. [12] uses a improved quantizationvolutional channel (or a neuron in a fully-connected layer). technique in a deep compression pipeline and achieves 35xThis facilitates the channel-level pruning at the followed to 49x compression rates on AlexNet and VGGNet. How-step. The additional regularization term rarely hurt the per- ever, these techniques can neither save run-time memoryformance. In fact, in some cases it leads to higher gen- nor inference time, since during inference shared weightseralization accuracy. Pruning unimportant channels may need to be restored to their original positions.sometimes temporarily degrade the performance, but this [28, 6] quantize real-valued weights into binary/ternaryeffect can be compensated by the followed fine-tuning of weights (weight values restricted to{−1,1}or{−1,0,1}).the pruned network. After pruning, the resulting narrower This yields a large amount of model-size saving, and signifi-network is much more compact in terms of model size, run- cant speedup could also be obtained given bitwise operationtime memory, and computing operations compared to the libraries. However, this aggressive low-bit approximationinitial wide network. The above process can be repeated method usually comes with a moderate accuracy loss. for several times, yielding a multi-pass network slimming + scheme which leads to even more compact network. Weight Pruning / Sparsifying.[12] proposes to prune the + Experiments on several benchmark datasets and different unimportant connections with small weights in trained neu- + network architectures show that we can obtain CNN models ral networks. 
The resulting network’s weights are mostly + with up to 20x mode-size compression and 5x reduction in zeros thus the storage space can be reduced by storing the + computing operations of the original ones, while achieving model in a sparse format. However, these methods can only + the same or even higher accuracy. Moreover, our method achieve speedup with dedicated sparse matrix operation li- + achieves model compression and inference speedup with braries and/or hardware. The run-time memory saving is + conventional hardware and deep learning software pack- also very limited since most memory space is consumed by + ages, since the resulting narrower model is free of any the activation maps (still dense) instead of the weights. + sparse storing format or computing operations. In [12], there is no guidance for sparsity during training. + [32] overcomes this limitation by explicitly imposing sparse + 2. Related Work constraint over each weight with additional gate variables, + and achieve high compression rates by pruning connections + In this section, we discuss related work from five aspects. with zero gate values. This method achieves better com- + + + + 2737 pression rate than [12], but suffers from the same drawback. Advantages of Channel-level Sparsity. As discussed in + prior works [35, 23, 11], sparsity can be realized at differ-Structured Pruning / Sparsifying. Recently, [23] pro- ent levels, e.g., weight-level, kernel-level, channel-level orposes to prune channels with small incoming weights in layer-level. Fine-grained level (e.g., weight-level) sparsitytrained CNNs, and then fine-tune the network to regain gives the highest flexibility and generality leads to higheraccuracy. [2] introduces sparsity by random deactivat- compression rate, but it usually requires special software oring input-output channel-wise connections in convolutional hardware accelerators to do fast inference on the sparsifiedlayers before training, which also yields smaller networks model [11]. On the contrary, the coarsest layer-level spar-with moderate accuracy loss. Compared with these works, sity does not require special packages to harvest the infer-we explicitly impose channel-wise sparsity in the optimiza- ence speedup, while it is less flexible as some whole layerstion objective during training, leading to smoother channel need to be pruned. In fact, removing layers is only effec-pruning process and little accuracy loss. tive when the depth is sufficiently large, e.g., more than 50[37] imposes neuron-level sparsity during training thus layers [35, 18]. In comparison, channel-level sparsity pro-some neurons could be pruned to obtain compact networks. vides a nice tradeoff between flexibility and ease of imple-[35] proposes a Structured Sparsity Learning (SSL) method mentation. It can be applied to any typical CNNs or fully-to sparsify different level of structures (e.g. filters, channels connected networks (treat each neuron as a channel), andor layers) in CNNs. Both methods utilize group sparsity the resulting network is essentially a “thinned” version ofregualarization during training to obtain structured spar- the unpruned network, which can be efficiently inferenced sity. Instead of resorting to group sparsity on convolu- on conventional CNN platforms.tional weights, our approach imposes simple L1 sparsity on + channel-wise scaling factors, thus the optimization objec- Challenges. Achieving channel-level sparsity requires + tive is much simpler. 
pruning all the incoming and outgoing connections asso- + Since these methods prune or sparsify part of the net- ciated with a channel. This renders the method of directly + work structures (e.g., neurons, channels) instead of individ- pruning weights on a pre-trained model ineffective, as it is + ual weights, they usually require less specialized libraries unlikely that all the weights at the input or output end of + (e.g. for sparse computing operation) to achieve inference a channel happen to have near zero values. As reported in + speedup and run-time memory saving. Our network slim- [23], pruning channels on pre-trained ResNets can only lead + ming also falls into this category, with absolutely no special to a reduction of∼10% in the number of parameters without + libraries needed to obtain the benefits. suffering from accuracy loss. [35] addresses this problem + by enforcing sparsity regularization into the training objec-Neural Architecture Learning. While state-of-the-art tive. Specifically, they adoptgroup LASSOto push all theCNNs are typically designed by experts [22, 31, 14], there filter weights corresponds to the same channel towards zeroare also some explorations on automatically learning net- simultaneously during training. However, this approach re-work architectures. [20] introduces sub-modular/super- quires computing the gradients of the additional regulariza-modular optimization for network architecture search with tion term with respect to all the filter weights, which is non-a given resource budget. Some recent works [38, 1] propose trivial. We introduce a simple idea to address the aboveto learn neural architecture automatically with reinforce- challenges, and the details are presented below.ment learning. The searching space of these methods are + extremely large, thus one needs to train hundreds of mod- Scaling Factors and Sparsity-induced Penalty.Our idea + els to distinguish good from bad ones. Network slimming is introducing a scaling factorγfor each channel, which is + can also be treated as an approach for architecture learning, multiplied to the output of that channel. Then we jointly + despite the choices are limited to the width of each layer. train the network weights and these scaling factors, with + However, in contrast to the aforementioned methods, net- sparsity regularization imposed on the latter. Finally we + work slimming learns network architecture through only a prune those channels with small factors, and fine-tune the + single training process, which is in line with our goal of pruned network. Specifically, the training objective of our + efficiency. approach is given by + + 3. Network slimming L= l(f(x,W),y) +λ g(γ) (1) + (x,y) γ∈Γ We aim to provide a simple scheme to achieve channel- + level sparsity in deep CNNs. In this section, we first dis- where(x,y)denote the train input and target,Wdenotes + cuss the advantages and challenges of channel-level spar- the trainable weights, the first sum-term corresponds to the + sity, and introduce how we leverage the scaling layers in normal training loss of a CNN,g(·)is a sparsity-induced + batch normalization to effectively identify and prune unim- penalty on the scaling factors, andλbalances the two terms. + portant channels in the network. In our experiment, we chooseg(s) =|s|, which is known as + + + + 2738 convolution layers. 
2), if we insert a scaling layer before + a BN layer, the scaling effect of the scaling layer will be + Train with Prune channels Initial Fine-tune the Compact completely canceled by the normalization process in BN. channel sparsity with small network pruned network networkregularization scaling factors 3), if we insert scaling layer after BN layer, there are two + consecutive scaling factors for each channel. Figure 2: Flow-chart of network slimming procedure. The dotted- + line is for the multi-pass/iterative scheme. Channel Pruning and Fine-tuning.After training under + channel-level sparsity-induced regularization, we obtain a + L1-norm and widely used to achieve sparsity. Subgradient model in which many scaling factors are near zero (see Fig- + descent is adopted as the optimization method for the non- ure 1). Then we can prune channels with near-zero scaling + smooth L1 penalty term. An alternative option is to replace factors, by removing all their incoming and outgoing con- + the L1 penalty with the smooth-L1 penalty [30] to avoid nections and corresponding weights. We prune channels + using sub-gradient at non-smooth point. with a global threshold across all layers, which is defined + As pruning a channel essentially corresponds to remov- as a certain percentile of all the scaling factor values. For + ing all the incoming and outgoing connections of that chan- instance, we prune 70% channels with lower scaling factors + nel, we can directly obtain a narrow network (see Figure 1) by choosing the percentile threshold as 70%. By doing so, + without resorting to any special sparse computation pack- we obtain a more compact network with less parameters and + ages. The scaling factors act as the agents for channel se- run-time memory, as well as less computing operations. + lection. As they are jointly optimized with the network Pruning may temporarily lead to some accuracy loss, + weights, the network can automatically identity insignifi- when the pruning ratio is high. But this can be largely com- + cant channels, which can be safely removed without greatly pensated by the followed fine-tuning process on the pruned + affecting the generalization performance. network. In our experiments, the fine-tuned narrow network + Leveraging the Scaling Factors in BN Layers.Batch nor- can even achieve higher accuracy than the original unpruned + malization [19] has been adopted by most modern CNNs network in many cases. + as a standard approach to achieve fast convergence and bet- Multi-pass Scheme. We can also extend the proposedter generalization performance. The way BN normalizes method from single-pass learning scheme (training withthe activations motivates us to design a simple and effi- sparsity regularization, pruning, and fine-tuning) to a multi-cient method to incorporates the channel-wise scaling fac- pass scheme. Specifically, a network slimming proceduretors. Particularly, BN layer normalizes the internal activa- results in a narrow network, on which we could again applytions using mini-batch statistics. Letzin andzout be the the whole training procedure to learn an even more compactinput and output of a BN layer,Bdenotes the current mini- model. This is illustrated by the dotted-line in Figure 2. Ex-batch, BN layer performs the following transformation: perimental results show that this multi-pass scheme can lead + to even better results in terms of compression rate.zzˆ= in −µ B ; zσ2 +ǫ out =γzˆ+β (2) Handling Cross Layer Connections and Pre-activation B Structure. 
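Before turning to cross-layer connections, a minimal PyTorch-style sketch of the two steps just described may help: the subgradient of the L1 term λ·Σ|γ| in (1) is added to the gradients of the BN scaling factors after the usual backward pass, and pruning then compares every |γ| in the network against a single global percentile threshold. This is a sketch under stated assumptions, not the authors' released implementation; the training-loop comments name hypothetical objects (model, criterion, optimizer).

```python
import torch
import torch.nn as nn

def add_l1_subgradient(model, lam=1e-4):
    """After loss.backward(), add the subgradient of lam * sum(|gamma|) to the
    gradients of all BN scaling factors (gamma = the BN weight)."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(lam * torch.sign(m.weight.data))

def global_threshold_masks(model, prune_ratio=0.7):
    """Collect |gamma| from every BN layer, take a single global percentile as the
    threshold, and return a boolean keep-mask per BN layer (channels to retain)."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.sort(gammas)[0][int(prune_ratio * gammas.numel())]
    return [m.weight.data.abs() > threshold
            for m in model.modules()
            if isinstance(m, nn.BatchNorm2d)]

# Typical training step (model, criterion, optimizer, data assumed to exist):
#   loss = criterion(model(x), y)
#   loss.backward()
#   add_l1_subgradient(model, lam=1e-4)   # sparsity on the scaling factors
#   optimizer.step()
# After training, masks = global_threshold_masks(model, 0.7) identify the channels
# to remove; a narrower network is then built from the kept channels and fine-tuned.
```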
The network slimming process introduced + whereµB andσB are the mean and standard deviation val- above can be directly applied to most plain CNN architec- + ues of input activations overB,γandβare trainable affine tures such as AlexNet [22] and VGGNet [31]. While some + transformation parameters (scale and shift) which provides adaptations are required when it is applied to modern net- + the possibility of linearly transforming normalized activa- works withcross layer connectionsand thepre-activation + tions back to any scales. design such as ResNet [15] and DenseNet [17]. For these + It is common practice to insert a BN layer after a convo- networks, the output of a layer may be treated as the input + lutional layer, with channel-wise scaling/shifting parame- of multiple subsequent layers, in which a BN layer is placed + ters. Therefore, we can directly leverage theγparameters in before the convolutional layer. In this case, the sparsity is + BN layers as the scaling factors we need for network slim- achieved at the incoming end of a layer, i.e., the layer selec- + ming. It has the great advantage of introducing no overhead tively uses a subset of channels it received. To harvest the + to the network. In fact, this is perhaps also the most effec- parameter and computation savings at test time, we need + tive way we can learn meaningful scaling factors for chan- to place achannel selectionlayer to mask out insignificant + nel pruning.1), if we add scaling layers to a CNN without channels we have identified. + BN layer, the value of the scaling factors are not meaning- + ful for evaluating the importance of a channel, because both 4. Experiments convolution layers and scaling layers are linear transforma- + tions. One can obtain the same results by decreasing the We empirically demonstrate the effectiveness of network + scaling factor values while amplifying the weights in the slimming on several benchmark datasets. 
We implement + + + + 2739 (a) Test Errors on CIFAR-10 + Model Test error (%) Parameters Pruned FLOPs Pruned + VGGNet (Baseline) 6.34 20.04M - 7.97×10 8 - + VGGNet (70% Pruned) 6.20 2.30M 88.5% 3.91×10 8 51.0% + DenseNet-40 (Baseline) 6.11 1.02M - 5.33×10 8 - + DenseNet-40 (40% Pruned) 5.19 0.66M 35.7% 3.81×10 8 28.4% + DenseNet-40 (70% Pruned) 5.65 0.35M 65.2% 2.40×10 8 55.0% + ResNet-164 (Baseline) 5.42 1.70M - 4.99×10 8 - + ResNet-164 (40% Pruned) 5.08 1.44M 14.9% 3.81×10 8 23.7% + ResNet-164 (60% Pruned) 5.27 1.10M 35.2% 2.75×10 8 44.9% + + (b) Test Errors on CIFAR-100 + Model Test error (%) Parameters Pruned FLOPs Pruned + VGGNet (Baseline) 26.74 20.08M - 7.97×10 8 - + VGGNet (50% Pruned) 26.52 5.00M 75.1% 5.01×10 8 37.1% + DenseNet-40 (Baseline) 25.36 1.06M - 5.33×10 8 - + DenseNet-40 (40% Pruned) 25.28 0.66M 37.5% 3.71×10 8 30.3% + DenseNet-40 (60% Pruned) 25.72 0.46M 54.6% 2.81×10 8 47.1% + ResNet-164 (Baseline) 23.37 1.73M - 5.00×10 8 - + ResNet-164 (40% Pruned) 22.87 1.46M 15.5% 3.33×10 8 33.3% + ResNet-164 (60% Pruned) 23.91 1.21M 29.7% 2.47×10 8 50.6% + (c) Test Errors on SVHN + Model Test Error (%) Parameters Pruned FLOPs Pruned + VGGNet (Baseline) 2.17 20.04M - 7.97×10 8 - + VGGNet (60% Pruned) 2.06 3.04M 84.8% 3.98×10 8 50.1% + DenseNet-40 (Baseline) 1.89 1.02M - 5.33×10 8 - + DenseNet-40 (40% Pruned) 1.79 0.65M 36.3% 3.69×10 8 30.8% + DenseNet-40 (60% Pruned) 1.81 0.44M 56.6% 2.67×10 8 49.8% + ResNet-164 (Baseline) 1.78 1.70M - 4.99×10 8 - + ResNet-164 (40% Pruned) 1.85 1.46M 14.5% 3.44×10 8 31.1% + ResNet-164 (60% Pruned) 1.81 1.12M 34.3% 2.25×10 8 54.9% + Table 1: Results on CIFAR and SVHN datasets. “Baseline” denotes normal training without sparsity regularization. In column-1, “60% + pruned” denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratio of parameters + and FLOPs are also shown in column-4&6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy + could typically be maintained with≥60% channels pruned. + + our method based on the publicly available Torch [5] im- images, from which we split a validation set of 6,000 im- + plementation for ResNets by [10]. The code is available at ages for model selection during training. The test set con- + https://github.com/liuzhuang13/slimming. tains 26,032 images. During training, we select the model + with the lowest validation error as the model to be pruned + 4.1. Datasets (or the baseline model). We also report the test errors of the + models with lowest validation errors during fine-tuning.CIFAR.The two CIFAR datasets [21] consist of natural im- + ages with resolution 32×32. CIFAR-10 is drawn from 10 + and CIFAR-100 from 100 classes. The train and test sets ImageNet. The ImageNet dataset contains 1.2 millioncontain 50,000 and 10,000 images respectively. On CIFAR- training images and 50,000 validation images of 100010, a validation set of 5,000 images is split from the training classes. We adopt the data augmentation scheme as in [10].set for the search ofλ(in Equation 1) on each model. We We report the single-center-crop validation error of the finalreport the final test errors after training or fine-tuning on model.all training images. A standard data augmentation scheme + (shifting/mirroring) [14, 18, 24] is adopted. The input data + is normalized using channel means and standard deviations. MNIST.MNIST is a handwritten digit dataset containingWe also compare our method with [23] on CIFAR datasets. 
60,000 training images and 10,000 test images. To test the + SVHN.The Street View House Number (SVHN) dataset effectiveness of our method on a fully-connected network + [27] consists of 32x32 colored digit images. Following (treating each neuron as a channel with 1×1 spatial size), + common practice [9, 18, 24] we use all the 604,388 training we compare our method with [35] on this dataset. + + + + 2740 4.2. Network Models Model Parameter and FLOP Savings + On CIFAR and SVHN dataset, we evaluate our method 100 100.0% 100.0% 100.0% Original + Parameter Ratio + on three popular network architectures: VGGNet[31], 80 FLOPs Ratio + ResNet [14] and DenseNet [17]. The VGGNet is originally + + Ratio (%) 64.8% + 60 + designed for ImageNet classification. For our experiment a 55.1% + 49.0% 45.0% + variation of the original VGGNet for CIFAR dataset is taken 40 34.8% + from [36]. For ResNet, a 164-layer pre-activation ResNet 20 11.5% + with bottleneck structure (ResNet-164) [15] is used. For 0 + DenseNet, we use a 40-layer DenseNet with growth rate 12 VGGNet DenseNet-40 ResNet-164 + (DenseNet-40). Figure 3: Comparison of pruned models withlowertest errors on On ImageNet dataset, we adopt the 11-layer (8-conv + CIFAR-10 than the original models. The blue and green bars are 3 FC) “VGG-A” network [31] model with batch normaliza- parameter and FLOP ratios between pruned and original models. + tion from [4]. We remove the dropout layers since we use + relatively heavy data augmentation. To prune the neurons mented by building a new narrower model and copying the + in fully-connected layers, we treat them as convolutional corresponding weights from the model trained with sparsity. + channels with 1×1 spatial size. + On MNIST dataset, we evaluate our method on the same Fine-tuning.After the pruning we obtain a narrower and + 3-layer fully-connected network as in [35]. more compact model, which is then fine-tuned. On CIFAR, + SVHN and MNIST datasets, the fine-tuning uses the same + 4.3. Training, Pruning and Fine­tuning optimization setting as in training. For ImageNet dataset, + due to time constraint, we fine-tune the pruned VGG-A withNormal Training.We train all the networks normally from a learning rate of 10 −3 for only 5 epochs.scratch as baselines. All the networks are trained using + SGD. On CIFAR and SVHN datasets we train using mini- 4.4. Results batch size 64 for 160 and 20 epochs, respectively. The ini- + tial learning rate is set to 0.1, and is divided by 10 at 50% CIFAR and SVHNThe results on CIFAR and SVHN are + and 75% of the total number of training epochs. On Im- shown in Table 1. We mark all lowest test errors of a model + ageNet and MNIST datasets, we train our models for 60 inboldface. + and 30 epochs respectively, with a batch size of 256, and an Parameter and FLOP reductions. The purpose of net-initial learning rate of 0.1 which is divided by 10 after 1/3 work slimming is to reduce the amount of computing re-and 2/3 fraction of training epochs. We use a weight de- sources needed. The last row of each model has≥60%cay of10 −4 and a Nesterov momentum [33] of 0.9 without channels pruned while still maintaining similar accuracy todampening. The weight initialization introduced by [13] is the baseline. The parameter saving can be up to 10×. Theadopted. Our optimization settings closely follow the orig- FLOP reductions are typically around50%. To highlightinal implementation at [10]. 
In all our experiments, we ini- network slimming’s efficiency, we plot the resource sav-tialize all channel scaling factors to be 0.5, since this gives ings in Figure 3. It can be observed that VGGNet has ahigher accuracy for the baseline models compared with de- large amount of redundant parameters that can be pruned.fault setting (all initialized to be 1) from [10]. On ResNet-164 the parameter and FLOP savings are rel- + Training with Sparsity.For CIFAR and SVHN datasets, atively insignificant, we conjecture this is due to its “bot- + when training with channel sparse regularization, the hyper- tleneck” structure has already functioned as selecting chan- + parameteerλ, which controls the tradeoff between empiri- nels. Also, on CIFAR-100 the reduction rate is typically + cal loss and sparsity, is determined by a grid search over slightly lower than CIFAR-10 and SVHN, which is possi- + 10 −3 , 10 −4 , 10 −5 on CIFAR-10 validation set. For VG- bly due to the fact that CIFAR-100 contains more classes. + GNet we chooseλ=10 −4 and for ResNet and DenseNet Regularization Effect.From Table 1, we can observe that,λ=10 −5 . For VGG-A on ImageNet, we setλ=10 −5 . All on ResNet and DenseNet, typically when40%channels areother settings are kept the same as in normal training. pruned, the fine-tuned network can achieve a lower test er- + Pruning.When we prune the channels of models trained ror than the original models. For example, DenseNet-40 + with sparsity, a pruning threshold on the scaling factors with 40% channels pruned achieve a test error of 5.19% + needs to be determined. Unlike in [23] where different lay- on CIFAR-10, which is almost 1% lower than the original + ers are pruned by different ratios, we use a global pruning model. We hypothesize this is due to the regularization ef- + threshold for simplicity. The pruning threshold is deter- fect of L1 sparsity on channels, which naturally provides + mined by a percentile among all scaling factors , e.g., 40% feature selection in intermediate layers of a network. We + or 60% channels are pruned. The pruning process is imple- will analyze this effect in the next section. + + + + 2741 VGG-A Baseline 50% Pruned (a) Multi-pass Scheme on CIFAR-10 + Params 132.9M 23.2M IterTrained Fine-tunedParams PrunedFLOPs Pruned + Params Pruned - 82.5% 1 6.38 6.51 66.7% 38.6% + FLOPs 4.57×10 10 3.18×10 10 2 6.23 6.11 84.7% 52.7% + FLOPs Pruned - 30.4% 3 5.87 6.10 91.4% 63.1% + Validation Error (%) 36.69 36.66 4 6.19 6.59 95.6% 77.2% + 5 5.96 7.73 98.3% 88.7% + Table 2: Results on ImageNet. 6 7.79 9.70 99.4% 95.7% + + Model Test Error (%)Params Pruned #Neurons (b) Multi-pass Scheme on CIFAR-100 + Baseline 1.43 - 784-500-300-10 IterTrained Fine-tunedParams PrunedFLOPs Pruned + Pruned [35] 1.53 83.5% 434-174-78-10 1 27.72 26.52 59.1% 30.9% + Pruned (ours) 1.49 84.4% 784-100-60-10 2 26.03 26.52 79.2% 46.1% + 3 26.49 29.08 89.8% 67.3% + Table 3: Results on MNIST. 4 28.17 30.59 95.3% 83.0% + 5 30.04 36.35 98.3% 93.5% + 6 35.91 46.73 99.4% 97.7% + ImageNet. The results for ImageNet dataset are summa- + rized in Table 2. When 50% channels are pruned, the pa- Table 4: Results for multi-pass scheme on CIFAR-10 and CIFAR- + rameter saving is more than 5×, while the FLOP saving 100 datasets, using VGGNet. The baseline model has test errors of + is only 30.4%. This is due to the fact that only 378 (out 6.34% and 26.74%. 
4.4. Results

CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark the lowest test error of each model in boldface.

Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥60% of its channels pruned while still maintaining accuracy similar to the baseline. The parameter saving can be up to 10×, and the FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as a form of channel selection. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.

Regularization Effect. From Table 1, we can observe that on ResNet and DenseNet, typically when 40% of the channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.

ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of the channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve the savings with no accuracy loss on the 1000-class ImageNet dataset, where other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.

Table 2: Results on ImageNet.
VGG-A                 Baseline      50% Pruned
Params                132.9M        23.2M
Params Pruned         -             82.5%
FLOPs                 4.57×10^10    3.18×10^10
FLOPs Pruned          -             30.4%
Validation Error (%)  36.69         36.66

MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well for pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, thus we prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.

Table 3: Results on MNIST.
Model          Test Error (%)   Params Pruned   #Neurons
Baseline       1.43             -               784-500-300-10
Pruned [35]    1.53             83.5%           434-174-78-10
Pruned (ours)  1.49             84.4%           784-100-60-10

We provide some additional experimental results in the supplementary materials, including (1) the detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; and (3) a comparison with a previous channel pruning method [23].

4.5. Results for Multi-pass Scheme

We employ the multi-pass scheme on the CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer will completely destroy the model. Thus, besides setting the percentile threshold to 50%, we also put a constraint that at each layer at most 50% of the channels can be pruned.

The test errors of the models in each iteration are shown in Table 4.

Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity and the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.

(a) Multi-pass scheme on CIFAR-10
Iter   Trained   Fine-tuned   Params Pruned   FLOPs Pruned
1      6.38      6.51         66.7%           38.6%
2      6.23      6.11         84.7%           52.7%
3      5.87      6.10         91.4%           63.1%
4      6.19      6.59         95.6%           77.2%
5      5.96      7.73         98.3%           88.7%
6      7.79      9.70         99.4%           95.7%

(b) Multi-pass scheme on CIFAR-100
Iter   Trained   Fine-tuned   Params Pruned   FLOPs Pruned
1      27.72     26.52        59.1%           30.9%
2      26.03     26.52        79.2%           46.1%
3      26.49     29.08        89.8%           67.3%
4      28.17     30.59        95.3%           83.0%
5      30.04     36.35        98.3%           93.5%
6      35.91     46.73        99.4%           97.7%

As the pruning process goes on, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves 20× parameter reduction and 5× FLOP reduction, while still achieving a lower test error. On CIFAR-100, after iteration 3 the test error begins to increase. This is possibly because it contains more classes than CIFAR-10, so pruning channels too aggressively will inevitably hurt performance. However, we can still prune nearly 90% of the parameters and nearly 70% of the FLOPs without notable accuracy loss.
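The multi-pass scheme alternates the three steps described above, feeding each pruned and fine-tuned model into the next iteration. The sketch below is schematic only: the step functions are placeholders for the procedures of Sections 4.3 and are passed in as callables, so nothing here should be read as the authors' implementation.

```python
# A schematic sketch of the multi-pass scheme: alternate sparsity training,
# global channel pruning and fine-tuning, reusing each pruned model as the
# starting point of the next iteration. All step functions are placeholders.
from typing import Any, Callable, List, Tuple

def multi_pass(model: Any,
               train_with_sparsity: Callable[[Any], Any],
               prune_channels: Callable[[Any, float], Any],
               fine_tune: Callable[[Any], Any],
               evaluate: Callable[[Any], float],
               n_iters: int = 6,
               prune_ratio: float = 0.5) -> Tuple[Any, List[Tuple[int, float]]]:
    history = []
    for it in range(1, n_iters + 1):
        model = train_with_sparsity(model)           # L1 on BN scaling factors
        model = prune_channels(model, prune_ratio)   # global threshold, capped per layer
        model = fine_tune(model)
        history.append((it, evaluate(model)))        # test error after fine-tuning
    return model, history
```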
5. Analysis

There are two crucial hyper-parameters in network slimming: the pruned percentage t and the coefficient λ of the sparsity regularization term (see Equation 1). In this section, we analyze their effects in more detail.

Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ = 10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.

[Figure 5: test error (%) versus percentage of pruned channels, with curves for the baseline, the model trained with sparsity, the pruned model, and the fine-tuned model.]
Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ = 10^-5.

From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate for the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on channel scaling factors.

Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter λ in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network for different λ values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.

[Figure 4: histograms of scaling factor values for λ = 0, λ = 10^-5 and λ = 10^-4.]
Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). With the increase of λ, scaling factors become sparser.

It can be observed that with the increase of λ, the scaling factors become more and more concentrated near zero. When λ = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as a feature selection happening in the intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process with a heatmap. Figure 6 shows the magnitude of the scaling factors from one layer in VGGNet along the training process. Each channel starts with equal weights; as the training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker).

[Figure 6: heatmap of channel scaling factor magnitudes, channel index versus training epoch.]
Figure 6: Visualization of the change of channel scaling factors along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the "selected" channels, the dark lines indicate channels that can be pruned.

6. Conclusion

We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20×) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory and computing operations, while introducing minimal overhead to the training process, and the resulting models require no special libraries or hardware for efficient inference.

Acknowledgements. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015).
Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008 / DFG TRR-169.

References
[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar.torch.
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135–1143, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.
[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and supermodularity. arXiv preprint arXiv:1609.00074, 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.
[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286–297, 2007.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch.
[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
\ No newline at end of file
diff --git a/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt b/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt
new file mode 100644
index 0000000..643bfe2
Binary files /dev/null and b/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt differ
diff --git a/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt b/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt
new file mode 100644
index 0000000..4c089fe
Binary files /dev/null and b/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt differ
diff --git a/Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt b/Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt
new file mode 100644
index 0000000..bdcb2b8
Binary files /dev/null and b/Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt differ
diff --git a/Corpus/Learning to Generalize.txt b/Corpus/Learning to Generalize.txt
new file mode 100644
index 0000000..dac9877
--- /dev/null
+++ b/Corpus/Learning to Generalize.txt
@@ -0,0 +1,933 @@

SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING

MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.

Learning to Generalize

Introduction

Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer.
A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data. How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: The network may not be sufficiently complex to learn the rule completely or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data.
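As a small numerical illustration of this definition (not taken from the article), the generalization error of a trained classifier can be estimated by drawing fresh random inputs and averaging its disagreement with the rule. For concreteness the sketch below takes both the rule ("teacher") and the trained network ("student") to be perceptrons, a picture the article introduces later; the Gaussian input distribution and all names are assumptions made for this example.

```python
# Estimate the generalization error as the probability of misclassifying a
# fresh input: draw new random patterns and compare the student's output with
# the teacher's rule. Purely illustrative; distributions and names are assumed.
import numpy as np

def perceptron_output(w, x):
    return np.sign(x @ w)          # +1 / -1 classification

def estimate_generalization_error(w_student, w_teacher, n_test=100_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_test, w_teacher.size))   # random test inputs
    disagree = perceptron_output(w_student, X) != perceptron_output(w_teacher, X)
    return disagree.mean()
```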
Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit, where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data.

A second important approach, which was introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, they are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide too pessimistic results which are also too crude to reveal interesting behavior in the intermediate region of the learning curve.

In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables the exact calculations of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures.

At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well-known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feedforward neural networks.
Artificial Neural Networks

Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers. Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multiplied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, Σ_i w_i x_i, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between -1 and +1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.

[Figure 1a shows a unit with inputs 0.6, -0.9, 0.8, synaptic weights 1.6, -1.4, -0.1, and weighted sum 1.6 × 0.6 + (-1.4) × (-0.9) + (-0.1) × 0.8 = 2.14, together with sigmoid, linear, and step activation functions; Figure 1b shows a feedforward network.]
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.

The Perceptron

The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings w_i, and the output is simply

    a = Σ_{i=1}^{N} w_i x_i    [1]

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines.

[Figure 2: (a) the perceptron with inputs x_1, ..., x_N and weights w_1, ..., w_N; (b) classification of inputs by a perceptron with two inputs.]
FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.

To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights w_i, for i = 1, ..., N). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign but we decrease them for the opposite sign.
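A minimal sketch of this learning rule is given below; variable names and the stopping criterion are illustrative assumptions. For ±1-valued inputs, each update changes every weight by the same fixed amount, in the direction given by the product of the input sign and the desired output, as described above.

```python
# A minimal sketch of Rosenblatt's learning rule: cycle through the patterns
# and, whenever one is misclassified, move every weight in the direction given
# by the input and the desired output.
import numpy as np

def rosenblatt_train(X, y, eta=1.0, max_cycles=1000):
    """X: (m, N) input patterns; y: (m,) desired labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_cycles):
        mistakes = 0
        for x, label in zip(X, y):
            if np.sign(x @ w) != label:     # pattern not classified correctly
                w += eta * label * x        # increase/decrease each coupling
                mistakes += 1
        if mistakes == 0:                   # all training examples learned
            break
    return w
```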
This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of learning processes in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classify correctly all of the examples (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant.

It is often useful to obtain an intuition of a perceptron's classification performance by thinking in terms of a geometric picture. We may view the numerical values of the inputs as the coordinates of a point in some (usually) high-dimensional space. The case of two dimensions is shown in Fig. 2b. A corresponding point is also constructed for the couplings w_i. The arrow which points from the origin of the coordinate system to this latter point is called the weight vector or coupling vector. An application of linear algebra to the computation of the network shows that the line which is perpendicular to the coupling vector is the boundary between inputs belonging to the two different classes. Input points which are on the same side as the coupling vector are classified as +1 (the green region in Fig. 2b) and those on the other side as -1 (the red region in Fig. 2b).

Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line of the previous two-dimensional example. We can still obtain an intuitive picture by projecting onto two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled) in a 200-dimensional input space are projected on the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm) we obtain the view shown in Fig. 3b, in which the two classes of points are clearly separated and there is even a gap between the two clouds.

[Figure 3: (a) scatter plot of the points projected onto the coordinate axes x_1 and x_2; (b) the same points projected onto a plane containing the direction of the coupling vector.]
FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x_1 and x_2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron.

It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron.

...tions for small changes of the couplings). Hence, in general, in addition to the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is usually a difficult task which could take a large amount of computer time. However, in practice, iterative approaches, which are based on the minimization of other smooth cost functions, are used to train a neural network (Bishop, 1995).
If this happens, we must at- ◗ + + tempt to determine the choice of the coupling which mini- Capacity, VC Dimension, + mizesthenumberoferrorsonagivensetofexamples.Here, and Worst-Case Generalization + Rosenblatt’s algorithm does not work and the problem of + finding the minimum is much more difficult from the algo- As previously shown, perceptrons are only able to realize a + rithmic point. The training error, which is the number of very restricted type of classification rules, the so-called lin- + errorsmadeonthetrainingset,isusuallyanonsmoothfunc- early separable ones. Hence, independently from the issue + tion of the network couplings (i.e., it may have large varia- of finding the best algorithm to learn the rule, one may ask + + + + + + 766 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 767 + + + + + + + + LEARNING TO GENERALIZE + + + the following question: In how many cases will the percep- exp[Nf(m/N)], where the function f(a) vanishes for + tron be able to learn a given set of training examples per- a2 and it is positive for a2. Such a threshold phe- + fectly if the output labels are chosen arbitrarily? In order to nomenon is an example of a phase transition (i.e., a sharp + answer this question in a quantitative way, it is convenient change of behavior) which can occur in the thermodynamic + tointroducesomeconceptssuchascapacity,VCdimension, limit of a large network size. + andworst-casegeneralization,whichcanbeusedinthecase Generally, the point at which such a transition takesof the perceptron and have a more general meaning. place defines the so-called capacity of the neural network.In the case of perceptrons, this question was answered in Although the capacity measures the ability of a network tothe 1960s by Cover (1965). He calculated for any set of in- learn random mappings of the inputs, it is also related to itsput patterns, e.g., m,the fraction of all the 2 m possible map- ability to learn a rule (i.e., to generalize from examples).pings that can be linearly separated and are thus learnable The question now is, how does the network perform on aby perceptrons. This fraction is shown in Fig. 4 as a func- new example after having been trained to learn mexampletion of the number of examples per coupling for different on the training set?numbers of input nodes (couplings) N.Three regions can To obtain an intuitive idea of the connection betweenbe distinguished: capacity and ability to generalize, we assume a training set + Region in which m/N1: Simple linear algebra shows of size mand a single pattern for test. Suppose we define + that it is always possible to learn all mappings when the a possible rule by an arbitrary learnable mapping from + number mof input patterns is less than or equal to the inputs to outputs. If m1 is much larger than the capac- + number Nof couplings (there are simply enough adjustable ity, then for most rules the labels on the mtraining pat- + parameters). terns which the perceptron is able to recognize will nearly + Region in which m/N1: For this region, there are ex- uniquely determine the couplings (and consequently the + amples of rules that cannot be learned. However, when the answer of the learning algorithm on the test pattern), and + number of examples is less than twice the number of cou- therulecanbeperfectlyunderstoodfromtheexamples.Be- + plings (m/N2), if the network is large enough almost all low capacity, in most cases there are two different choices + mappings can be learned. 
If the output labels for each of of couplings which give opposite answers for the test pat- + the minputs are chosen randomly 1 or 1 with equal tern. Hence, a correct classification will occur with proba- + probability, the probability of finding a nonrealizable cou- bility 0.5 assuming all rules to be equally probable. Figure 5 + pling goes to zero exponentially when Ngoes to infinity at displays the two types of situations form3andN2. + fixed ratio m/N. This intuitive connection can be sharpened. Vapnik and + Region in which m/N2: For m/N2 the probabil- Chervonenkis established a relation between a capacity + ity for a mapping to be realizable by perceptrons decreases such as quantity and the generalization ability that is valid + to zero rapidly and it goes to zero exponentially when N for general classifiers (Vapnik, 1982, 1995). The VC dimen- + goes to infinity at fixed ratio m/N(it is proportional to sion is defined as the size of the largest set of inputs for + which all mappings can be learned by the type of classi- + fier. It equals Nfor the perceptron. Vapnik and Chervo- + 1.0 nenkis were able to show that for any training set of size m + + + + + + + + + + + + + + + fraction of realizable mappings 0.8 + + + 0.6 + + + 0.4 ? ? + + + 0.2 + + + 0.0 a b + 01234 FIGURE 5 Classification rules for four patterns based on a m/N perceptron. The patterns colored in red represent the training + FIGURE 4 Fraction of all mappings of minput patterns examples, and triangles and circles represent different class la- + which are learnable by perceptrons as a function of m/Nfor bels. The question mark is a test pattern. (a) There are two + different numbers of couplings N: N10 (in green), N20 possible ways of classifying the test point consistent with the + (in blue), and N100 (in red). examples; (b) only one classification is possible. + + + + + + PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 767 262-A1677 7/24/01 11:12 AM Page 768 + + + + + + + + MANFRED OPPER + + + larger than the VC dimension D , the growth of the num- blue curve in Fig. 6, the minimal training error will decrease VC + ber of realizable mappings is bounded by an expression for increasing complexity of the nets. On the other hand, + which grows much slower than 2 m (in fact, only like a poly- the VC dimension and the complexity of the networks in- + nomial in m). crease with the increasing number of hidden units, leading + They proved that a large difference between training er- to an increasing expected difference (confidence interval) + ror (i.e., the minimum percentage of errors that is done on between training error and generalization error as indi- + the training set) and generalization error (i.e., the proba- cated by the red curve. The sum of both (green curve) will + bility of producing an error on the test pattern after having have a minimum, giving the smallest bound on the general- + learned the examples) of classifiers is highly improbable if ization error. As discussed later, this procedure will in some + the number of examples is well above D . This theorem cases lead to not very realistic estimates by the rather pes- VC + implies a small expected generalization error for perfect simistic bounds of the theory. In other words, the rigorous + learning of the training set results. 
The expected general- bounds, which are obtained from an arbitrary network and + ization error is bounded by a quantity which increases pro- rule, are much larger than those determined from the re- + portionally to D and decreases (neglecting logarithmic sults for most of the networks and rules. VC + corrections in m) inversely proportional to m. ................................................Conversely, one can construct a worst-case distribution ◗ + + of input patterns, for which a size of the training set larger Typical Scenario: The Approach + than D is also necessary for good generalization. The VC of Statistical Physics VC + results should, in practice, enable us to select the network + with the proper complexity which guarantees the smallest When the number of examples is comparable to the size of + bound on the generalization error. For example, in order the network, which for a perceptron equals the VC dimen- + tofind the proper size of the hidden layer of a network with sion, the VC theory states that one can construct malicious + twolayers,onecouldtrainnetworksofdifferentsizesonthe situations which prevent generalizations. However, in gen- + same data. eral, we would not expect that the world acts as an adver- + The relation among these concepts can be better under- sary. Therefore, how should one model a typical situation? + stood if we consider a family of networks of increasing com- As a first step, one may construct rules and pattern dis- + plexity which have to learn the same rule. A qualitative pic- tributions which act together in a nonadversarial way. The + ture of the results is shown in Fig. 6. As indicated by the teacher–student paradigm has proven to be useful in such a + situation. Here, the rule to be learned is modeled by a sec- + ondnetwork,theteachernetwork;inthiscase,iftheteacher + and the student have the same architecture and the same + upper bound on numberofunits,theruleisevidentlyrealizable.Thecorrect generalization error class labels for any inputs are given by the outputs of the + teacher. Within this framework, it is often possible to ob- + tain simple expressions for the generalization error. For a + upper bound on perceptron, we can use the geometric picture to visualize confidence interval the generalization error. A misclassification of a new in- + put vector by a student perceptron with coupling vector ST + occurs only if the input pattern is between the separating + planes (dashed region in Fig. 7) defined by ST and the vec- + tor of teacher couplings TE. If the inputs are drawn ran- training error domlyfromauniformdistribution,thegeneralizationerror + is directly proportional to the angle between ST and TE. + network complexity Hence, the generalization error is small when teacher and + student vectors are close together and decreases to zero + when both coincide. + In the limit, when the number of examples is very large + all the students which learn the training examples perfectly + will not differ very much from and their couplings will be FIGURE 6 As the complexity of the network varies (i.e., close to those of the teacher. Such cases with a small gen- of the number of hidden units, as shown schematically below), + the generalization error (in red), calculated from the sum of eralization error have been successfully treated by asymp- + the training error (in green) and the confidence interval (in totic methods of statistics. 
On the other hand, when the + blue) according to the theory of Vapnik–Chervonenkis, shows number of examples is relatively small, there are many dif- + a minimum; this corresponds to the network with the best gen- ferent students which are consistent with the teacher re- + eralization ability. garding the training examples, and the uncertainty about + + + + 768 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 769 + + + + + + + + LEARNING TO GENERALIZE + + + with the number of couplings N(like typical volumes in + N-dimensional spaces) and Bdecreases exponentially with + m(because it becomes more improbable to be correct ST mtimes for any e0), both factors can balance each other + when mincreases like maN.ais an effective measure for TE the size of the training set when Ngoes to infinity. In order + to have quantities which remain finite as NSq, it is also + useful to take the logarithm of V(e) and divide by N, which + transforms the product into a sum of two terms. The first + one (which is often called the entropic term) increases with + increasing generalization error (green curve in Fig. 8). This + FIGURE 7 For a uniform distribution of patterns, the gen- is true because there are many networks which are not + eralization error of a perceptron equals the area of the similar to the teacher, but there is only one network equal + shaded region divided by the area of the entire circle. ST and to the teacher. For almost all networks (remember, the + TE represent the coupling vectors of the student and teacher, entropic term does not include the effect of the training ex- + respectively. amples) e0.5, i.e., they are correct half of the time by + random guessing. On the other hand, the second term (red + curve in Fig. 8) decreases with increasing generalization er- + the true couplings of the teacher is large. Possible general- ror because the probability of being correct on an input + ization errors may range from zero (if, by chance, a learn- pattern increases when the student network becomes more + ing algorithm converges to the teacher) to some worst-case similar to the teacher. It is often called the energetic contri- + value. We may say that the constraint which specifies the butionbecauseitfavorshighlyordered(towardtheteacher) + macrostateofthenetwork(itstrainingerror)doesnotspec- network states, reminiscent of the states of physical systems + ify the microstate uniquely. Nevertheless, it makes sense to at low energies. Hence, there will be a maximum (Fig. 8, ar- + speak of a typical value for the generalization error, which row) of V(e) at some value of ewhich by definition is the + is defined as the value which is realized by the majority of typical generalization error. + the students. In the thermodynamic limit known from sta- The development of the learning process as the number + tistical physics, in which the number of parameters of the of examples aNincreases can be understood as a compe- + network is taken to be large, we expect that in fact almost tition between the entropic term, which favors disordered + all students belong to this majority, provided the quantity network configurations that are not similar to the teacher, + of interest is a cooperative effect of all components of the andtheenergeticterm.Thelattertermdominateswhenthe + system. As the geometric visualization for the generaliza- number of examples is large. It will later be shown that such + tion error of the perceptron shows, this is actually the case. 
a competition can lead to a rich and interesting behavior as + The following approach, which was pioneered by Elizabeth the number of examples is varied. The result for the learn- + Gardner (Gardner, 1988; Gardner and Derrida, 1989), is ing curve (Györgyi and Tishby, 1990; Sompolinsky et al., + based on the calculation of V(e), the volume of the space + of couplings which both perfectly implement mtraining + examples and have a given generalization error e. For an + intuitive picture, consider that only discrete values for the entropic contribution + couplings are allowed; then V(e) would be proportional to + the number of students. The typical value of the general- + ization error is the value of e, which maximizes V(e). It + should be kept in mind that V(e) is a random number and energetic contribution + fluctuates from one training set to another. A correct treat- 1/N logfV(ε)g + ment of this randomness requires involved mathematical + techniques (Mézard et al.,1987). To obtain a picture which + is quite often qualitatively correct, we may replace it by its + average over many realizations of training sets. From ele- + mentary probability theory we see that this average num- maximum + ber can be found by calculating the volume Aof the space 0 0.1 0.2 0.3 0.4 0.5 + of all students with generalization error e, irrespective of ε + their behavior on the training set, and multiplying it by FIGURE 8 Logarithm of the average volume of students that + the probability Bthat a student with generalization error e havelearnedmexamplesandgiveegeneralizationerror(green + gives mtimes the correct answers on independent draw- curve). The blue and red curves represent the energetic and + ings of the input patterns. Since Aincreases exponentially entropic contributions, respectively. + + + + PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 769 262-A1677 7/24/01 11:12 AM Page 770 + + + + + + + + MANFRED OPPER + + + 0.5 student is free to ask the teacher questions, i.e., if the stu- + ε dent can choose highly informative input patterns. For the + simple perceptron a fruitful query strategy is to select a new 0.4 input vector which is perpendicular to the current coupling + vector of the student (Kinzel and Ruján, 1990). Such an + 0.3 input is a highly ambiguous pattern because small changes + continuous couplings in the student couplings produce different classification an- + swers. For more complicated networks it may be difficult 0.2 to obtain similar ambiguous inputs by an explicit construc- + tion. A general algorithm has been proposed (Seung et al., + 0.1 1992a) which uses the principle of maximal disagreement discrete couplings in a committee of several students as a selection process for + training patterns. Using an appropriate randomized train- 0.00.0 0.1 0.2 0.3 0.4 0.5 0. 6 ingstrategy,differentstudentsaregeneratedwhichalllearn α the same set of examples. Next, any new input vector is only + FIGURE 9 Learning curves for typical student perceptrons. accepted for training when the disagreement of its classi- + am/Nis the ratio between the number of examples and the fication between the students is maximal. For a committee + coupling number. of two students it can be shown that when the number of + examples is large, the information gain does not decrease + but reaches a positive constant. This results in a much faster + 1990) of a perceptron obtained by the statistical physics ap- decrease of the generalization error. 
Instead of being in- + proach (treating the random sampling the proper way) is versely proportional to the number of examples, the de- + shown by the red curve of Fig. 9. In contrast to the worst- crease is now exponentially fast. + casepredictionsoftheVCtheory,itispossibletohavesome ................................................generalization ability below VC dimension or capacity. As ◗ + + we might have expected, the generalization error decreases Bad Students and Good Students + monotonically, showing that the more that is learned, the + more that is understood. Asymptotically, the error is pro- Although the typical student perceptron has a smooth, + portional to Nand inversely proportional to m, in agree- monotonically decreasing learning curve, the possibility + ment with the VC predictions. This may not be true for that some concrete learning algorithm may result in a set + more complicated networks. of student couplings which are untypical in the sense of + our theory cannot be ruled out. For bad students, even non-................................................ ◗ monotic generalization behavior is possible. The problem + Query Learning of a concrete learning algorithm can be made to fit into the + statistical physics framework if the algorithm minimizes a + Soon after Gardner’s pioneering work, it was realized that certain cost function. Treating the achieved values of the + the approach of statistical physics is closely related to ideas new cost function as a macroscopic constraint, the tools of + in information theory and Bayesian statistics (Levin et al., statistical physics apply again. + 1989;GyörgyiandTishby,1990;OpperandHaussler,1991), As an example, it is convenient to consider a case in + for which the reduction of an initial uncertainty about the which the teacher and the student have a different archi- + true state of a system (teacher) by observing data is a cen- tecture: In one of the simplest examples one tries to learn + tral topic of interest. The logarithm of the volume of rele- a classification problem by interpreting it as a regression + vant microstates as defined in the previous section is a di- problem, i.e., a problem of fitting a continuous function + rect measure for such uncertainty. The moderate progress through data points. To be specific, we study the situation + in generalization ability displayed by the red learning curve in which the teacher network is still given by a percep- + of Fig. 9 can be understood by the fact that as learning pro- tron which computes binary valued outputs of the form + gresses less information about the teacher is gained from a ywx, 1, but as the student we choose a network i i i + newrandomexample.Here,theinformationgainisdefined with a linear transfer function (the yellow curve in Fig. 1a) + as the reduction of the uncertainty when a new example is + learned. The decrease in information gain is due to the in- Y awxi i + crease in the generalization performance. This is plausible i + because inputs for which the majority of student networks and try to fit this linear expression to the binary labels of + give the correct answer are less informative than those for the teacher. If the number of couplings is sufficiently large + which a mistake is more likely. 
The situation changes if the (larger than the number of examples) the linear function + + + + + 770 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 771 + + + + + + + + LEARNING TO GENERALIZE + + + (unlike the sign) is perfectly able to fit arbitrary continuous the student learns all examples perfectly. Although it may + output values. This linear fit is an attempt to explain the not be easy to construct a learning algorithm which per- + data in a more complicated way than necessary, and the forms such a maximization in practice, the resulting gener- + couplings have to be finely tuned in order to achieve this alization error can be calculated using the statistical phys- + goal. We find that the student trained in such a way does ics approach (Engel and Van den Broeck, 1993). The result + not generalize well (Opper and Kinzel, 1995). In order to is in agreement with the VC theory: There is no prediction + compare the classifications of teacher and student on a new better than random guessing below the capacity. + random input after training, we have finally converted the Although the previous algorithms led to a behavior + student’s output into a classification label by taking the sign whichisworsethanthetypicalone,wenowexaminetheop- + of its output. As shown in the red curve of Fig. 10, after positecaseofanalgorithmwhichdoesbetter.Sincethegen- + an initial improvement of performance the generalization eralization ability of a neural network is related to the fact + error increases again to the random guessing value e0.5 that similar input vectors are mapped onto the same out- + at a1 (Fig. 10, red curve). This phenomenon is called put, one can assume that such a property can be enhanced + overfitting.For a1 (i.e., for more data than parameters), if the separating gap between the two classes is maximized, + it is no longer possible to have a perfect linear fit through which defines a new cost function for an algorithm. This + the data, but a fit with a minimal deviation from a linear optimal margin perceptron can be practically realized and + function leads to the second part of the learning curve.ede- when applied to a set of data leads to the projection of + creases again and approaches 0 asymptotically for aSq. Fig. 11. As a remarkable result, it can be seen that there is a + This shows that when enough data are available, the details relatively large fraction of patterns which are located at the + of the training algorithm are less important. gap. These points are called support vectors(SVs). In order + The dependence of the generalization performance on to understand their importance for the generalization abil- + the complexity of the assumed data model is well-known. If ity, we make the following gedankenexperimentand assume + function class is used that is too complex, data values can be that all the points which lie outside the gap (the nonsupport + perfectly fitted but the predicted function will be very sen- vectors) are eliminated from the training set of examples. + sitive to the variations of the data sample, leading to very From the two-dimensional projection of Fig. 11, we may + unreliable predictions on novel inputs. On the other hand, conjecture that by running the maximal margin algorithm + functions that are too simple make the best fit almost insen- on the remaining examples (the SVs) we cannot create a + sitive to the data, which prevents us from learning enough larger gap between the points. Hence, the algorithm will + from them. 
converge to the same separating hyperplane as before. This + It is also possible to calculate the worst-case generaliza- intuitive picture is actually correct. If the SVs of a training + tion ability of perceptron students learning from a percep- set were known beforehand (unfortunately, they are only + tron teacher. The largest generalization error is obtained identified after running the algorithm), the margin classi- + (Fig. 7) when the angle between the coupling vectors of fier would have to be trained only on the SVs. It would au- + teacher and student is maximized under the constraint that tomatically classify the rest of the training inputs correctly. + + + + + + 0.50 + ε + 0.40 + + + 0.30 linear student + + + 0.20 + margin classifier + + 0.10 + + + 0.000123456 α + FIGURE 10 Learning curves for a linear student and for a FIGURE 11 Learning with a margin classifier and m300 + margin classifier. am/N. examples in an N150-dimensional space. + + + + + PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 771 262-A1677 7/24/01 11:12 AM Page 772 + + + + + + + + MANFRED OPPER + + + Hence, if in an actual classification experiment the number ber of consistent students is small; nevertheless, the few re- + of SVs is small compared to the number of non-SVs, we maining ones must still differ in a finite fraction of bits from + may expect a good generalization ability. each other and from the teacher so that perfect generaliza- + The learning curve for a margin classifier (Opper and tion is still impossible. For aslightly above a only the cou- c + Kinzel, 1995) learning from a perceptron teacher (calcu- plings of the teacher survive. + lated by the statistical physics approach) is shown in Fig. 10 + (blue curve). The concept of a margin classifier has recently ................................................ + been generalized to the so-called support vector machines ◗ + + Learning with Errors (Vapnik, 1995), for which the inputs of a perceptron are re- + placed by suitable features which are cleverly chosen non- + linear functions of the original inputs. In this way, nonlin- The example of the Ising perceptron teaches us that it will + ear separable rules can be learned, providing an interesting not always be simple to obtain zero training error. More- + alternative to multilayer networks. over, an algorithm trying to achieve this goal may get stuck + in local minima. Hence, the idea of allowing errors explic- + itly in the learning procedure, by introducing an appropri-................................................ ◗ ate noise, can make sense. An early analysis of such a sto- + The Ising Perceptron chastic training procedure and its generalization ability for + the learning in so-called Boolean networks (with elemen- + The approach of statistical physics can develop a specific tary computing units different from the ones used in neural + predictivepowerinsituationsinwhichonewouldliketoun- networks) can be found in Carnevali and Patarnello (1987). + derstand novel network models or architectures for which A stochastic algorithm can be useful to escape local min- + currently no efficient learning algorithm is known. As the ima of the training error, enabling a better learning of the + simplest example, we consider a perceptron for which the training set. 
The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and −1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression about the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures (T) we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples.
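The Metropolis-style procedure just described is simple to state in code. The sketch below is not from the original article; it trains an Ising perceptron with couplings in {−1, +1} on teacher-generated examples, always accepting a single-coupling flip that lowers the training error and accepting a flip that raises it by Δ with probability exp(−Δ/T). The system size, the temperature, and the number of proposed flips are arbitrary illustrative choices.

    # Hedged sketch: Metropolis training of a binary-coupling (Ising) perceptron.
    import numpy as np

    rng = np.random.default_rng(2)
    N, alpha, T = 101, 2.0, 0.5              # odd N avoids ties in the sign
    m = int(alpha * N)

    w_teacher = rng.choice([-1, 1], size=N)
    X = rng.choice([-1, 1], size=(m, N))
    y = np.sign(X @ w_teacher)

    def training_errors(w):
        return int(np.sum(np.sign(X @ w) != y))

    w = rng.choice([-1, 1], size=N)          # random initial student
    E = training_errors(w)
    for _ in range(20000):
        j = rng.integers(N)
        w[j] = -w[j]                         # propose flipping one coupling
        E_new = training_errors(w)
        if E_new <= E or rng.random() < np.exp(-(E_new - E) / T):
            E = E_new                        # accept the move
        else:
            w[j] = -w[j]                     # reject: undo the flip
    print("training errors:", E, "of", m, "  overlap with teacher:", float(w @ w_teacher) / N)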
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with ε = 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will converge always to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α (horizontal axis: ε; curves for α4 > α3 > α2 > α1).

More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

FIGURE 13 A two-layer network with tree architecture. The arrow indicates the direction of propagation of the information.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units—that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs—that is, a minus results from an odd number of negative hidden units and a plus from an even number.
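The two fixed output functions can be written down directly. The following small sketch (not from the original article) implements both machines for a tree architecture in which only the first-layer couplings would be learned; the number of hidden units and the block size are arbitrary illustrative values.

    # Hedged sketch: committee machine vs. parity machine with tree architecture.
    import numpy as np

    rng = np.random.default_rng(3)
    K, n_branch = 3, 50                       # K hidden units, disjoint input blocks
    W = rng.standard_normal((K, n_branch))    # first-layer couplings (the adaptive ones)

    def hidden_signs(x):
        blocks = x.reshape(K, n_branch)       # each hidden unit sees its own block
        return np.sign(np.sum(W * blocks, axis=1))

    def committee(x):
        return np.sign(np.sum(hidden_signs(x)))   # majority vote of the hidden units

    def parity(x):
        return np.prod(hidden_signs(x))           # parity of the hidden units

    x = rng.standard_normal(K * n_branch)
    print("committee output:", committee(x), "  parity output:", parity(x))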
For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction. Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter. In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

FIGURE 14 Learning curves for the parity machine with tree architecture. Each curve represents the generalization error ε as a function of α and is distinguished by the number of hidden units of the network.

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to −1 acts exactly as the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.
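The sign-reversal symmetry invoked above is easy to verify directly. The tiny check below (not from the original article; sizes are arbitrary) builds a two-hidden-unit parity machine with tree architecture and confirms that flipping the sign of all couplings of both hidden units leaves every output unchanged, so the "reversed" student classifies exactly like the teacher on any input.

    # Hedged check: a global sign flip of a two-hidden-unit parity machine changes nothing.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 40
    W = rng.standard_normal((2, n))           # couplings of the two hidden units

    def parity_output(W, X):
        # X has shape (n_samples, 2, n): each hidden unit sees its own input block
        h = np.sign(np.einsum("kj,skj->sk", W, X))
        return np.prod(h, axis=1)

    X = rng.standard_normal((1000, 2, n))
    print("identical outputs after sign reversal:",
          bool(np.all(parity_output(W, X) == parity_output(-W, X))))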
Outlook

The worst-case approach of the VC theory and the typical case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research.
Acknowledgments

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.

References Cited

AMARI, S., and MURATA, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
BARKAI, E., HANSEL, D., and KANTER, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
CARNEVALI, P., and PATARNELLO, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
ENGEL, A., and VAN DEN BROECK, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
GARDNER, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
GARDNER, E., and DERRIDA, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
GYÖRGYI, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
GYÖRGYI, G., and TISHBY, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
HANSEL, D., MATO, G., and MEUNIER, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
KINZEL, W., and RUJÀN, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
LEVIN, E., TISHBY, N., and SOLLA, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
MÉZARD, M., PARISI, G., and VIRASORO, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
MONASSON, R., and ZECCHINA, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
OPPER, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
OPPER, M., and HAUSSLER, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
OPPER, M., and KINZEL, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
SAAD, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
SCHWARZE, H., and HERTZ, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
SCHWARZE, H., and HERTZ, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
SEUNG, H. S., SOMPOLINSKY, H., and TISHBY, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
SEUNG, H. S., OPPER, M., and SOMPOLINSKY, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
SOMPOLINSKY, H., TISHBY, N., and SEUNG, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
URBANCZIK, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
VALLET, F., CAILTON, J., and REFREGIER, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
VAPNIK, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
VAPNIK, V. N., and CHERVONENKIS, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

ARBIB, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
HERTZ, J. A., KROGH, A., and PALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
WATKIN, T. L. H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.
\ No newline at end of file
diff --git a/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt b/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt
new file mode 100644
index 0000000..2be843a
Binary files /dev/null and b/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt differ
diff --git a/Corpus/MOGRIFIER LSTM.txt b/Corpus/MOGRIFIER LSTM.txt
new file mode 100644
index 0000000..c75f02e
Binary files /dev/null and b/Corpus/MOGRIFIER LSTM.txt differ
diff --git a/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt b/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt
new file mode 100644
index 0000000..5741d6c
--- /dev/null
+++ b/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt
@@ -0,0 +1,1145 @@

 Deep Learning for Visual Understanding: Part 2

 Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang

 Model Compression and Acceleration for Deep Neural Networks

 The principles, progress, and challenges

 In recent years, deep neural networks (DNNs) have received increased attention, have been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of graphics processing units (GPUs) with very high computation capability plays a key role in their success. For example, Krizhevsky et al.
[1] + achieved breakthrough results in the 2012 ImageNet Challenge + using a network containing 60 million parameters with five + convolutional layers and three fully connected layers. Usu- + ally, it takes two to three days to train the whole model on the + ImagetNet data set with an NVIDIA K40 machine. In another + example, the top face-verification results from the Labeled + Faces in the Wild (LFW) data set were obtained with networks + containing hundreds of millions of parameters, using a mix + of convolutional, locally connected, and fully connected layers + [2], [3]. It is also very time-consuming to train such a model + to obtain a reasonable performance. In architectures that only + rely on fully connected layers, the number of parameters can + grow to billions [4]. + + + Introduction + As larger neural networks with more layers and nodes are + considered, reducing their storage and computational cost + becomes critical, especially for some real-time applications ©Istockphoto.com/zapp2photo + such as online learning and incremental learning. In addition, + recent years witnessed significant progress in virtual real- + ity, augmented reality, and smart wearable devices, creating + unprecedented opportunities for researchers to tackle fun- + damental challenges in deploying deep-learning systems to + portable devices with limited resources [e.g., memory, central + processing units (CPUs), energy, bandwidth]. Efficient deep- + learning methods can have a significant impact on distributed + systems, embedded devices, and field-programmable gate ar- + ray (FPGA) for artificial intelligence (AI). For example, the + residual network-50 (ResNet-50) [5], which has 50 convolu- + tional layers, needs more than 95 megabytes of memory for Digital Object Identifier 10.1109/MSP.2017.2765695 + Date of publication: 9 January 2018 storage, and numerous floating number multiplications for + + + 126 IEEE SIgnal ProcESSIng MagazInE | January 2018 | 1053-5888/18©2018IEEE calculating each image. After discarding As larger neural networks volutional layers only. Low-rank factoriza- + some redundant weights, the network still with more layers and tion and transferred/compact filters-based + works as usual but saved more than 75% of nodes are considered, approaches provide an end-to-end pipeline + parameters and 50% computational time. reducing their storage and can be easily implemented in a CPU/ + For devices like cell phones and FPGAs GPU environment, which is straightfor- + with only several megabyte resources, how and computational ward, while parameter pruning and sharing + to compact the models used on them is cost becomes critical, use different methods such as vector quan- + also important. especially for some real- tization, binary coding, and sparse con- + Achieving these goals calls for joint time applications such straints to perform the task. Usually, it will + solutions from many disciplines, including as online learning and take several steps to achieve the goal. + but not limited to machine learning, opti- incremental learning. Regarding training protocols, models + mization, computer architecture, data com- based on parameter pruning/sharing low- + pression, indexing, and hardware design. 
rank factorization can be extracted from + In this article, we review recent works on compressing and pretrained ones or trained from scratch, while the transferred/ + accelerating DNNs, which attracted much attention from the compact filter and KD models can only support training from + deep-learning community and has already achieved signifi- scratch. These methods are independently designed and com- + cant progress in past years. plement each other. For example, transferred layers and pa- + We classify these approaches into four categories: rameter pruning and sharing can be used together, and model + 1) Parameter pruning and sharing: The parameter pruning quantization and binarization can be used together with low- + and sharing-based methods explore the redundancy in the rank approximations to achieve further speedup. We will de- + model parameters and try to remove the redundant and scribe the details of each theme and their properties, strengths, + noncritical ones. and drawbacks in the following sections. + 2) Low-rank factorization: Low-rank factorization-based + techniques use matrix/tensor decomposition to estimate the Parameter pruning and sharing + informative parameters of the deep convolutional neural An early work that showed that network pruning is effective in + networks (CNNs). reducing the network complexity and addressed the overfitting + 3) Transferred/compact convolutional filters: The trans- problem is [6]. Since then, it has been widely studied to compress + ferred/compact convolutional filters-based approaches DNN models, trying to remove parameters that are not crucial to + design special structural convolutional filters to reduce the the model performance. These techniques can be further classi- + storage and computation complexity. fied into three categories: model quantization and binarization, + 4) Knowledge distillation (KD): The KD methods learn a dis- parameter sharing, and structural matrix. + tilled model and train a more compact neural network to + reproduce the output of a larger network. Quantization and binarization + In Table 1, we briefly summarize these four types of meth- Network quantization compresses the original network by + ods. Generally, the parameter pruning and sharing, low-rank reducing the number of bits required to represent each weight. + factorization, and KD approaches can be used in DNNs with Gong et al. [6] and Wu et al. [7] applied k-means scalar quanti- + fully connected layers and convolutional layers, achieving zation to the parameter values. Vanhoucke et al. [8] showed that + comparable performances. On the other hand, methods using 8-bit quantization of the parameters can result in significant + transferred/compact filters are designed for models with con- speedup with minimal loss of accuracy. The work in [9] used + + + + + Table 1. A summary of different approaches for network compression. 
+ Theme Name Description Applications More Details + Parameter pruning and sharing Reducing redundant parameters that Convolutional layer and Robust to various settings, can achieve + are not sensitive to the performance fully connected layer good performance, can support both train- + ing from scratch and pretrained model + Low-rank factorization Using matrix/tensor decomposition to Convolutional layer and Standardized pipeline, easily implement- + estimate the informative parameters fully connected layer ed, can support both training from scratch + and pretrained model + Transferred/compact Designing special structural convolutional Only for convolutional layer Algorithms are dependent on applications, + convolutional filters filters to save parameters usually achieve good performance, only + support training from scratch + KD Training a compact neural network with Convolutional layer and Model performances are sensitive to + distilled knowledge of a large model fully connected layer applications and network structure, only + support training from scratch + + + IEEE SIgnal ProcESSIng MagazInE | January 2018 | 127 16-bit fixed-point representation in stochastic rounding-based er drawback of these binary nets is that existing binarization + CNN training, which significantly reduced memory usage and schemes are based on simple matrix approximations and ignore + float- point operations with little loss in classification accuracy. the effect of binarization on the accuracy loss. To address + The method proposed in [10] first pruned the unimportant con- this issue, the work in [17] proposed a proximal Newton algo- + nections and retrained the sparsely connected networks. Then it rithm with diagonal Hessian approximation that directly mini- + quantized the link weights using weight-sharing, and then applied mizes the loss with respect to the binary weights. The work in + Huffman coding to the quantized weights as [18] significantly reduced the time on float- + well as the codebook to further reduce the point multiplication in the training stage by + rate. As shown in Figure 1 , it starts by learn- Network pruning and stochastically binarizing weights and con- + ing the connectivity via normal network train- sharing has been used verting multiplications in the hidden state + ing, followed by pruning the small-weight both to reduce network computation to sign changes. + connections. Finally, the network is retrained complexity and to address to learn the final weights for the remaining the overfitting issue. Pruning and sharing + sparse connections. This work achieves the Network pruning and sharing has been used + state-of-the-art performance among all param- both to reduce network complexity and to + eter quantization-based methods. It was shown in [11] that Hes- address the overfitting issue. An early approach to pruning was + sian weight could be used to measure the importance of network biased weight decay [19]. The optimal brain damage [20] and + parameters and proposed to minimize Hessian-weighted quantiza- the optimal brain surgeon [21] methods reduced the number + tion errors on average for clustering network parameters. A novel of connections based on the Hessian of the loss function, and + quantization framework was introduced in [12], which reduced the their works suggested that such pruning gave higher accuracy + precision of network weights to ternary values. 
than magnitude-based pruning such as the weight decay meth- + In the extreme case of 1-bit representation of each weight, i.e., od. Those methods supported training from scratch. + binary weight neural networks, there are also many works that A recent trend in this direction is to prune redundant, non- + directly train CNNs with binary weights; for instance, Binary- informative weights in a pretrained CNN model. For example, + Connect [13], BinaryNet [14], and XNORNetworks [15]. The Srinivas and Babu [22] explored the redundancy among neurons + main idea is to directly learn binary weights or activations dur- and proposed a data-free pruning method to remove redundant + ing the model training. The systematic study in [16] showed that neurons. Han et al. [23] proposed to reduce the total number of + networks trained with backpropagation could be robust against parameters and operations in the entire network. Chen et al. [24] + (robust against or resilient to) specific weight distortions, includ- proposed a HashedNets model that used a low-cost hash function + ing binary weights. to group weights into hash buckets for parameter sharing. The + deep compression method in [10] removed the redundant connec- + Drawbacks tions and quantized the weights and then used Huffman coding + However, the accuracy of such binary nets is significantly low- to encode the quantized weights. In [25], a simple regularization + ered when dealing with large CNNs such as GoogleNet. Anoth- method based on soft weight-sharing was proposed, which + + + + + + + + Cluster the Weights + + Train ConnectivityOriginal Compressed + Network NetworkGenerate Codebook Encode Weights + + Prune Connections + Quantize the Weights + with Codebook Encode Index + Train Weights + + Retrain Codebook + + + + + Figure 1. The three-stage compression method proposed in [10]: pruning, quantization, and encoding. The input is the original model, and the output is + the compression model. + + + 128 IEEE SIgnal ProcESSIng MagazInE | January 2018 | included both quantization and pruning in one simple (re)train- Thus the memory cost becomes O()d instead of O()d2 . + ing procedure. It is worth noting that the aforementioned prun- This circulant structure also enables the use of fast Fou- + ing schemes typically produce connection pruning in CNNs. rier transform (FFT) to speed up the computation. Given a + There is also growing interest in training compact CNNs d-dimensional vector r, the 1-layer circulant neural network + with sparsity constraints. Those sparsity constraints are in (1) has time complexity of O()ddlog . + typically introduced in the optimization In [31], a novel adaptive fastfood trans- + problem as l0 or l1 -norm regularizers. CNNs are parameter-efficient form was introduced to reparameterize the + The work in [26] imposed group sparsity due to exploring the matrix-vector multiplication of fully con- + constraints on the convolutional filters to nected layers. The adaptive fastfood trans- + achieve structured brain damage, i.e., prun- translation invariant property form matrix RR! nd# was defined as + ing entries of the convolution kernels in a of the representations to + group-wise fashion. In [27], a group-sparse input image, which is the key RS= HGPHB. (2) + regularizer on neurons was introduced to the success of training during the training stage to learn compact very deep models without Here, SG,, and B are random diago- + CNNs with reduced filters. Wen et al. [28] nal matrices. P!{,01}dd# + severe overfitting. 
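The pruning-plus-quantization pipeline summarized above (and in Figure 1) can be illustrated in a few lines. The sketch below is not the authors' implementation: it simply prunes the smallest-magnitude weights of a toy layer and then clusters the survivors into a small shared codebook with 1-D k-means, in the spirit of the three-stage method of [10]. The layer size, the 80% pruning ratio, and the 16-entry codebook are arbitrary illustrative choices, and Huffman coding of the indices is omitted.

    # Hedged sketch: magnitude pruning followed by k-means scalar quantization of one layer.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 128)).astype(np.float32)   # a toy layer's weights

    # 1) magnitude pruning: drop the 80% smallest-magnitude weights
    threshold = np.quantile(np.abs(W), 0.8)
    mask = np.abs(W) >= threshold
    survivors = W[mask]

    # 2) k-means scalar quantization of the surviving weights (1-D Lloyd iterations)
    k = 16                                                   # a 4-bit codebook
    codebook = np.quantile(survivors, np.linspace(0, 1, k))  # simple initialization
    for _ in range(20):
        idx = np.argmin(np.abs(survivors[:, None] - codebook[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                codebook[c] = survivors[idx == c].mean()

    W_compressed = np.zeros_like(W)
    W_compressed[mask] = codebook[idx]                       # shared values + zeros
    print("nonzero weights:", int(mask.sum()), "of", W.size, "  codebook size:", k)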
is a random + added a structured sparsity regularizer on permutation matrix and H denotes the + each layer to reduce trivial filters, chan- Walsh–Hadamard matrix. Reparameteriz- + nels, or even layers. In filter-level pruning, all of the afore- ing a fully connected layer with d inputs and n outputs using + mentioned works used l21, -norm regularizers. The work in [29] the adaptive fastfood transform reduces the storage and the + used l1 -norm to select and prune unimportant filters. computational costs from O()nd to O()n and from O()nd to + O()ndlog , respectively. + Drawbacks The work in [32] showed the effectiveness of the new notion + There are some potential issues of the pruning and sharing of parsimony in the theory of structured matrices. Their pro- + works. First, pruning with l1 or l2 regularization requires posed method can be extended to various other structured matrix + more iterations to converge. Furthermore, all pruning criteria classes, including block and multilevel Toeplitz-like [33] matrices + require manual setup of sensitivity for layers, which demands related to multidimensional convolution [34]. + fine-tuning of the parameters and could be cumbersome for + some applications. Drawbacks + One potential problem of this kind of approach is that the struc- + Designing the structural matrix tural constraint will cause loss in accuracy since the constraint + In architectures that contain only fully connected layers, the might bring bias to the model. On the other hand, how to find a + number of parameters can grow up to billions [4]. Thus, it is proper structural matrix is difficult. There is no theoretical way + critical to explore this redundancy of parameters in fully con- from which to derive it. + nected layers, which is often the bottleneck in terms of memory + consumption. These network layers use the nonlinear transforms Low-rank factorization and sparsity + f(,xM)(=v Mx), where v ()o is an element-wise nonlinear As convolution operations constitute the bulk of all computations + operator, x is the input vector, and M is the mn# matrix of in CNNs, simplifying the convolution layer would have a direct + parameters. When M is a large general dense matrix, the cost impact on the overall speedup. The convolution kernels in a typi- + of storing mn parameters and computing matrix-vector products cal CNN is a four-dimensional tensor. The key observation is that + in Om()n time. Thus, an intuitive way to prune parameters is to there might be a significant amount of redundancy in the tensor. + impose x as a parameterized structural matrix. An mn# matrix Ideas based on tensor decomposition seem to be a particularly + that can be described using much fewer parameters than mn is promising way to remove the redundancy. Regarding to the fully + called a structured matrix. Typically, the structure should not connected layer, it can be viewed as a two-dimensional (2-D) + only reduce the memory cost but also dramatically accelerate the matrix and the low-rankness can also help. + inference and training stage via fast matrix-vector multiplication Using low-rank filters to accelerate convolution has a long + and gradient computations. history. Typical examples include high-dimensional discrete + Following this direction, the work in [30] proposed a sim- cosine transform (DCT) and wavelet systems constructed + ple and efficient approach based on circulant projections, from one-dimensional (1-D) DCT transform and 1-D wave- + while maintaining competitive error rates. 
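The circulant construction mentioned just above (and defined in the text that follows) can be checked with a short script. The sketch below is not the authors' code; it assumes NumPy and SciPy are available (SciPy is used only to build the dense reference matrix) and verifies that multiplication by a circulant matrix built from a single d-vector r can be carried out with FFTs in O(d log d) time and O(d) memory.

    # Hedged sketch: circulant matrix-vector product via FFT vs. the dense O(d^2) product.
    import numpy as np
    from scipy.linalg import circulant   # only used to build the dense reference

    rng = np.random.default_rng(0)
    d = 512
    r = rng.standard_normal(d)
    x = rng.standard_normal(d)

    def circulant_matvec(r, x):
        # multiplication by circ(r) is a circular convolution, i.e. a product in Fourier space
        return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

    dense = circulant(r) @ x                 # O(d^2) reference
    fast = circulant_matvec(r, x)            # O(d log d)
    print("max difference:", float(np.max(np.abs(dense - fast))))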
Given a vector lets, respectively, using tensor products. In the context of + r=(,rr 01 ,,frd-1 ), a circulant matrix RR! dd# is defined as dictionary learning, Rigamonti et al. [35] suggested learning + separable 1-D filters. In [36], a few low-rank approximation Rr0 rd 1 g r VS - 2 r1 W and clustering schemes for the convolutional kernels were + Sr1 r0 rd 1 r W proposed. They achieved 2# speedup for a single convolu- + Rr (circ ): S - 2 + ==r WS h 1 r0 j h W. (1) tional layer with 1% drop in classification accuracy. The + Srd-2 j jrd-1 W work in [37] suggested using different tensor decomposition Sr WTd-1 rd-2 g r1 r0 X schemes, reporting a 45.# speedup with 1% drop in accuracy + + + IEEE SIgnal ProcESSIng MagazInE | January 2018 | 129 case. For the scheme in [39], the decom- + position always exists and can achieve + better performance than general CP. + Table 2 lists a performance comparison + of both methods. The actual speedup + and compression rates are used to mea- + sure the performances. We can see that + the BN version can achieve slightly bet- + ter performance while the CP version + gives higher compression rates. Original Framework Low-Rank Note that the fully connected layers Factorization Framework can be viewed as a 2-D matrix and thus + (a) (b) the aforementioned methods can also + be applied there. There are several clas- + sical works on exploiting low-rankness Figure 2. A typical framework of the low-rank regularization method. (a) is theoriginal convolutional + layer, and (b) is the low-rank constraint convolutional layer with rank-K. in fully connected layers. For instance, + Misha et al. [40] reduced the number + of dynamic parameters in deep models + in text recognition. In both works, the approximation was using the low-rank method. Reference [41] explored a low-rank + done layer by layer. After one layer was approximated by matrix factorization of the final weight layer in a DNN for + the low-rank filters, the parameters of that layer were fixed, acoustic modeling. + and the layers above were fine-tuned based on a reconstruc- + tion error criterion. These are typical low-rank methods for Drawbacks + compressing 2-D convolutional layers, which is described in Low-rank approaches are straightforward for model compres- + Figure 2. In [38], canonical polyadic (CP) decomposition of sion and acceleration. The idea complements recent advances + the kernel tensors was proposed. Their work used nonlinear in deep learning such as dropout, rectified units, and maxout. + least squares to compute the CP decomposition, which was However, the implementation is not that easy since it involves + also based on the tensor decomposition idea. In [39], a new a decomposition operation, which is computationally expen- + algorithm for computing the low-rank tensor decomposition sive. Another issue is that current methods perform low-rank + and a new method for training low-rank constrained CNNs approximation layer by layer, and thus cannot perform global + from scratch were proposed. It used batch normalization (BN) parameter compression, which is important as different lay- + to transform the activations of the internal hidden units, and it ers hold different information. Finally, factorization requires + was shown to be an effective way to deal with the exploding extensive model retraining to achieve convergence when com- + or vanishing gradients. pared to the original model. 
+ In principle, both the CP decomposition scheme and the + decomposition scheme in [39] (BN low-rank) can be used to Transferred/compact convolutional filters + train CNNs from scratch. For the CP decomposition, finding CNNs are parameter-efficient due to exploring the transla- + the best low-rank approximation is an ill-posed problem, and tion invariant property of the representations to input image, + the best rank-K approximation may not exist in the general which is the key to the success of training very deep models + without severe overfitting. Although a strong theory is cur- + rently missing, a large amount of empirical evidence sup- + ports the notion that both the translation invariant property Table 2. Comparisons between the low-rank models and their baselines + on ILSVRC-2012. and convolutional weight-sharing are important for good + predictive performance. The idea of using transferred con-Model TOP-5 Accuracy Speedup Compression Rate volutional filters to compress CNN models is motivated by + AlexNet 80.03% 1 1 recent works in [42], which introduced the equivariant group + BN low-rank 80.56% 1.09 4.94 theory. Let x be an input, U()$ be a network or layer, and + T()$ be the transform matrix. The concept of equivariance CP low-rank 79.66% 1.82 5 is defined as VGG-16 90.60% 1 1 + BN low-rank 90.47% 1.53 2.72 TTlUU ^^ xx hh = , (3) + CP low-rank 90.31% 2.05 2.75 + GoogleNet 92.21% 1 1 which says that transforming the input x by the transform + T()$ and then passing it through the network or layer U(·) BN low-rank 91.88% 1.08 2.79 should give the same result as first mapping x through the CP low-rank 91.79% 1.20 2.84 network and then transforming the representation. Note that, + + + 130 IEEE SIgnal ProcESSIng MagazInE | January 2018 | in [42], the transforms T()$ and Tl()$ are not necessarily where Tx(·,,y) denoted the translation of the first oper- + the same as they operate on different objects. According to and by (,xy) along its spatial dimensions, with proper zero + this theory, it is reasonable to apply the transform to layers padding at borders to maintain the shape. The proposed + or filters U()$ to compress the whole network models. From framework can be used to 1) improve the classification accu- + empirical observation, deep CNNs also benefit from using a racy as a regularized version of maxout networks and 2) + large set of convolutional filters by applying a certain trans- to achieve parameter efficiency by flexibly varying their + form T()$ to a small set of base filters since it acts as a regu- architectures to compress networks. + larizer for the model. Table 3 briefly compares the performance of different + Following this trend, there are many recent works proposed methods with transferred convolutional filters, using VGG- + to build a convolutional layer from a set of base filters [42]– Net (16 layers) as the baseline model. The results are report- + [45]. What they have in common is that the transform T()$ ed on the CIFAR-10 and CIFAR-100 data sets with top-five + lies in the family of functions that only operate in the spatial error rates. It is observed that they can achieve reduction in + domain of the convolutional filters. For parameters with little or no drop in clas- + example, the work in [44] found that the sification accuracy. 
+ lower convolution layers of CNNs learned The basic idea of KD is to + redundant filters to extract both positive and distill knowledge from a Drawbacks + negative phase information of an input sig- large teacher model into There are several issues that need to be + nal, and defined T()$ to be the simple nega- a small one by learning addressed for approaches that apply transfer + tion function the class distributions information to convolutional filters. First, + output by the teacher these methods can achieve competitive per- + T^h WW x = -x . (4) formance for wide/flat architectures (like via softened softmax. VGGNet) but not narrow/special ones (like + Here, Wx is the basis convolutional filter GoogleNet and ResNet). Second, the trans- + and W-x is the filter consisting of the shifts whose activation is fer assumptions sometimes are too strong to guide the algo- + opposite to that of Wx and selected after max-pooling opera- rithm, making the results unstable on some data sets. + tion. By doing this, the work in [44] can easily achieve 2# com- Using a compact filter for convolution can directly reduce + pression rate on all the convolutional layers. It is also shown that the computation cost. The key idea is to replace the loose and + the negation transform acts as a strong regularizer to improve overparametric filters with compact blocks to improve the + the classification accuracy. The intuition is that the learning speed, which significantly accelerate CNNs on several bench- + algorithm with pair-wise positive-negative constraint can lead marks. Decomposing 33# convolution into two 11# con- + to useful convolutional filters instead of redundant ones. volutions was used in [47], which achieved state-of-the-art + In [45], it was observed that magnitudes of the responses acceleration performance on object recognition. SqueezeNet + from convolutional kernels had a wide diversity of pattern rep- [48] was proposed to replace 33# convolution with 11# + resentations in the network, and it was not proper to discard convolution, which created a compact neural network with + weaker signals with a single threshold. Thus, a multibias non- approximately 50 fewer parameters and comparable accuracy + linearity activation function was proposed to generate more when compared to AlexNet. + patterns in the feature space at low computational cost. The + transform T()$ was define as KD + To the best of our knowledge, exploiting knowledge transfer to + TlU^h xW=+ x d , (5) compress model was first proposed by Caruana et al. [49]. They + trained a compressed model with pseudo-data labeled by an + where d were the multibias factors. The work in [46] consid- ensemble of strong classifiers and reproduced the output of the + ered a combination of rotation by a multiple of 90° and hori- original larger network. However, their work is limited to shal- + zontal/vertical flipping with low models. The idea has been recently adopted in [50] as KD + to compress deep and wide networks into shallower ones, where + TlU^h xW= Ti , (6) + Table 3. Comparisons of different approaches based on transferred where WTi was the transformation matrix that rotated the orig- convolutional filters on CIFAR-10 and CIFAR-100. + inal filters with angle i !{90,,}180270. In [42], the transform Model CIFAR-100 CIFAR-10 Compression Rate was generalized to any angle learned from data, and i was + directly obtained from data. Both [46] and [42] can achieve VGG-16 34.26% 9.85% 1 + good classification performance. 
MBA [45] 33.66% 9.76% 2 + Reference [43] defined T()$ as the set of translation func- CRELU [44] 34.57% 9.92% 2 + tions applied to 2-D filters CIRC [42] 35.15% 10.23% 4 + T lU^^ xhh =Tx·,,y , (7) DCNN [43] 33.57% 9.65% 1.62 xy,,!" -kkf,, ,^ xy,( h !00,) + + + IEEE SIgnal ProcESSIng MagazInE | January 2018 | 131 the compressed model mimicked the function learned by the Other types of approaches + complex model. The basic idea of KD is to distill knowledge We first summarize the works utilizing attention-based + from a large teacher model into a small one by learning the methods. Note that attention-based systems [57] can reduce + class distributions output by the teacher via softened softmax. computations significantly by learning to selectively focus or + The work in [51] introduced a KD compression framework, “attend to” a few, task-relevant input regions. The work in [57] + which eased the training of deep networks by following a student- introduced the dynamic capacity network that combined two + teacher paradigm, in which the student was penalized according types of modules: the small subnetworks with low capacity, and + to a softened version of the teacher’s output. The framework the large ones with high capacity. The low-capacity subnetworks + compressed an ensemble of deep networks (teacher) into a stu- were active on the whole input to first find the task-relevant areas + dent network of similar depth. To do so, the student was trained in the input, and then the attention mechanism was used to di- + to predict the output of the teacher, as well as the true classifica- rect the high-capacity subnetworks to focus on the task-relevant + tion labels. Despite its simplicity, KD demonstrates promising regions in the input. By doing this, the size of the CNN model + results in various image classification tasks. The work in [52] could be significantly reduced. + aimed to address the network compression Following this direction, the work in + problem by taking advantage of depth neural The standard criteria [58] introduced the conditional computation + networks. It proposed an approach to train to measure the quality idea, which only computes the gradient for + thin and deep networks, called FitNets, to of model compression some important neurons. It proposed a new + compress wide and shallower (but still deep) and acceleration are the type of general-purpose neural network com- + networks. The method was rooted in KD and ponent: a sparsely gated mixture-of-experts + extended the idea to allow for thinner and compression and the (MoE) layer. The MoE consisted of a number + deeper student models. To learn from the speedup rates. of experts, each a simple feed-forward neural + intermediate representations of the teacher network, and a trainable gating network that + network, FitNet made the student mimic the full feature maps of selected a sparse combination of the experts to process each input. + the teacher. However, such assumptions are too strict since the In [59], dynamic DNNs (D2NNs) were introduced, which were a + capacities of teacher and student may differ greatly. In certain type of feed-forward DNN that selected and executed a subset of + circumstances, FitNet may adversely affect the performance and D2NN neurons based on the input. + convergence. 
All the aforementioned methods are validated on There have been other attempts to reduce the number of + the MNIST, CIFAR-10, CIFAR-100, SVHN, and AFLW bench- parameters of neural networks by replacing the fully con- + mark data sets, and simulation results show that these methods nected layer with global average pooling [43], [60]. Network + match or outperform the teacher’s performance, while requiring architectures, such as GoogleNet or network in network, + notably fewer parameters and multiplications. can achieve state-of-the-art results on several benchmarks + There are several extensions along this direction of distilla- by adopting this idea. However, transfer learning, i.e., reus- + tion knowledge. The work in [53] trained a parametric student ing features learned on the ImageNet data set and applying + model to approximate a Monte Carlo teacher. The proposed them to new tasks, is more difficult with this approach. This + framework used online training and used DNNs for the student problem was noted by Szegedy et al. [60] and motivated + model. Different from previous works, which represented the them to add a linear layer on top of their networks to enable + knowledge using the softened label probabilities, [54] repre- transfer learning. + sented the knowledge by using the neurons in the higher hidden The work in [61] targeted the ResNet-based model with a + layer, which preserved as much information as the label prob- spatially varying computation time, called stochastic depth, + abilities, but are more compact. The work in [55] accelerated which enabled the seemingly contradictory setup to train short + the experimentation process by instantaneously transferring networks and used deep networks at test time. It started with + the knowledge from a previous network to each new deeper very deep networks and, while during training, for each mini- + or wider network. The techniques are based on the concept batch, randomly dropped a subset of layers and bypassed them + of function-preserving transformations between neural net- with the identity function. This model is end-to-end trainable, + work specifications. Zagoruyko et al. [56] proposed attention deterministic, and can be viewed as a black-box feature extrac- + transfer to relax the assumption of FitNet. They transferred the tor. Following this direction, the work in [62] proposed a pyra- + attention maps that are summaries of the full activations. midal residual network with stochastic depth. + Other approaches to reduce the convolutional overheads + Drawbacks include using FFT-based convolutions [63] and fast convolution + KD-based approaches can make deeper models thinner and using the Winograd algorithm [64]. Those works only aim to + help significantly reduce the computational cost. However, speedup the computation but not reduce the memory storage. + there are a few disadvantages. One of them is that KD can only + be applied to classification tasks with softmax loss function, Benchmarks, evaluation, and databases + which hinders its usage. Another drawback is that the model In the past five years, the deep-learning community has made + assumptions sometimes are too strict to make the performance great efforts in benchmark models. One of the most well- + competitive with other types of approaches. 
known models used in compression and acceleration for CNNs + + + 132 IEEE SIgnal ProcESSIng MagazInE | January 2018 | is Alexnet [1], which occasionally has been Proposing some general/ about how to choose different compression + used for assessing the performance of com- unified approaches is approaches and possible challenges/solu- + pression. Other popular standard models one direction that can tions in this area. + include LeNets [65], All-CNN-nets [66], be taken regarding and many others. LeNet-300-100 is a fully General suggestions + connected network with two hidden layers, the use of CNNs in There is no golden rule to measure which one + with 300 and 100 neurons each. LeNet-5 is small platforms. of the four kinds of approaches is the best. How + a convolutional network that has two convo- to choose the proper approaches is really de- + lutional layers and two fully connected layers. Recently, more pendent on the applications and requirements. Here, we provide + state-of-the-art architectures are used as baseline models in some general suggestions. + many works, including network in networks [67], VGGNets ■ If the applications needs compacted models from pretrained + [68], and ResNets [69]. Table 4 summarizes the baseline mod- models, one can choose either pruning and sharing or low- + els commonly used in several typical compression methods. rank factorization-based methods. If end-to-end solutions + The standard criteria to measure the quality of model com- are needed for the problem, the low-rank and transferred + pression and acceleration are the compression and the speedup convolutional filters approaches are preferred. + rates. Assume that a is the number of the parameters in the ■ For applications in some specific domains, methods with + original model M and a* is that of the compressed model M* , human prior (like the transferred convolutional filters and + then the compression rate a (,MM * ) of M* over M is structural matrix) sometimes have benefits. For example, + when conducting medical images classification, transferred + MM,.aa ^h * = (8)a convolutional filters should work well as medical images * (like organs) do have the rotation transformation property. + Another widely used measurement is the index space saving ■ Usually, the approaches of pruning and sharing could give + defined in several papers [70], [71] as a reasonable compression rate while not hurting the accu- + racy. Thus, for applications that require stable model accu- + MM,,aa b * = -^h * (9)a racy, it is better to utilize pruning and sharing. * + ■ If a problem involves small- or medium-size data sets, one + where a and a are the number of the dimension of the index can try the KD approaches. The compressed student model + space in the original model and that of the compressed can take the benefit of transferring knowledge from the + model, respectively. teacher model, making it a robust data set that is not large. + Similarly, given the running time s of M and s* of M*, the ■ As we mentioned in the “Introduction,” techniques of the + speedup rate d (,MM * ) is defined as four themes are orthogonal. It makes sense to combine two + or three of them to maximize the compression/speedup + MM,.sd ^h * =s (10) rates. 
Discussion and challenges
In this article, we summarized recent works on compressing and accelerating DNNs. Here, we discuss more details about how to choose different compression approaches and possible challenges/solutions in this area.

General suggestions
There is no golden rule to measure which one of the four kinds of approaches is the best. How to choose the proper approach really depends on the applications and requirements. Here, we provide some general suggestions.
■ If the application needs compact models derived from pretrained models, one can choose either pruning and sharing or low-rank factorization-based methods. If end-to-end solutions are needed for the problem, the low-rank and transferred convolutional filters approaches are preferred.
■ For applications in some specific domains, methods with human prior (like the transferred convolutional filters and the structural matrix) sometimes have benefits. For example, when conducting medical image classification, transferred convolutional filters should work well, as medical images (like organs) do have the rotation transformation property.
■ Usually, the approaches of pruning and sharing can give a reasonable compression rate while not hurting the accuracy. Thus, for applications that require stable model accuracy, it is better to utilize pruning and sharing.
■ If a problem involves small- or medium-size data sets, one can try the KD approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust on data sets that are not large.
■ As we mentioned in the "Introduction," techniques of the four themes are orthogonal. It makes sense to combine two or three of them to maximize the compression/speedup rates. For some specific applications, like object detection, which require both convolutional and fully connected layers, one can compress the convolutional layers with low-rank factorization and the fully connected layers with a pruning method.

Technique challenges
Techniques for deep model compression and acceleration are still in the early stages, and the following challenges still need to be addressed.
■ Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyperparameters). To handle more complicated tasks, more plausible ways to configure the compressed models are needed.
■ Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging, because removing channels might dramatically change the input of the following layer. It is important to focus on how to address this issue.
■ As we mentioned previously, methods of the structural matrix and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of the imposed prior knowledge.
■ The methods of KD provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worth developing KD-based approaches and exploring how to improve their performance.
■ Hardware constraints in various small platforms (e.g., mobile, robotic, self-driving cars) are still a major problem that hinders the extension of deep CNNs. How to make full use of the limited computational resources available and how to design special compression methods for such platforms are still challenges that need to be addressed.
Possible solutions
To solve the hyperparameters configuration problem, we can rely on the recent learning-to-learn strategy [72], [73]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. There are two different ways to combine the learning-to-learn module with model compression. The first designs compression and learning-to-learn simultaneously, while the second first configures the model with learning-to-learn and then prunes the parameters.
Channel pruning provides the efficiency benefit on both CPUs and GPUs because no special implementation is required. But it is also challenging to handle the input configuration. One possible solution is to use the training-based channel pruning methods [74], which focus on imposing sparse constraints on weights during training and can adaptively determine hyperparameters. However, training from scratch for such a method is costly for very deep CNNs.
Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the KD approaches. Instead of directly reducing and transferring parameters from the teacher models, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task. The intuition is that, if a neuron is activated in certain regions or samples, this implies these regions or samples share some common properties that may relate to the task. Performing such steps is time-consuming, so efficient implementation is important.
For methods with convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2-D filters or the matrix, and 2) learn the transformation jointly with all of the model parameters.
Proposing some general/unified approaches is one direction that can be taken regarding the use of CNNs in small platforms. Wang et al. [75] presented a feature map dimensionality reduction method by excavating and removing redundancy in feature maps generated by different filters, which could also preserve the intrinsic information of the original network. The idea can be extended to make CNNs more applicable to different platforms. The work in [76] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work on mobile devices. From the systematic side, Facebook released the platform Caffe2 [77], which employs a particularly lightweight and modular framework and includes mobile-specific optimizations based on the hardware design. Caffe2 can help developers and researchers train large machine-learning models and deliver AI on mobile devices.

Acknowledgments
We would like to thank the reviewers and the broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying this article. This research is supported by the National Science Foundation of China, grant number 61401169. The corresponding author of this article is Pan Zhou.

Authors
Yu Cheng (chengyu@us.ibm.com) received his bachelor's degree in automation from Tsinghua University, Beijing, China, in 2010 and his Ph.D. degree in computer science from Northwestern University, Evanston, Illinois, in 2015. Currently, he is a research staff member at the AI Foundations Lab, IBM T.J. Watson Research Center, Yorktown Heights, New York. His research is focused on deep learning in general, with specific interests in deep generative models and deep model compression. He has also published many works regarding the applications of deep learning in computer vision and natural language processing.
Duo Wang (d-wang15@mails.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree in the Department of Automation, Tsinghua University. His research interests are deep/machine learning and their applications in computer vision and robotics vision.
Pan Zhou (panzhou@hust.edu.cn) received his B.S. degree in the Advanced Class of Huazhong University of Science and Technology (HUST), Wuhan, China, and his M.S. degree in electronics and information engineering from the same university in 2006 and 2008, respectively. He received his Ph.D. degree from the School of Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta, in 2011. Currently, he is an associate professor with the School of Electronic Information and Communications, HUST. His research interests include big data analytics and machine learning, security and privacy, and information networks.
Tao Zhang (taozhang@mail.tsinghua.edu.cn) received his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and his Ph.D. degree from Saga University, Japan, in 2002, all in control engineering. He is a professor with the Department of Automation, Tsinghua University. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.
References
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2014, pp. 1701–1708.
[3] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015, pp. 2892–2900.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1223–1231.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Computing Res. Repository, vol. abs/1512.03385, 2015. [Online]. Available: https://arxiv.org/pdf/1512.03385.pdf
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," Computing Res. Repository, vol. abs/1412.6115, 2014. [Online]. Available: https://arxiv.org/pdf/1412.6115.pdf
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4820–4828.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Conf. Neural Information Processing Systems Deep Learning and Unsupervised Feature Learning Workshop, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. 32nd Int. Conf. Machine Learning, 2015, vol. 37, pp. 1737–1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. Int. Conf. Learning Representations, 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," Computing Res. Repository, vol. abs/1612.01543, 2016. [Online]. Available: https://arxiv.org/abs/1612.01543
[12] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv Preprint, arXiv:1612.01064, 2016.
[13] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Proc. Advances Neural Information Processing Systems Annu. Conf., 2015, pp. 3123–3131.
[14] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," Computing Res. Repository, vol. abs/1602.02830, 2016. [Online]. Available: https://arxiv.org/abs/1602.02830
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in Proc. European Conf. Computer Vision, 2016, pp. 525–542.
[16] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," Computing Res. Repository, vol. abs/1606.01981, 2016. [Online]. Available: https://arxiv.org/abs/1606.01981
[17] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," Computing Res. Repository, vol. abs/1611.01600, 2016. [Online]. Available: https://arxiv.org/abs/1611.01600
[18] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," Computing Res. Repository, vol. abs/1510.03009, 2015. [Online]. Available: https://arxiv.org/abs/1510.03009
[19] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in Advances in Neural Information Processing Systems 1, 1989, pp. 177–185.
[20] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 598–605.
[21] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, vol. 5. San Mateo, CA: Morgan Kaufmann, 1993, pp. 164–171.
[22] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," in Proc. British Machine Vision Conf., 2015, pp. 31.1–31.12.
[23] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proc. 28th Int. Conf. Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Machine Learning Research Workshop Conf., 2015, pp. 2285–2294.
[25] K. Ullrich, E. Meeds, and M. Welling, "Soft weight-sharing for neural network compression," Computing Res. Repository, vol. abs/1702.04008, 2017. [Online]. Available: https://arxiv.org/abs/1702.04008
[26] V. Lebedev and V. S. Lempitsky, "Fast convnets using group-wise brain damage," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2554–2564.
[27] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in Proc. European Conf. Computer Vision, 2016, pp. 662–677.
[28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," Adv. Neural Inform. Process. Syst., vol. 29, pp. 2074–2082, 2016.
[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," Computing Res. Repository, vol. abs/1608.08710, 2016. [Online]. Available: https://arxiv.org/abs/1608.08710
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. Int. Conf. Computer Vision, 2015, pp. 2857–2865.
[31] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, "Deep fried convnets," in Proc. Int. Conf. Computer Vision, 2015, pp. 1476–1483.
[32] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Advances in Neural Information Processing Systems, vol. 28, 2015, pp. 3088–3096. [Online]. Available: http://papers.nips.cc/paper/5869-structured-transforms-for-small-footprint-deep-learning.pdf
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-Block, and Toeplitz-Derived Matrices. Berlin, Germany: Springer, 1991, pp. 215–236.
[34] M. V. Rakhuba and I. V. Oseledets, "Fast multidimensional convolution in low-rank tensor formats via cross approximation," SIAM J. Sci. Comput., vol. 37, no. 2, 2015. [Online]. Available: http://dx.doi.org/10.1137/140958529
[35] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2013, pp. 2754–2761.
[36] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," Adv. Neural Inform. Process. Syst., vol. 27, pp. 1269–1277, 2014.
[37] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proc. British Machine Vision Conf., 2014, pp. 1–13.
[38] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," Computing Res. Repository, vol. abs/1412.6553, 2014. [Online]. Available: https://arxiv.org/abs/1412.6553
[39] C. Tai, T. Xiao, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," Computing Res. Repository, vol. abs/1511.06067, 2015.
[40] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems, vol. 26, 2013, pp. 2148–2156. [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1053.pdf
[41] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, 2013, pp. 6655–6659.
[42] T. S. Cohen and M. Welling, "Group equivariant convolutional networks," arXiv Preprint, arXiv:1602.07576, 2016.
[43] S. Zhai, Y. Cheng, and Z. M. Zhang, "Doubly convolutional neural networks," in Proc. Advances Neural Information Processing Systems, 2016, pp. 1082–1090.
[44] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv Preprint, arXiv:1603.05201, 2016.
[45] H. Li, W. Ouyang, and X. Wang, "Multi-bias non-linear activation in deep neural networks," arXiv Preprint, arXiv:1604.00676, 2016.
[46] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, "Exploiting cyclic symmetry in convolutional neural networks," in Proc. 33rd Int. Conf. Machine Learning, 2016, vol. 48, pp. 1889–1898.
[47] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," Computing Res. Repository, vol. abs/1602.07261, 2016. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1602.html#SzegedyIV16
[48] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," Computing Res. Repository, vol. abs/1612.01051, 2016. [Online]. Available: https://arxiv.org/abs/1612.01051
[49] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, 2006, pp. 535–541. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150464
[50] J. Ba and R. Caruana, "Do deep nets really need to be deep?" Adv. Neural Inform. Process. Syst., vol. 27, pp. 2654–2662, 2014.
[51] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Computing Res. Repository, vol. abs/1503.02531, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," Computing Res. Repository, vol. abs/1412.6550, 2014. [Online]. Available: https://arxiv.org/abs/1412.6550
[53] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," in Advances in Neural Information Processing Systems, vol. 28, 2015, pp. 3420–3428. [Online]. Available: http://papers.nips.cc/paper/5965-bayesian-dark-knowledge.pdf
[54] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, "Face model compression by distilling knowledge from neurons," in Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 3560–3566.
[55] T. Chen, I. J. Goodfellow, and J. Shlens, "Net2net: Accelerating learning via knowledge transfer," Computing Res. Repository, vol. abs/1511.05641, 2015. [Online]. Available: https://arxiv.org/abs/1511.05641
[56] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," Computing Res. Repository, vol. abs/1612.03928, 2016. [Online]. Available: http://arxiv.org/abs/1612.03928
[57] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, "Dynamic capacity networks," in Proc. 33rd Int. Conf. Machine Learning, 2016, pp. 2549–2558.
[58] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," 2017. [Online]. Available: https://openreview.net/pdf?id=B1ckMDqlg
[59] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, 2016.
[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
[61] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," Computing Res. Repository, vol. abs/1603.09382, 2016.
[62] Y. Yamada, M. Iwamura, and K. Kise, "Deep pyramidal residual networks with separated stochastic depth," Computing Res. Repository, vol. abs/1612.01230, 2016. [Online]. Available: http://arxiv.org/abs/1612.01230
[63] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," Computing Res. Repository, vol. abs/1312.5851, 2014.
[64] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4013–4021.
[65] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, pp. 2278–2324, 1998.
[66] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, "Striving for simplicity: The all convolutional net," Computing Res. Repository, vol. abs/1412.6806, 2014. [Online]. Available: https://arxiv.org/abs/1412.6806
[67] M. Lin, Q. Chen, and S. Yan, "Network in network," in Proc. Int. Conf. Learning Representations, 2014. [Online]. Available: https://arxiv.org/abs/1312.4400
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Computing Res. Repository, vol. abs/1409.1556, 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[69] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv Preprint, arXiv:1512.03385, 2015.
[70] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2857–2865.
[71] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "ACDC: A structured efficient linear layer," in Proc. Int. Conf. Learning Representations, 2016.
[72] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Proc. Neural Information Processing Systems Conf., 2016, pp. 3981–3989.
[73] D. Ha, A. Dai, and Q. Le, "Hypernetworks," in Proc. Int. Conf. Learning Representations, 2016.
[74] J. M. Alvarez and M. Salzmann, "Learning the number of neurons in deep networks," in Proc. Neural Information Processing Systems Conf., 2016, pp. 2270–2278.
[75] Y. Wang, C. Xu, C. Xu, and D. Tao, "Beyond filters: Compact feature map for portable deep model," in Proc. 34th Int. Conf. Machine Learning, 2017, pp. 3703–3711.
[76] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," Computing Res. Repository, vol. abs/1511.06530, 2015. [Online]. Available: https://arxiv.org/abs/1511.06530
[77] Facebook, Inc., "Caffe2: A new lightweight, modular, and scalable deep learning framework," 2016. [Online]. Available: https://caffe2.ai/
\ No newline at end of file
diff --git a/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt b/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt
new file mode 100644
index 0000000..47f9152
--- /dev/null
+++ b/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt
@@ -0,0 +1,662 @@
Movement Pruning:
Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu

arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields.
In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training the models has high environmental costs [Strubell et al., 2019].
Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost of accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective method for compressing models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016], in language processing [Gale et al., 2019], and more recently has been leveraged as a core component in the lottery ticket hypothesis [Frankle et al., 2019].
While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."
In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and high values can be pruned if they shrink during training. This strategy moves the selection criteria from the 0th to the 1st order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing greater ability to adapt to the end-task.

2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods for using parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied for convolutional networks.
Differing from our method, these methods keep the weights of the model fixed (either from a randomly initialized network or a pretrained network) and the scores are updated to find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the square value of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.
Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates at the cost of no or little performance. As shown in previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.

3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let W \in \mathbb{R}^{n \times n} refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores S \in \mathbb{R}^{n \times n}. Given importance scores, each pruning strategy computes a mask M \in \{0, 1\}^{n \times n}. Inference for an input x becomes a = (W \odot M) x, where \odot is the Hadamard product. A common strategy is to keep the top-v percent of weights by importance. We define Top_v as a function which selects the v% highest values in S:

    Top_v(S)_{i,j} = 1 if S_{i,j} is among the top v% of values in S, and 0 otherwise.    (1)

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores S = (|W_{i,j}|)_{1 \le i,j \le n} and masks M = Top_v(S) (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.
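To make the notation above concrete, here is a small Python/PyTorch sketch of the Top_v mask of Eq. (1) combined with magnitude scores S = |W|; this is our own illustration of the setup, not the authors' code, and ties at the threshold may keep slightly more than v% of the weights.

```python
import torch

def top_v_mask(scores: torch.Tensor, v: float) -> torch.Tensor:
    """Binary mask keeping the top v% of entries of `scores` (Eq. 1)."""
    k = max(1, int(round(v / 100.0 * scores.numel())))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).to(scores.dtype)

def magnitude_prune(weight: torch.Tensor, v: float) -> torch.Tensor:
    """Magnitude pruning: importance scores are |W|; the weight matrix is masked element-wise."""
    mask = top_v_mask(weight.abs(), v)
    return weight * mask

W = torch.randn(4, 4)
print(magnitude_prune(W, v=25.0))  # keeps roughly the 4 largest-magnitude entries
```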
Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of f for L0 regularization is detailed in Eq (3). (Columns: Magnitude pruning | L0 regularization | Movement pruning | Soft movement pruning.)
Pruning Decision:    0th order | 1st order | 1st order | 1st order
Masking Function:    Top_v | Continuous Hard-Concrete | Top_v | Thresholding
Pruning Structure:   Local or Global | Global | Local or Global | Global
Learning Objective:  L | L + \lambda_{l0} E(L_0) | L | L + \lambda_{mvp} R(S)
Gradient Form:       n/a | Gumbel-Softmax | Straight-Through | Straight-Through
Scores S:            |W_{i,j}| | -\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)} f(\bar{S}_{i,j}^{(t)}) | -\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)} | -\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)}

In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level v during training using a cubic sparsity scheduler:

    v^{(t)} = v_f + (v_i - v_f) \left(1 - \frac{t - t_i}{n \Delta t}\right)^3.

The sparsity level at time step t, v^{(t)}, is increased from an initial value v_i (usually 0) to a final value v_f in n pruning steps after t_i steps of warm-up. The model is thus pruned and trained jointly.

4 Movement Pruning

Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running model. In this work, we focus on movement pruning methods where importance is derived from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process. We consider two versions of movement pruning: hard and soft.
For (hard) movement pruning, masks are computed using the Top_v function: M = Top_v(S). Unlike magnitude pruning, during training we learn both the weights W and their importance scores S. During the forward pass, we compute for all i, a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k.
Since the gradient of Top_v is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, Top_v is ignored and the gradient goes "straight through" to S. The approximation of the gradient of the loss L with respect to S_{i,j} is given by

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j    (2)

This implies that the scores of weights are updated, even if these weights are masked in the forward pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter v with a fixed global threshold value \tau that controls the binary mask. The mask is calculated as M = (S > \tau). In order to control the sparsity level, we add a regularization term R(S) = \lambda_{mvp} \sum_{i,j} \sigma(S_{i,j}), which encourages the importance scores to decrease over time.^1 The coefficient \lambda_{mvp} controls the penalty intensity and thus the sparsity level.
Finally, we note that these approaches yield a similar update to L0 regularization based pruning, another movement-based pruning approach [Louizos et al., 2017].
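Before turning to the details of the L0 relaxation, the following sketch shows one way the hard movement-pruning forward/backward pass described above could be written with a straight-through estimator in PyTorch; the class and parameter names are our own, and details such as local vs. global Top_v, the sparsity scheduler, and score initialization are omitted.

```python
import torch
import torch.nn.functional as F

class TopVStraightThrough(torch.autograd.Function):
    """Forward: binarize scores with Top_v. Backward: pass the gradient straight through to S."""
    @staticmethod
    def forward(ctx, scores: torch.Tensor, v: float) -> torch.Tensor:
        k = max(1, int(round(v / 100.0 * scores.numel())))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: Top_v is ignored in the backward pass (Eq. 2).
        return grad_output, None

class MovementPrunedLinear(torch.nn.Module):
    """Linear layer with learned weights W and importance scores S, masked by Top_v(S)."""
    def __init__(self, in_features: int, out_features: int, v: float = 10.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.scores = torch.nn.Parameter(torch.zeros(out_features, in_features))
        self.v = v  # percentage of weights kept

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = TopVStraightThrough.apply(self.scores, self.v)
        # a = (W * M) x; with the STE, dL/dS_{i,j} = (dL/da_i) * W_{i,j} * x_j as in Eq. (2).
        return F.linear(x, self.weight * mask)
```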
Instead of straight-through, L0 uses the hard-concrete distribution, where the mask M is sampled for all i, j with hyperparameters b > 0, l < 0, and r > 1:

    u \sim U(0, 1)
    \bar{S}_{i,j} = \sigma\big( (\log(u) - \log(1 - u) + S_{i,j}) / b \big)
    Z_{i,j} = (r - l) \bar{S}_{i,j} + l
    M_{i,j} = \min(1, \mathrm{ReLU}(Z_{i,j}))

The expected L0 norm has a closed form involving the parameters of the hard-concrete: E(L_0) = \sum_{i,j} \sigma\big(S_{i,j} - b \log(-l/r)\big). Thus, the weights and scores of the model can be optimized in an end-to-end fashion to minimize the sum of the training loss L and the expected L0 penalty. A coefficient \lambda_{l0} controls the L0 penalty and indirectly the sparsity level. Gradients take a similar form:

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j f(\bar{S}_{i,j}), \quad \text{where } f(\bar{S}_{i,j}) = \frac{r - l}{b} \bar{S}_{i,j} (1 - \bar{S}_{i,j}) \mathbf{1}_{\{0 \le Z_{i,j} \le 1\}}    (3)

At test time, a non-stochastic estimation of the mask is used: \hat{M} = \min\big(1, \mathrm{ReLU}\big((r - l)\sigma(S) + l\big)\big), and weights multiplied by 0 can simply be discarded.
Table 1 highlights the characteristics of each pruning method. The main differences are in the masking functions, pruning structure, and the final gradient form.

^1 We also experimented with \sum_{i,j} |S_{i,j}|, but it turned out to be harder to tune while giving similar results.

[Figure 1: (a) Magnitude pruning. (b) Movement pruning. During fine-tuning (on MNLI), the weights stay close to their pre-trained values, which limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning selects weights that are moving away from 0.]

Method Interpretation. In movement pruning, the gradient of L with respect to W_{i,j} is given by the standard gradient derivation: \partial L / \partial W_{i,j} = (\partial L / \partial a_i) M_{i,j} x_j. Combining it with Eq (2), we have \partial L / \partial S_{i,j} = (\partial L / \partial W_{i,j}) W_{i,j} (we omit the binary mask term M_{i,j} for simplicity). From the gradient update in Eq (2), S_{i,j} is increasing when \partial L / \partial S_{i,j} < 0, which happens in two cases:
(a) \partial L / \partial W_{i,j} < 0 and W_{i,j} > 0
(b) \partial L / \partial W_{i,j} > 0 and W_{i,j} < 0
It means that during training W_{i,j} is increasing while being positive, or is decreasing while being negative. It is equivalent to saying that S_{i,j} is increasing when W_{i,j} is moving away from 0. Inversely, S_{i,j} is decreasing when \partial L / \partial S_{i,j} > 0, which means that W_{i,j} is shrinking towards 0.
While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 (|W_{i,j}|), movement pruning selects the weights which are moving the most away from 0 (S_{i,j}). For this reason, magnitude pruning can be seen as a 0th-order method, whereas movement pruning is based on a 1st-order signal. In fact, S can be seen as an accumulator of movement: from equation (2), after T gradient updates, we have

    S_{i,j}^{(T)} = -\alpha_S \sum_{t < T} \left(\frac{\partial L}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}    (4)