diff --git a/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt b/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt
deleted file mode 100644
index 0c2f968..0000000
--- a/Corpus/A Survey of Model Compression and Acceleration for Deep Neural Networks - Cheng.txt
+++ /dev/null
@@ -1,555 +0,0 @@

IEEE Signal Processing Magazine, Special Issue on Deep Learning for Image Understanding (arXiv extended version)
arXiv:1710.09282v7 [cs.LG] 7 Feb 2019

A Survey of Model Compression and Acceleration for Deep Neural Networks

Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE

Abstract—Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey the recently developed techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, after which the other techniques are introduced. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then go through a few very recent, additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating the model performance, and recent benchmarking efforts. Finally, we conclude this paper and discuss remaining challenges and possible directions on this topic.

Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration.

I. INTRODUCTION

In recent years, deep neural networks have received a lot of attention, been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. It usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. Another example is the top face verification results on the Labeled Faces in the Wild (LFW) dataset, which were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model to get reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for real-time applications such as online learning and incremental learning. In addition, recent years witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle fundamental challenges in deploying deep learning systems to portable devices with limited resources (e.g., memory, CPU, energy, bandwidth). Efficient deep learning methods can have significant impacts on distributed systems, embedded devices, and FPGAs for artificial intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95MB of memory for storage and over 3.8 billion floating-point multiplications when processing an image. After discarding some redundant weights, the network still works as usual but saves more than 75% of parameters and 50% of computational time.
For devices like cell phones and FPGAs with only several megabytes of resources, how to compact the models used on them is also important.

Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent works on compressing and accelerating deep neural networks, which have attracted a lot of attention from the deep learning community and have already achieved significant progress in the past years.

We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of deep CNNs. The approaches based on transferred/compact convolutional filters design special structural convolutional filters to reduce the parameter space and save storage/computation. The knowledge distillation methods learn a distilled model and train a more compact neural network to reproduce the output of a larger network.

Yu Cheng is a Researcher at Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA. Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China. Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

In Table I, we briefly summarize these four types of methods.
TABLE I
SUMMARY OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION.

Theme name | Description | Applications | More details
Parameter pruning and sharing | Reducing redundant parameters which are not sensitive to the performance | Convolutional layer and fully connected layer | Robust to various settings, can achieve good performance, can support both training from scratch and pre-trained models
Low-rank factorization | Using matrix/tensor decomposition to estimate the informative parameters | Convolutional layer and fully connected layer | Standardized pipeline, easy to implement, can support both training from scratch and pre-trained models
Transferred/compact convolutional filters | Designing special structural convolutional filters to save parameters | Convolutional layer only | Algorithms are dependent on applications, usually achieve good performance, only support training from scratch
Knowledge distillation | Training a compact neural network with distilled knowledge of a large model | Convolutional layer and fully connected layer | Model performance is sensitive to applications and network structure, only supports training from scratch

Generally, the parameter pruning & sharing, low-rank factorization, and knowledge distillation approaches can be used in DNN models with both fully connected layers and convolutional layers, achieving comparable performance. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filter based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment, while parameter pruning & sharing uses different methods, such as vector quantization, binary coding, and sparse constraints, to perform the task; it generally takes several steps to achieve the goal.

Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained ones or trained from scratch, while the transferred/compact filter and knowledge distillation models can only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We will describe the details of each theme, their properties, strengths, and drawbacks in the following sections.

II. PARAMETER PRUNING AND SHARING

Early works showed that network pruning is effective in reducing the network complexity and addressing the over-fitting problem [6]. After researchers found that pruning, originally introduced to reduce the structure of neural networks and hence improve generalization, could also compress DNN models, it has been widely studied to remove parameters which are not crucial to the model performance. These techniques can be further classified into three sub-categories: quantization and binarization, parameter sharing, and structural matrix.
A. Quantization and Binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic-rounding based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.

The method proposed in [10] quantized the link weights using weight sharing and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved state-of-the-art performance among all parameter-quantization based methods. It was shown in [11] that the Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize the average Hessian-weighted quantization error for clustering network parameters.

Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compressed model.

The extreme case is a 1-bit representation of each weight, that is, binary weight neural networks. There are many works that directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [15] showed that networks trained with back propagation could be resilient to specific weight distortions, including binary weights.

Drawbacks: the accuracy of binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of such binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss. To address this issue, the work in [16] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [17] reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes.
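To make the scalar-quantization idea above concrete, here is a minimal Python/NumPy sketch of k-means weight sharing in the spirit of [6], [7], [10]: the weights of one layer are clustered, each weight is replaced by its cluster centroid, and only the small codebook plus low-bit indices would need to be stored. The layer shape, the number of clusters, and the plain Lloyd's-algorithm implementation are illustrative assumptions, not the exact procedure of any cited work.

import numpy as np

def kmeans_1d(values, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on scalar values (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = values[assign == j].mean()
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign

# Hypothetical fully connected weight matrix (shape chosen arbitrarily).
W = np.random.randn(256, 128).astype(np.float32)

k = 16  # 16 centroids -> 4-bit index per weight plus a small codebook
codebook, idx = kmeans_1d(W.ravel(), k)
W_quant = codebook[idx].reshape(W.shape)        # shared-weight reconstruction

orig_bits = W.size * 32
quant_bits = W.size * np.ceil(np.log2(k)) + k * 32
print("approx. storage ratio:", orig_bits / quant_bits)
print("mean abs. error      :", np.abs(W - W_quant).mean())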
B. Pruning and Sharing

Network pruning and sharing has been used both to reduce network complexity and to address the over-fitting issue. An early approach to pruning was the Biased Weight Decay [18]. The Optimal Brain Damage [19] and the Optimal Brain Surgeon [20] methods reduced the number of connections based on the Hessian of the loss function, and their work suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. The training procedure of those methods followed a train-from-scratch manner.

A recent trend in this direction is to prune redundant, non-informative weights in a pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among neurons, and proposed a data-free pruning method to remove redundant neurons. Han et al. [22] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [23] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [24], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re-)training procedure. The above pruning schemes typically produce connection pruning in CNNs.

There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are typically introduced in the optimization problem as l0- or l1-norm regularizers. The work in [25] imposed a group sparsity constraint on the convolutional filters to achieve structured brain damage, i.e., pruning entries of the convolution kernels in a group-wise fashion. In [26], a group-sparse regularizer on neurons was introduced during the training stage to learn compact CNNs with reduced filters. Wen et al. [27] added a structured sparsity regularizer on each layer to reduce trivial filters, channels, or even layers. In filter-level pruning, all the above works used l2,1-norm regularizers. The work in [28] used the l1-norm to select and prune unimportant filters.

Drawbacks: there are some potential issues with pruning and sharing. First, pruning with l1 or l2 regularization requires more iterations to converge than general training. In addition, all pruning criteria require manual setup of sensitivity for layers, which demands fine-tuning of the parameters and could be cumbersome for some applications.
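As a toy illustration of the pruning idea (not the specific criterion of any work cited above), the following sketch performs magnitude-based connection pruning on a single weight matrix: entries below a percentile threshold are zeroed, and in practice the remaining sparse connections would then be retrained. The 90% sparsity target and layer size are arbitrary assumptions.

import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude entries so that `sparsity` of W is zero."""
    threshold = np.percentile(np.abs(W), sparsity * 100.0)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(512, 512).astype(np.float32)   # hypothetical layer
W_pruned, mask = magnitude_prune(W, sparsity=0.9)

kept = mask.sum()
print("kept %d of %d weights (%.1f%%)" % (kept, W.size, 100.0 * kept / W.size))
# In practice the pruned network is retrained (fine-tuned) with the mask fixed,
# e.g. by multiplying gradients by `mask` after every update step.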
C. Designing Structural Matrix

In architectures that contain fully-connected layers, it is critical to explore the redundancy of parameters in these layers, which are often the bottleneck in terms of memory consumption. These network layers use the nonlinear transform f(x; M) = \sigma(Mx), where \sigma(\cdot) is an element-wise nonlinear operator, x is the input vector, and M is the m \times n matrix of parameters [29]. When M is a large general dense matrix, the cost is mn parameters for storage and O(mn) time for matrix-vector products. Thus, an intuitive way to prune parameters is to impose a parameterized structural matrix: an m \times n matrix that can be described using far fewer than mn parameters is called a structured matrix. Typically, the structure should not only reduce the memory cost, but also dramatically accelerate the inference and training stages via fast matrix-vector multiplication and gradient computations.

Following this direction, the work in [30], [31] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector r = (r_0, r_1, \dots, r_{d-1}), a circulant matrix R \in \mathbb{R}^{d \times d} is defined as

R = \mathrm{circ}(r) := \begin{bmatrix}
r_0 & r_{d-1} & \cdots & r_2 & r_1 \\
r_1 & r_0 & r_{d-1} & & r_2 \\
\vdots & r_1 & r_0 & \ddots & \vdots \\
r_{d-2} & & \ddots & \ddots & r_{d-1} \\
r_{d-1} & r_{d-2} & \cdots & r_1 & r_0
\end{bmatrix}.   (1)

Thus the memory cost becomes O(d) instead of O(d^2). This circulant structure also enables the use of the Fast Fourier Transform (FFT) to speed up the computation: given a d-dimensional vector r, the 1-layer circulant neural network in Eq. (1) has time complexity O(d log d).

In [32], a novel Adaptive Fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform matrix R \in \mathbb{R}^{n \times d} was defined as

R = S H G \Pi H B,   (2)

where S, G and B are random diagonal matrices, \Pi \in \{0,1\}^{d \times d} is a random permutation matrix, and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the Adaptive Fastfood transform reduces the storage and the computational costs from O(nd) to O(n) and from O(nd) to O(n log d), respectively.

The work in [29] showed the effectiveness of the new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multi-level Toeplitz-like matrices [33] related to multi-dimensional convolution [34]. Following this idea, [35] proposed a general structured efficient linear layer for CNNs.

Drawbacks: one problem with this kind of approach is that the structural constraint can hurt the performance, since the constraint might introduce bias into the model. On the other hand, finding a proper structural matrix is difficult; there is no theoretical way to derive one.
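The computational benefit of the circulant structure in Eq. (1) comes from the fact that multiplying by a circulant matrix is a circular convolution, which the FFT evaluates in O(d log d) time while only the d-dimensional vector r needs to be stored. The sketch below simply verifies this identity numerically; it is an illustration of the underlying mathematics, not code from [30], [31].

import numpy as np

d = 8
rng = np.random.default_rng(0)
r = rng.standard_normal(d)      # defining vector of the circulant matrix
x = rng.standard_normal(d)      # input activation vector

# Explicit circulant matrix R with R[i, j] = r[(i - j) mod d], as in Eq. (1).
idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
R = r[idx]

# O(d^2) dense product versus O(d log d) FFT-based product.
y_dense = R @ x
y_fft = np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

print(np.allclose(y_dense, y_fft))   # True: same projection, far cheaper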
III. LOW-RANK FACTORIZATION AND SPARSITY

Convolution operations contribute the bulk of most computations in deep CNNs, so reducing the convolution layers would improve the compression rate as well as the overall speedup. A convolution kernel can be viewed as a 4D tensor, and ideas based on tensor decomposition derive from the intuition that there is a significant amount of redundancy in that 4D tensor, which makes decomposition a particularly promising way to remove the redundancy. The fully-connected layer can be viewed as a 2D matrix, and low-rankness can also help there.

Low-rank filters have long been used to accelerate convolution; for example, high-dimensional DCT (discrete cosine transform) and wavelet systems are constructed from 1D DCT transforms and 1D wavelets, respectively, using tensor products. Learning separable 1D filters was introduced by Rigamonti et al. [36], following the dictionary learning idea. For some simple DNN models, a few low-rank approximation and clustering schemes for the convolutional kernels were proposed in [37]; they achieved 2x speedup for a single convolutional layer with a 1% drop in classification accuracy. The work in [38] proposed using different tensor decomposition schemes, reporting a 4.5x speedup with a 1% drop in accuracy in text recognition.

The low-rank approximation was done layer by layer: the parameters of one layer were fixed after it was processed, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2D convolutional layers, as described in Figure 2. Following this direction, Canonical Polyadic (CP) decomposition of the kernel tensors was proposed in [39]; their work used nonlinear least squares to compute the CP decomposition. In [40], a new algorithm for computing the low-rank tensor decomposition for training low-rank constrained CNNs from scratch was proposed. It used Batch Normalization (BN) to transform the activations of the internal hidden units. In general, both the CP and the BN decomposition schemes in [40] (BN Low-rank) can be used to train CNNs from scratch. However, there are a few differences between them. For example, finding the best low-rank approximation in CP decomposition is an ill-posed problem, and the best rank-K (K is the rank number) approximation may not exist, while for the BN scheme the decomposition always exists. We perform a simple comparison of both methods in Table II; the actual speedup and the compression rates are used to measure their performance.

Fig. 2. A typical framework of the low-rank regularization method. The left is the original convolutional layer and the right is the low-rank constrained convolutional layer with rank K.

TABLE II
COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES ON ILSVRC-2012.

Model | Top-5 accuracy | Speed-up | Compression rate
AlexNet | 80.03% | 1. | 1.
BN Low-rank | 80.56% | 1.09 | 4.94
CP Low-rank | 79.66% | 1.82 | 5.
VGG-16 | 90.60% | 1. | 1.
BN Low-rank | 90.47% | 1.53 | 2.72
CP Low-rank | 90.31% | 2.05 | 2.75
GoogleNet | 92.21% | 1. | 1.
BN Low-rank | 91.88% | 1.08 | 2.79
CP Low-rank | 91.79% | 1.20 | 2.84

As mentioned before, fully connected layers can be viewed as a 2D matrix, so the above-mentioned methods can also be applied there. There are several classical works on exploiting low-rankness in fully connected layers. For instance, Misha et al. [41] reduced the number of dynamic parameters in deep models using the low-rank method. [42] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling. In [3], Lu et al. adopted truncated SVD (singular value decomposition) to decompose the fully connected layer for designing compact multi-task deep learning architectures.

Drawbacks: low-rank approaches are straightforward for model compression and acceleration, and the idea complements recent advances in deep learning such as dropout, rectified units and maxout. However, the implementation is not that easy, since it involves a decomposition operation, which is computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when compared to the original model.
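A minimal sketch of the low-rank idea for fully connected layers (in the spirit of the truncated-SVD approach mentioned for [3], [42]): the weight matrix is factorized as W ≈ A B with rank k, so one m x n product is replaced by two thinner products and the parameter count drops from mn to k(m + n). The matrix size and rank below are arbitrary assumptions, and a trained weight matrix would be far closer to low rank than this random example.

import numpy as np

m, n, k = 1024, 512, 64          # assumed layer size and target rank
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))  # hypothetical fully connected weight matrix
x = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]             # m x k factor
B = Vt[:k, :]                    # k x n factor

y_full = W @ x                   # m*n multiply-accumulates
y_lowrank = A @ (B @ x)          # k*(m + n) multiply-accumulates

params_full = m * n
params_lowrank = k * (m + n)
print("compression rate: %.2fx" % (params_full / params_lowrank))
print("relative error  : %.3f" % (np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full)))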
IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS

CNNs are parameter efficient because they exploit the translation-invariant property of the representations with respect to the input image, which is the key to training very deep models without severe over-fitting. Although a strong theory is currently missing, a large amount of empirical evidence supports the notion that both the translation-invariant property and convolutional weight sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent work in [43], which introduced equivariant group theory. Let x be an input, \Phi(\cdot) be a network or layer and T(\cdot) be a transform matrix. The concept of equivariance is defined as

T' \Phi(x) = \Phi(T x),   (3)

indicating that transforming the input x by the transform T(\cdot) and then passing it through the network or layer \Phi(\cdot) should give the same result as first mapping x through the network and then transforming the representation. Note that in Eq. (3) the transforms T(\cdot) and T'(\cdot) are not necessarily the same, as they operate on different objects. According to this theory, it is reasonable to apply transforms to layers or filters \Phi(\cdot) to compress the whole network model. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters obtained by applying a certain transform T(\cdot) to a small set of base filters, since this acts as a regularizer for the model.

Following this direction, many recent works propose to build a convolutional layer from a set of base filters [43]-[46]. What they have in common is that the transform T(\cdot) lies in the family of functions that only operate in the spatial domain of the convolutional filters. For example, the work in [45] found that the lower convolution layers of CNNs learned redundant filters to extract both positive and negative phase information of an input signal, and defined T(\cdot) to be the simple negation function:

T(W_x) = W_x^-,   (4)

where W_x is the basis convolutional filter and W_x^- is the filter consisting of the shifts whose activation is opposite to that of W_x and selected after the max-pooling operation. By doing this, the work in [45] can easily achieve a 2x compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer that improves the classification accuracy. The intuition is that the learning algorithm with a pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones.

In [46], it was observed that the magnitudes of the responses from convolutional kernels had a wide diversity of pattern representations in the network, and that it was not proper to discard weaker signals with a single threshold. Thus a multi-bias non-linearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform T(\cdot) was defined as

T' \Phi(x) = W x + \delta,   (5)

where \delta are the multi-bias factors. The work in [47] considered a combination of rotation by a multiple of 90 degrees and horizontal/vertical flipping, with

T' \Phi(x) = W^{T_\theta},   (6)

where W^{T_\theta} is the transformation matrix which rotates the original filters by an angle \theta \in \{90, 180, 270\} degrees. In [43], the transform was generalized to any angle learned from data, and \theta was directly obtained from data. Both works [47] and [43] can achieve good classification performance.

The work in [44] defined T(\cdot) as the set of translation functions applied to 2D filters:

T' \Phi(x) = T(\cdot, x, y),  x, y \in \{-k, \dots, k\},\ (x, y) \neq (0, 0),   (7)

where T(\cdot, x, y) denotes the translation of the first operand by (x, y) along its spatial dimensions, with proper zero padding at the borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy as a regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly varying the architectures to compress networks.

Table III briefly compares the performance of different methods with transferred convolutional filters, using VGGNet (16 layers) as the baseline model. The results are reported on the CIFAR-10 and CIFAR-100 datasets with top-5 error. It is observed that they can achieve a reduction in parameters with little or no drop in classification accuracy.

TABLE III
A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND CIFAR-100.

Model | CIFAR-100 | CIFAR-10 | Compression rate
VGG-16 | 34.26% | 9.85% | 1.
MBA [46] | 33.66% | 9.76% | 2.
CRELU [45] | 34.57% | 9.92% | 2.
CIRC [43] | 35.15% | 10.23% | 4.
DCNN [44] | 33.57% | 9.65% | 1.62

Drawbacks: a few issues need to be addressed for approaches that apply transform constraints to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not for thin/deep ones (like GoogleNet, ResNet). Secondly, the transfer assumptions are sometimes too strong to guide the learning, making the results unstable in some cases.

Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace loose and over-parametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing a 3x3 convolution into two 1x1 convolutions was used in [48], which achieved significant acceleration on object recognition. SqueezeNet [49] was proposed to replace 3x3 convolution with 1x1 convolution, which created a compact neural network with about 50x fewer parameters and comparable accuracy when compared to AlexNet.
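The negation transform of Eq. (4) can be illustrated with a small sketch: only a base filter bank is stored, and the effective bank is the base filters concatenated with their negations, doubling the number of feature maps at no extra parameter cost. Treating inputs as flattened patches and applying filters with a plain matrix product is a simplifying assumption made purely for illustration; it is not the exact formulation of [45].

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
patches = rng.standard_normal((100, 25))   # 100 flattened 5x5 input patches (assumed)
W_base = rng.standard_normal((8, 25))      # 8 learned base filters, flattened

# Transferred filter bank: each base filter plus its negation, i.e. T(W) = -W.
W_full = np.concatenate([W_base, -W_base], axis=0)   # 16 effective filters

responses = relu(patches @ W_full.T)       # doubled feature maps, same stored parameters

print("stored filters   :", W_base.shape[0])
print("effective filters:", W_full.shape[0])
print("response shape   :", responses.shape)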
V. KNOWLEDGE DISTILLATION

To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first proposed by Caruana et al. [50]. They trained a compressed/ensemble model of strong classifiers with pseudo-data labeling, and reproduced the output of the original larger network; however, the work is limited to shallow models. The idea has recently been adopted in [51] as knowledge distillation (KD) to compress deep and wide networks into shallower ones, where the compressed model mimics the function learned by the complex model. The main idea of KD based approaches is to shift knowledge from a large teacher model into a small one by learning the class distributions output via softmax.

The work in [52] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm in which the student is penalized according to a softened version of the teacher's output. The framework compressed an ensemble of teacher networks into a student network of similar depth, and the student was trained to predict the output as well as the classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [53] aimed to address the network compression problem by taking advantage of depth; it proposed an approach to train thin but deep networks, called FitNets, to compress wide and shallower (but still deep) networks, extending the idea to allow for thinner and deeper student models. In order to learn from the intermediate representations of the teacher network, FitNet makes the student mimic the full feature maps of the teacher. However, such assumptions are too strict, since the capacities of teacher and student may differ greatly.

All of the above approaches are validated on the MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets, and experimental results show that these methods match or outperform the teacher's performance while requiring notably fewer parameters and multiplications.

There are several extensions along this direction of distilling knowledge. The work in [54] trained a parametric student model to approximate a Monte Carlo teacher; the proposed framework used online training and deep neural networks for the student model. Different from previous works, which represented the knowledge using the softened label probabilities, [55] represented the knowledge by using the neurons in the higher hidden layer, which preserve as much information as the label probabilities but are more compact. The work in [56] accelerated the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network; the techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [57] proposed Attention Transfer (AT) to relax the assumption of FitNet: they transferred the attention maps, which are summaries of the full activations.

Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One is that KD can only be applied to classification tasks with a softmax loss function, which hinders its usage. Another drawback is that the model assumptions are sometimes too strict to make the performance competitive with other types of approaches.
VI. OTHER TYPES OF APPROACHES

We first summarize works utilizing attention-based methods. The attention mechanism [58] can reduce computations significantly by learning to selectively focus, or "attend", on a few task-relevant input regions. The work in [59] introduced the dynamic capacity network (DCN), which combined two types of modules: small sub-networks with low capacity and large ones with high capacity. The low-capacity sub-networks were active on the whole input to first find the task-relevant areas, and then the attention mechanism was used to direct the high-capacity sub-networks to focus on those task-relevant regions. By doing this, the size of the CNN model can be significantly reduced.

Following this direction, the work in [60] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a sparsely-gated mixture-of-experts layer (MoE). The MoE module consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of the experts to process each input. In [61], dynamic deep neural networks (D2NN) were introduced, a type of feed-forward deep neural network that selects and executes a subset of D2NN neurons based on the input.

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [44], [62]. Network architectures such as GoogleNet or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this idea. However, these architectures have not fully optimized the utilization of the computing resources inside the network. This problem was noted by Szegedy et al. [62] and motivated them to increase the depth and width of the network while keeping the computational budget constant.

The work in [63] targeted Residual Network based models with spatially varying computation time, called stochastic depth, which enables the seemingly contradictory setup of training short networks and using deep networks at test time. It starts with very deep networks and, during training, for each mini-batch randomly drops a subset of layers and bypasses them with the identity function. Following this direction, the work in [64] proposed pyramidal residual networks with separated stochastic depth. In [65], Wu et al. proposed an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation. Veit et al. exploited convolutional networks with adaptive inference graphs to adaptively define their network topology conditioned on the input image [66].

Other approaches to reduce the convolutional overhead include using FFT based convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the pooling operations via a more general stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. These works only aim to speed up the computation, not to reduce the memory storage.
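The stochastic-depth idea of [63] discussed above can be sketched in a few lines: during training each residual block is kept with some survival probability and otherwise bypassed by the identity, and at test time each block's output is scaled by that probability. The toy linear "blocks", sizes, and the linearly decaying survival schedule below are assumptions for illustration, not the cited architecture.

import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W):
    return np.maximum(W @ x, 0.0)          # toy block: linear map + ReLU

def stochastic_depth_forward(x, blocks, p_survive, training=True):
    for W, p in zip(blocks, p_survive):
        if training:
            if rng.random() < p:           # keep the block with probability p
                x = x + residual_block(x, W)
            # else: identity shortcut only (block is dropped for this mini-batch)
        else:
            x = x + p * residual_block(x, W)   # expected contribution at test time
    return x

d, n_blocks = 16, 6
blocks = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_blocks)]
# Linearly decaying survival probabilities from 1.0 down to 0.5 (an assumption).
p_survive = np.linspace(1.0, 0.5, n_blocks)

x = rng.standard_normal(d)
print("train output:", stochastic_depth_forward(x, blocks, p_survive, training=True)[:3])
print("test  output:", stochastic_depth_forward(x, blocks, p_survive, training=False)[:3])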
VII. BENCHMARKS, EVALUATION AND DATABASES

In the past five years the deep learning community has made great efforts on benchmark models. One of the most well-known models used in compression and acceleration for CNNs is AlexNet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures are used as baseline models in many works, including network in network (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly used in several typical compression methods.

TABLE IV
SUMMARY OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.

Baseline model | Representative works
AlexNet [1] | structural matrix [29], [30], [32]; low-rank factorization [40]
Network in network [73] | low-rank factorization [40]
VGG nets [74] | transferred filters [44]; low-rank factorization [40]
Residual networks [75] | compact filters [49]; stochastic depth [63]; parameter sharing [24]
All-CNN-nets [72] | transferred filters [45]
LeNets [71] | parameter sharing [24]; parameter pruning [20], [22]

The standard criteria to measure the quality of model compression and acceleration are the compression rate and the speedup rate. Assume that a is the number of parameters in the original model M and a* is that of the compressed model M*; then the compression rate \alpha(M, M^*) of M* over M is

\alpha(M, M^*) = a / a^*.   (8)

Another widely used measurement is the index space saving, defined in several papers [30], [35] as

\beta(M, M^*) = (a - a^*) / a^*,   (9)

where a and a* are the number of dimensions of the index space in the original model and that of the compressed model, respectively.

Similarly, given the running time s of M and s* of M*, the speedup rate \delta(M, M^*) is defined as

\delta(M, M^*) = s / s^*.   (10)

Most works used the average training time per epoch to measure the running time, while in [30], [35] the average testing time was used. Generally, the compression rate and the speedup rate are highly correlated, as smaller models often result in faster computation for both the training and the testing stages.

Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computational time. However, for different applications with different CNN designs, the relation between parameter size and computational time may differ. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers, while for image classification tasks floating-point operations are mainly in the first few convolutional layers, since each filter is convolved with the whole image, which is usually very large at the beginning. Thus compression and acceleration of the network should focus on different types of layers for different applications.
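The measurements in Eqs. (8)-(10) are straightforward to compute once parameter counts and running times are available; the helper functions below simply restate them, using made-up numbers for a hypothetical model pair.

def compression_rate(a_orig, a_comp):
    """alpha(M, M*) = a / a*   (Eq. 8)"""
    return a_orig / a_comp

def space_saving(a_orig, a_comp):
    """beta(M, M*) = (a - a*) / a*   (Eq. 9)"""
    return (a_orig - a_comp) / a_comp

def speedup_rate(s_orig, s_comp):
    """delta(M, M*) = s / s*   (Eq. 10)"""
    return s_orig / s_comp

# Hypothetical numbers: 61M -> 6.7M parameters, 95 ms -> 31 ms per image.
print("compression rate:", round(compression_rate(61e6, 6.7e6), 2))
print("space saving    :", round(space_saving(61e6, 6.7e6), 2))
print("speedup rate    :", round(speedup_rate(95.0, 31.0), 2))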
VIII. DISCUSSION AND CHALLENGES

In this paper, we summarized recent efforts on compressing and accelerating deep neural networks (DNNs). Here we discuss in more detail how to choose among the different compression approaches, and possible challenges/solutions in this area.

A. General Suggestions

There is no golden rule to decide which approach is best. How to choose the proper method really depends on the applications and requirements. Here is some general guidance we can provide:

- If the application needs compact models derived from pre-trained models, you can choose either pruning & sharing or low-rank factorization based methods. If you need end-to-end solutions for your problem, the low-rank and transferred convolutional filter approaches could be considered.
- For applications in some specific domains, methods with human priors (like the transferred convolutional filters and structural matrix) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters could work well, as medical images (like organs) do have the rotation transformation property.
- Usually the pruning & sharing approaches give a reasonable compression rate without hurting the accuracy. Thus for applications that require stable model accuracy, it is better to utilize pruning & sharing.
- If your problem involves small/medium-size datasets, you can try the knowledge distillation approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust on datasets which are not large.
- As we mentioned before, techniques of the four groups are orthogonal. It is reasonable to combine two or three of them to maximize the performance. For some specific applications, like object detection, which require both convolutional and fully connected layers, you can compress the convolutional layers with a low-rank based method and the fully connected layers with a pruning technique.
B. Technique Challenges

Techniques for deep model compression and acceleration are still at an early stage, and the following challenges still need to be addressed.

- Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, more plausible ways to configure the compressed models should be provided.
- Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging, because removing channels might dramatically change the input of the following layer.
- As we mentioned before, methods of structural matrix and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of such prior knowledge.
- The methods of knowledge distillation provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worth developing KD-based approaches and exploring how to improve their performance.
- Hardware constraints on various small platforms (e.g., mobile, robotic, self-driving car) remain a major problem hindering the extension of deep CNNs. How to make full use of the limited computational resources and how to design special compression methods for such platforms are still challenges that need to be addressed.
- Despite the great achievements of these compression approaches, the black-box mechanism is still the key barrier to adoption. Exploring knowledge interpretability is still an important problem.

C. Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on the recent learning-to-learn strategies [76], [77]. This framework provides a mechanism that allows the algorithm to automatically learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement learning to efficiently sample the design space and improve the model compression has also been tried [78].

Channel pruning provides efficiency benefits on both CPU and GPU because no special implementation is required, but it is also challenging to handle the input configuration. One possible solution is to use the training-based channel pruning methods [79], which focus on imposing sparse constraints on weights during training. However, training from scratch for such methods is costly for very deep CNNs. In [80], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.
Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the knowledge distillation (KD) approaches. Instead of directly reducing and transferring parameters, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task [81], [82]. The intuition is that if a neuron is activated in certain regions or samples, this implies that these regions or samples share some common properties that may relate to the task.

For methods with the convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed-prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2D filters or the matrix, and 2) learn the transformation jointly with all the model parameters.

Regarding the use of CNNs on small platforms, proposing some general/unified approaches is one direction. Wang et al. [83] presented a feature map dimensionality reduction method that excavates and removes redundancy in feature maps generated from different filters, while preserving intrinsic information of the original network. The idea can be applied to make CNNs more applicable to different platforms. The work in [84] proposed a one-shot whole-network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work on mobile devices.

Beyond the classification task, people are also adapting the compacted models to other tasks [85]-[87]. We would like to see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).

IX. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying the paper. This research is supported by the National Science Foundation of China with Grant number 61401169.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, "Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification," CoRR, vol. abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in NIPS, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," CoRR, vol. abs/1412.6115, 2014.
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in CVPR, 2016.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in ICML, 2015, pp. 1737-1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," CoRR, vol. abs/1612.01543, 2016.
[12] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in NIPS, 2015, pp. 3123-3131.
[13] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in ECCV, 2016.
[15] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," CoRR, vol. abs/1606.01981, 2016.
[16] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," CoRR, vol. abs/1611.01600, 2016.
[17] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," CoRR, vol. abs/1510.03009, 2015.
[18] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in NIPS 1, 1989, pp. 177-185.
[19] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in NIPS 2, 1990, pp. 598-605.
[20] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in NIPS 5, 1993, pp. 164-171.
[21] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," in BMVC, 2015, pp. 31.1-31.12.
[22] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
[23] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," JMLR Workshop and Conference Proceedings, 2015.
[24] K. Ullrich, E. Meeds, and M. Welling, "Soft weight-sharing for neural network compression," CoRR, vol. abs/1702.04008, 2017.
[25] V. Lebedev and V. S. Lempitsky, "Fast ConvNets using group-wise brain damage," in CVPR, 2016, pp. 2554-2564.
[26] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in ECCV, 2016, pp. 662-677.
[27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NIPS, 2016, pp. 2074-2082.
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," CoRR, vol. abs/1608.08710, 2016.
[29] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in NIPS, 2015, pp. 3088-3096.
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in ICCV, 2015.
[31] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, "Fast neural networks with circulant projections," CoRR, vol. abs/1502.03436, 2015.
[32] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, "Deep fried convnets," in ICCV, 2015.
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidelberg: Springer, 1991, pp. 215-236.
[34] M. V. Rakhuba and I. V. Oseledets, "Fast multidimensional convolution in low-rank tensor formats via cross approximation," SIAM J. Scientific Computing, vol. 37, no. 2, 2015.
[35] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "ACDC: A structured efficient linear layer," in ICLR, 2016.
[36] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in CVPR, 2013, pp. 2754-2761.
[37] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014, pp. 1269-1277.
[38] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in BMVC, 2014.
[39] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," CoRR, vol. abs/1412.6553, 2014.
[40] C. Tai, T. Xiao, X. Wang, and W. E, "Convolutional neural networks with low-rank regularization," CoRR, vol. abs/1511.06067, 2015.
[41] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, "Predicting parameters in deep learning," in NIPS, 2013, pp. 2148-2156.
[42] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in ICASSP, 2013.
[43] T. S. Cohen and M. Welling, "Group equivariant convolutional networks," arXiv preprint arXiv:1602.07576, 2016.
[44] S. Zhai, Y. Cheng, and Z. M. Zhang, "Doubly convolutional neural networks," in NIPS, 2016, pp. 1082-1090.
[45] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv preprint arXiv:1603.05201, 2016.
[46] H. Li, W. Ouyang, and X. Wang, "Multi-bias non-linear activation in deep neural networks," arXiv preprint arXiv:1604.00676, 2016.
[47] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, "Exploiting cyclic symmetry in convolutional neural networks," in ICML, 2016.
[48] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016.
[49] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," CoRR, vol. abs/1612.01051, 2016.
[50] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, "Model compression," in KDD, 2006, pp. 535-541.
[51] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in NIPS, 2014, pp. 2654-2662.
[52] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," CoRR, vol. abs/1503.02531, 2015.
[53] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," CoRR, vol. abs/1412.6550, 2014.
[54] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," in NIPS, 2015, pp. 3420-3428.
[55] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, "Face model compression by distilling knowledge from neurons," in AAAI, 2016, pp. 3560-3566.
[56] T. Chen, I. J. Goodfellow, and J. Shlens, "Net2Net: Accelerating learning via knowledge transfer," CoRR, vol. abs/1511.05641, 2015.
[57] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," CoRR, vol. abs/1612.03928, 2016.
[58] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[59] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, "Dynamic capacity networks," in ICML, 2016, pp. 2549-2558.
[60] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," 2017.
[61] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583-1597, 2016.
[62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[63] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," 2016.
[64] Y. Yamada, M. Iwamura, and K. Kise, "Deep pyramidal residual networks with separated stochastic depth," CoRR, vol. abs/1612.01230, 2016.
[65] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, "BlockDrop: Dynamic inference paths in residual networks," in CVPR, 2018.
[66] A. Veit and S. Belongie, "Convolutional networks with adaptive inference graphs," 2018.
[67] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," 2014.
[68] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in CVPR, 2016, pp. 4013-4021.
Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Yu, “Ibm research and columbia university trecvid-2012 multimedia - Feris, “S3pool: Pooling with stochastic spatial sampling,”CoRR, vol. event detection (med), multimedia event recounting (mer), and semantic - abs/1611.05138, 2016. indexing (sin) systems,” 2012. - [70]F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving - pooling in deep networks,” inProceedings of the IEEE Conference on - Computer Vision and Pattern Recognition, 2018. Yu Cheng(yu.cheng@microsoft.com) currently is a - [71]Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning Researcher at Microsoft. Before that, he was a Re- - applied to document recognition,” inProceedings of the IEEE, 1998, pp. search Staff Member at IBM T.J. Watson Research - 2278–2324. Center. Yu got his Ph.D. from Northwestern Univer- - [72]J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Ried- sity in 2015 and bachelor from Tsinghua University - miller, “Striving for simplicity: The all convolutional net,”CoRR, vol. in 2010. His research is about deep learning in - abs/1412.6806, 2014. general, with specific interests in the deep generative - [73]M. Lin, Q. Chen, and S. Yan, “Network in network,” inICLR, 2014. model, model compression, and transfer learning. - [74]K. Simonyan and A. Zisserman, “Very deep convolutional networks for He regularly serves on the program committees of - large-scale image recognition,”CoRR, vol. abs/1409.1556, 2014. top-tier AI conferences such as NIPS, ICML, ICLR, - [75]K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image CVPR and ACL. - recognition,”arXiv preprint arXiv:1512.03385, 2015. - [76]M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, - D. Pfau, T. Schaul, and N. de Freitas, “Learning to learn by gradient - descent by gradient descent,” inNeural Information Processing Systems - (NIPS), 2016. Duo Wang (d-wang15@mail.tsinghua.edu.cn) re-[77]D. Ha, A. Dai, and Q. Le, “Hypernetworks,” inInternational Conference ceived the B.S. degree in automation from theon Learning Representations 2016, 2016. Harbin Institute of Technology, China, in 2015.[78]Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl Currently he is purchasing his Ph.D. degree at thefor model compression and acceleration on mobile devices,” inThe Department of Automation, Tsinghua University,European Conference on Computer Vision (ECCV), September 2018. Beijing, P.R. China. Currently his research interests[79]J. M. Alvarez and M. Salzmann, “Learning the number of neurons in are about deep learning, particularly in few-shotdeep networks,” pp. 2270–2278, 2016. learning and deep generative models. He also works[80]Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating on a lot of applications in computer vision andvery deep neural networks,” inThe IEEE International Conference on robotics vision.Computer Vision (ICCV), Oct 2017. - [81]Z. Huang and N. Wang, “Data-driven sparse structure selection for deep - neural networks,”ECCV, 2018. - [82]Y. Chen, N. Wang, and Z. Zhang, “Darkrank: Accelerating deep metric - learning via cross sample similarities transfer,” inProceedings of the - Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), - New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 2852– Pan Zhou(panzhou@hust.edu.cn) is currently an - 2859. associate professor with School of Electronic In- - [83]Y. Wang, C. Xu, C. Xu, and D. 
Tao, “Beyond filters: Compact feature formation and Communications, Wuhan, China. He - map for portable deep model,” inProceedings of the 34th International received his Ph.D. in the School of Electrical and - Conference on Machine Learning, ser. Proceedings of Machine Learning Computer Engineering at the Georgia Institute of - Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Technology in 2011. Before that, he received his - Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. B.S. degree in theAdvanced Classof HUST, and - 3703–3711. a M.S. degree in the Department of Electronics - [84]Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression and Information Engineering from HUST, Wuhan, - of deep convolutional neural networks for fast and low power mobile China, in 2006 and 2008, respectively. His current - applications,”CoRR, vol. abs/1511.06530, 2015. research interest includes big data analytics and - [85]G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient machine learning, security and privacy, and information networks. - object detection models with knowledge distillation,” inAdvances in - Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, - S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, - Eds., 2017, pp. 742–751. - [86]M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, Tao Zhang (taozhang@mail.tsinghua.edu.cn) ob- - “Mobilenetv2: Inverted residuals and linear bottlenecks,” inThe IEEE tained his B.S., M.S., and Ph.D. degrees from Ts- - Conference on Computer Vision and Pattern Recognition (CVPR), June inghua University, Beijing, China, in 1993, 1995, - 2018. and 1999, respectively, and another Ph.D. degree - [87]J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, from Saga University, Saga, Japan, in 2002, all in - Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy control engineering. He is currently a Professor with - trade-offs for modern convolutional object detectors,” in2017 IEEE the Department of Automation, Tsinghua University. - Conference on Computer Vision and Pattern Recognition, CVPR 2017, He serves the Associate Dean, School of Information - Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 3296–3297. Science and Technology and Head of the Department - [88]Y. Cheng, Q. Fan, S. Pankanti, and A. Choudhary, “Temporal sequence of Automation. His current research interests include - modeling for video event detection,” in The IEEE Conference on artificial intelligence, robotics, image processing, - Computer Vision and Pattern Recognition (CVPR), June 2014. control theory, and control of spacecraft. \ No newline at end of file diff --git a/Corpus/A guide to convolution arithmetic for deep learning.txt b/Corpus/A guide to convolution arithmetic for deep learning.txt deleted file mode 100644 index a47ff7f..0000000 Binary files a/Corpus/A guide to convolution arithmetic for deep learning.txt and /dev/null differ diff --git a/Corpus/Analysis and Design of Echo State Networks.txt b/Corpus/Analysis and Design of Echo State Networks.txt deleted file mode 100644 index ec72712..0000000 --- a/Corpus/Analysis and Design of Echo State Networks.txt +++ /dev/null @@ -1,1298 +0,0 @@ - LETTER Communicated by Herbert Jaeger - - - - Analysis and Design of Echo State Networks - - - Mustafa C. 
Ozturk - can@cnel.ufl.edu - Dongming Xu - dmxu@cnel.ufl.edu - JoseC.Pr´ ´ıncipe - principe@cnel.ufl.edu - Computational NeuroEngineering Laboratory, Department of Electrical and - Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A. - - - The design of echo state network (ESN) parameters relies on the selec- - tion of the maximum eigenvalue of the linearized system around zero - (spectral radius). However, this procedure does not quantify in a sys- - tematic manner the performance of the ESN in terms of approximation - error. This article presents a functional space approximation framework - to better understand the operation of ESNs and proposes an information- - theoretic metric, the average entropy of echo states, to assess the richness - of the ESN dynamics. Furthermore, it provides an interpretation of the - ESN dynamics rooted in system theory as families of coupled linearized - systems whose poles move according to the input signal dynamics. With - this interpretation, a design methodology for functional approximation - is put forward where ESNs are designed with uniform pole distributions - covering the frequency spectrum to abide by the richness metric, irre- - spective of the spectral radius. A single bias parameter at the ESN input, - adapted with the modeling error, configures the ESN spectral radius to - the input-output joint space. Function approximation examples compare - the proposed design methodology versus the conventional design. - - - 1 Introduction - - Dynamic computational models require the ability to store and access the - time history of their inputs and outputs. The most common dynamic neural - architecture is the time-delay neural network (TDNN) that couples delay - lines with a nonlinear static architecture where all the parameters (weights) - are adapted with the backpropagation algorithm. The conventional delay - line utilizes ideal delay operators, but delay lines with local first-order re- - cursive filters have been proposed by Werbos (1992) and extensively stud- - ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, - 1993). Chains of first-order integrators are interesting because they effec- - tively decrease the number of delays necessary to create time embeddings - - - Neural Computation19, 111–138(2007) C 2006 Massachusetts Institute of Technology 112 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - (Principe, 2001). Recurrent neural networks (RNNs) implement a differ- - ent type of embedding that is largely unexplored. RNNs are perhaps the - most biologically plausible of the artificial neural network (ANN) models - (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), - but are not well understood theoretically (Siegelmann & Sontag, 1991; - Siegelmann, 1993; Kremer, 1995). One of the main practical problems with - RNNs is the difficulty to adapt the system weights. Various algorithms, - such as backpropagation through time (Werbos, 1990) and real-time recur- - rent learning (Williams & Zipser, 1989), have been proposed to train RNNs; - however, these algorithms suffer from computational complexity, resulting - in slow training, complex performance surfaces, the possibility of instabil- - ity, and the decay of gradients through the topology and time (Haykin, - 1998). The problem of decaying gradients has been addressed with spe- - cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). 
Alter- - native second-order training methods based on extended Kalman filtering - (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, - Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp - et al., 1998) provide more reliable performance and have enabled practical - applications in identification and control of dynamical systems (Kechri- - otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, - Kambhampati, & Warwick, 1995). - Recently,twonewrecurrentnetworktopologieshavebeenproposed:the - echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and - the liquid state machine (LSM) by Maass (Maass, Natschlager, & Markram,¨ - 2002). ESNs possess a highly interconnected and recurrent topology of - nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) - and contain information about the history of input and output patterns. - The outputs of these internal PEs (echo states) are fed to a memoryless but - adaptive readout network (generally linear) that produces the network out- - put. The interesting property of ESN is that only the memoryless readout is - trained, whereas the recurrent topology has fixed connection weights. This - reduces the complexity of RNN training to simple linear regression while - preserving a recurrent topology, but obviously places important constraints - in the overall architecture that have not yet been fully studied. Similar ideas - have been explored independently by Maass and formalized in the LSM - architecture. LSMs, although formulated quite generally, are mostly im- - plemented as neural microcircuits of spiking neurons (Maass et al., 2002), - whereas ESNs are dynamical ANN models. Both attempt to model biolog- - ical information processing using similar principles. We focus on the ESN - formulation in this letter. - The echo state condition is defined in terms of the spectral radius (the - largest among the absolute values of the eigenvalues of a matrix, denoted - by·) of the reservoir’s weight matrix (W<1). This condition states - that the dynamics of the ESN is uniquely controlled by the input, and the - effect of the initial states vanishes. The current design of ESN parameters Analysis and Design of Echo State Networks 113 - - - relies on the selection of spectral radius. However, there are many possible - weight matrices with the same spectral radius, and unfortunately they do - not all perform at the same level of mean square error (MSE) for functional - approximation. A similar problem exists in the design of the LSM. LSMs - have been shown to possess universal approximation given the separation - property (SP) for the liquid (reservoir in ESNs) and the approximation - property (AP) for the readout (Maass et al., 2002). SP is quantified by a - kernel-quality measure proposed in Maass, Legenstein, and Bertschinger - (2005) that is based on the rank of a matrix formed by the system states - corresponding to different input signals. The kernel quality is a measure - for the complexity and diversity of nonlinear operations carried out by the - liquid on its input stream in order to boost the classification power of a - subsequent linear decision hyperplane (Maass et al., 2005). A variation of - SP has been proposed in Bertschinger and Natschlager (2004), and it has¨ - been argued that complex calculations can be best carried out by networks - on the boundary between ordered and chaotic dynamics. 
- Inthisletter,weareinterestedinstudyingtheESNforfunctionalapprox- - imation (filters that map input functionsu(·) of time on output functionsy(·) - of time). We see two major shortcomings with the current ESN approach - that uses echo state condition as a design principle. First, the impact of fixed - reservoir parameters for function approximation means that the informa- - tion about the desired response is conveyed only to the output projection. - This is not optimal, and strategies to select different reservoirs for different - applications have not been devised. Second, imposing a constraint only on - the spectral radius is a weak condition to properly set the parameters of - the reservoir, as experiments show (different randomizations with the same - spectral radius perform differently for the same problem; see Figure 2). - This letter aims to address these two problems by proposing a frame- - work, a metric, and a design principle for ESNs. The framework is a signal - processing interpretation of basis and projections in functional spaces to - describe and understand the ESN architecture. According to this interpre- - tation, the ESN states implement a set of basis functionals (representation - space) constructed dynamically by the input, while the readout simply - projects the desired response onto this representation space. The metric - to describe the richness of the ESN dynamics is an information-theoretic - quantity, the average state entropy (ASE). Entropy measures the amount of - information contained in a given random variable (Shannon, 1948). Here, - the random variable is the instantaneous echo state from which the en- - tropy for the overall state (vector) is estimated. The probability density - function (pdf) in a differential geometric framework should be thought of - as a volume form; that is, in our case, the pdf of the state vector describes - the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) - established information as a coordinate free metric in the state manifold. - Therefore, entropy becomes a global descriptor of information that quanti- - fies the volume of the manifold defined by the random variable. Due to the 114 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - time dependency of the states, the state entropy averaged over time (ASE) - is an appropriate estimate of the volume of the state manifold. - The design principle specifies that one should consider independently - thecorrelationamongthebasisandthespectralradius.Intheabsenceofany - information about the desired response, the ESN states should be designed - with the highest ASE, independent of the spectral radius. We interpret the - ESN dynamics as a combination of time-varying linear systems obtained - from the linearization of the ESN nonlinear PE in a small, local neighbor- - hood of the current state. The design principle means that the poles of the - linearized ESN reservoir should have uniform pole distributions to gener- - ate echo states with the most diverse pole locations (which correspond to - the uniformity of time constants). Effectively, this will create the least cor- - related bases for a given spectral radius, which corresponds to the largest - volume spanned by the basis set. When the designer has no other informa- - tion about the desired response to set the basis, this principle distributes - the system’s degrees of freedom uniformly in space. It approximates for - ESNs the well-known property of orthogonal basis. 
The unresolved issue - that ASE does not quantify is how to set the spectral radius, which depends - again on the desired mapping. The concept of memory depth as explained - in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the - issues associated with the spectral radius. The correlation time of the de- - siredresponse(asestimatedbythefirstzerooftheautocorrelationfunction) - gives an indication of the type of spectral radius required (long correlation - time requires high spectral radius). Alternatively, a simple adaptive bias is - added at the ESN input to control the spectral radius integrating the infor- - mation from the input-output joint space in the ESN bases. For sigmoidal - PEs, the bias adjusts the operating points of the reservoir PEs, which has - the net effect of adjusting the volume of the state manifold as required to - approximate the desired response with a small error. This letter shows that - ESNs designed with this strategy obtain systematically better results in a - set of experiments when compared with the conventional ESN design. - - - 2 Analysis of Echo State Networks - - 2.1 Echo States as Bases and Projections.Let us consider the ar- - chitecture and recursive update equation of a typical ESN more closely. - Consider the recurrent discrete-time neural network given in Figure 1 - withMinput units,Ninternal PEs, andLoutput units. The value of - the input unit at timenisu(n)=[u1 (n),u2 (n),...,uM (n)] T , of internal - units arex(n)=[x1 (n),x2 (n),...,xN (n)] T , and of output units arey(n)= - [y1 (n),y2 (n),...,yL (n)] T . The connection weights are given in anN×M - weight matrixWin =(win ) for connections between the input and the inter- ij nalPEs,inanN×NmatrixW=(wij ) for connections between the internal - PEs, in anL×NmatrixWout =(wout ) for connections from PEs to the ij Analysis and Design of Echo State Networks 115 - - - Input Layer Dynamical Reservoir Read-out - - Win WW out - - - - - - - - x(n) u(n) - - . + - . y(n) - - - - - - - - Wback - - - Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed- - weight (W<1) recurrent network and a linear readout. The recurrent net- - work is a reservoir of highly interconnected dynamical components, states of - which are called echo states. The memoryless linear readout is trained to pro- - duce the output. - - - output units, and in anN×LmatrixWback =(wback ) for the connections ij that project back from the output to the internal PEs (Jaeger, 2001). The - activation of the internal PEs (echo state) is updated according to - - x(n+1)=f(Win u(n+1)+Wx(n)+Wback y(n)), (2.1) - - wheref=(f1 ,f2 ,...,fN )aretheinternalPEs’activationfunctions.Here,all - f e−x - i ’s are hyperbolic tangent functions ( ex − ). The output from the readout ex +e−x - network is computed according to - - y(n+1)=fout (Wout x(n+1)), (2.2) - - wherefout =(fout ,fout ,...,fout ) are the output unit’s nonlinear functions 1 2 L (Jaeger, 2001, 2002a). Generally, the readout is linear sofout is identity. - ESNs resemble the RNN architecture proposed in Puskorius and - Feldkamp (1996) and also used by Sanchez (2004) in brain-machine 116 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - interfaces. The critical difference is the dimensionality of the hidden re- - current PE layer and the adaptation of the recurrent weights. 
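To make the notation above concrete, the following is a minimal sketch of the state update in equation 2.1 and the linear readout in equation 2.2, assuming NumPy and tanh PEs; the helper names esn_states and readout are illustrative, not from the original letter.

```python
import numpy as np

def esn_states(u, Win, W, Wback=None, y=None):
    """Reservoir update of equation 2.1:
    x(n+1) = f(Win u(n+1) + W x(n) + Wback y(n)), with f = tanh.
    u is a (T, M) input sequence; returns the (T, N) echo states."""
    T, N = u.shape[0], W.shape[0]
    X = np.zeros((T, N))
    x = np.zeros(N)
    for n in range(T):
        pre = Win @ u[n] + W @ x
        if Wback is not None and y is not None and n > 0:
            pre += Wback @ y[n - 1]          # optional output feedback
        x = np.tanh(pre)
        X[n] = x
    return X

def readout(X, Wout):
    """Linear readout of equation 2.2 with fout = identity: y(n) = Wout x(n)."""
    return X @ Wout.T
```

Only Wout is adapted during training; Win, W, and Wback stay fixed, which is exactly the constraint examined next.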
We submit - that the ideas of approximation theory in functional spaces (bases and pro- - jections), so useful in adaptive signal processing (Principe, 2001), should - be utilized to understand the ESN architecture. Leth(u(t)) be a real-valued - function of a real-valued vector - - u(t)=[u1 (t),u2 (t),...,uM (t)] T . - - In functional approximation, the goal is to estimate the behavior ofh(u(t)) - as a combination of simpler functionsϕi (t), called the basis functionals, - such that its approximant,hˆ(u(t)), is given by - - N - hˆ(u(t))= ai ϕi (t). - i=1 - - Here,ai ’s are the projections ofh(u(t)) onto each basis function. One of - the central questions in practical functional approximation is how to choose - the set of bases to approximate a given desired signal. In signal processing, - thechoicenormallygoesforacompletesetoforthogonalbasis,independent - of the input. When the basis set is complete and can be made as large - as required, fixed bases work wonders (e.g., Fourier decompositions). In - neural computing, the basic idea is to derive the set of bases from the - input signal through a multilayered architecture. For instance, consider a - single hidden layer TDNN withNPEs and a linear output. The hidden- - layer PE outputs can be considered a set of nonorthogonal basis functionals - dependent on the input, -   - - ϕi (u(t))=g bij uj (t). - j - - bij ’s are the input layer weights, andgis the PE nonlinearity. The approxi- - mation produced by the TDNN is then - - N - h ˆ(u(t))= ai ϕi (u(t)), (2.3) - i=1 - - whereai ’s are the weights of the output layer. Notice that thebij ’s adapt - the bases and theai ’s adapt the projection in the projection space. Here the - goal is to restrict the number of bases (number of hidden layer PEs) because - their number is coupled with the number of parameters to adapt, which - has an impact on generalization and training set size, for example. Usually, Analysis and Design of Echo State Networks 117 - - - since all of the parameters of the network are adapted, the best basis in the - joint (input and desired signals) space as well as the best projection can be - achieved and represents the optimal solution. The output of the TDNN is - a linear combination of its internal representations, but to achieve a basis - set (even if nonorthogonal), linear independence among theϕi (u(t))’s must - be enforced. Ito, Shah and Pon, and others have shown that this is indeed - the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside - the scope of this article. - The ESN (and the RNN) architecture can also be studied in this frame- - work. The states of equation 2.1 correspond to the basis set, which are - recursively computed from the input, output, and previous states through - Win ,W,andWback . Notice, however, that none of these weight matrices is - adapted, that is, the functional bases in the ESN are uniquely defined by the - input and the initial selection of weights. In a sense, ESNs are trading the - adaptive connections in the RNN hidden layer by a brute force approach - of creating fixed diversified dynamics in the hidden layer. - For an ESN with a linear readout network, the output equation (y(n+ - 1)=Wout x(n+1)) has the same form of equation 2.3, where theϕi ’s and - ai ’s are replaced by the echo states and the readout weights, respectively. - The readout weights are adapted in the training data, which means that the - ESN is able to find the optimal projection in the projection space, just like - the RNN or the TDNN. 
- A similar perspective of basis and projections for information processing - in biological networks has been proposed by Pouget and Sejnowski (1997). - They explored the possibility that the response of neurons in parietal cortex - serves as basis functions for the transformations from the sensory input - to the motor responses. They proposed that “the role of spatial represen- - tations is to code the sensory inputs and posture signals in a format that - simplifies subsequent computation, particularly in the generation of motor - commands”. - The central issue in ESN design is exactly the nonadaptive nature of - the basis set. Parameter sets in the reservoir that provide linearly inde- - pendent states and possess a given spectral radius may define drastically - different projection spaces because the correlation among the bases is not - constrained. A simple experiment was designed to demonstrate that the se- - lection of the ESN parameters by constraining the spectral radius is not the - most suitable for function approximation. Consider a 100-unit ESN where - the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let - the ESN generate the seventh power of the input signal. Different realiza- - tions of a randomly connected 100-unit ESN were constructed where the - entries ofWare set to 0.4,−0.4, and 0 with probabilities of 0.025, 0.025, - and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input - weights are set to+1or,−1 with equal probabilities, andWback is set to - zero. Input is applied for 300 time steps, and the echo states are calculated - using equation 2.1. The next step is to train the linear readout. One method 118 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - MSE for different realizations10 4 - - - - - - - - - 10 6 - - - - - - - - - 10 8 - - - - - 10 9 - 0 10 20 30 40 50 - Different realizations - - Figure 2: Performances of ESNs for different realizations ofWwith the same - weight distribution. The weight values are set to 0.4,−0.4, and 0 with proba- - bilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius - of 0.88. In the 50 realizations, MSEs vary from 5.9×10 −9 to 8.9×10 −5 . Results - show that for each set of random weights that provide the same spectral ra- - dius, the correlation or degree of redundancy among the bases will change, and - different performances are encountered in practice. - - - to determine the optimal output weight matrix,Wout , in the mean square - error (MSE) sense (where MSE is defined byO=1 (d−y)T (d−y)) is to use 2 the Wiener solution given by Haykin (2001): - - −1 1 - Wout =E[xx T ]−1 E[xd]∼ 1 - = x(n)x(n)T x(n)d(n) . (2.4) N Nn n - - Here,E[.] denotes the expected value operator, andddenotes the desired - signal. Figure 2 depicts the MSE values for 50 different realizations of - the ESNs. As observed, even though each ESN has the same sparseness - and spectral radius, the MSE values obtained vary greatly among differ- - ent realizations. The minimum MSE value obtained among the 50 realiza- - tions is 5.9x10 −9 , whereas the maximum MSE is 8.9x10 −5 . This experiment Analysis and Design of Echo State Networks 119 - - - demonstrates that a design strategy that is based solely on the spectral - radius is not sufficient to specify the system architecture for function ap- - proximation. 
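The experiment above is small enough to sketch end to end. The snippet below is a rough reproduction, not the authors' code: it reuses the esn_states helper sketched earlier, builds the sparse W described in the text (entries 0.4, −0.4, and 0 with probabilities 0.025, 0.025, and 0.95), and computes the readout with np.linalg.lstsq, which is numerically equivalent to the empirical Wiener solution of equation 2.4.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300

# Sparse reservoir as described in the text (spectral radius around 0.88).
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
Win = rng.choice([1.0, -1.0], size=(N, 1))

n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))[:, None]   # input signal
d = u[:, 0] ** 7                                     # desired: seventh power of the input

X = esn_states(u, Win, W)                            # echo states (equation 2.1)

# Equation 2.4 as a least-squares problem: Wout = argmin ||X Wout - d||^2.
Wout, *_ = np.linalg.lstsq(X, d, rcond=None)
print("training MSE:", np.mean((X @ Wout - d) ** 2))
```

Rerunning this with different seeds changes only the random realization of W, yet the MSE moves over several orders of magnitude, which is the point of Figure 2.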
This shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by

x(n+1) = f(Win u(n+1) + W x(n)).

Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n+1), defined by

J(n+1) = \begin{bmatrix} \dot{f}(net_1(n))w_{11} & \dot{f}(net_1(n))w_{12} & \cdots & \dot{f}(net_1(n))w_{1N} \\ \dot{f}(net_2(n))w_{21} & \dot{f}(net_2(n))w_{22} & \cdots & \dot{f}(net_2(n))w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \dot{f}(net_N(n))w_{N1} & \dot{f}(net_N(n))w_{N2} & \cdots & \dot{f}(net_N(n))w_{NN} \end{bmatrix} = \begin{bmatrix} \dot{f}(net_1(n)) & 0 & \cdots & 0 \\ 0 & \dot{f}(net_2(n)) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dot{f}(net_N(n)) \end{bmatrix} \cdot W = F(n) \cdot W.    (2.5)

Here, net_i(n) is the ith entry of the vector (Win u(n+1) + W x(n)), and w_ij denotes the (i,j)th entry of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the Jacobian matrix J(n+1).1 As the amplitude of each PE changes, the local slope changes, and so the poles of the linearized system are time varying, although the parameters of ESN are fixed.

Footnote 1: The transfer function of a linear system x(n+1) = Ax(n) + Bu(n) is X(z)/U(z) = (zI − A)^{-1} B = [Adjoint(zI − A) / det(zI − A)] B. The poles of the transfer function can be obtained by solving det(zI − A) = 0. The solution corresponds to the eigenvalues of A.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to their linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems.
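Equation 2.5 reduces to a diagonal scaling of W, so the time-varying poles can be sketched in a few lines; the function below assumes tanh PEs, for which the local slope is 1 − tanh²(net), and is an illustrative helper rather than part of the original letter.

```python
import numpy as np

def linearized_poles(u_next, x, Win, W):
    """Poles of the ESN linearized at the current operating point (equation 2.5).

    net_i(n) is the ith entry of Win u(n+1) + W x(n); for tanh PEs the local
    slope is f'(net) = 1 - tanh(net)^2, which fills the diagonal matrix F(n)."""
    net = Win @ u_next + W @ x
    F = np.diag(1.0 - np.tanh(net) ** 2)   # local slopes f'(net_i(n))
    J = F @ W                              # Jacobian J(n+1) = F(n) W
    return np.linalg.eigvals(J)            # time-varying poles of the linearized ESN
```

Evaluating this along a trajectory reproduces the qualitative behavior of Figure 3: large-amplitude inputs shrink the poles toward the origin, while near-zero inputs push them out toward the spectral radius of W.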
Similar results can be obtained using signals of different shapes at the ESN input.

Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input goes through a cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of sinusoidal signal with a period of 100. (B–E) The positions of poles of the linearized systems when the input values are at B, C, D, and E in Figure 3A. (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. An ESN with more states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems, when compared to their linear counterpart.

A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix, W. The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius due to the shape of the nonlinearity.2 The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschlager (2004), but here we approach it from a system theory perspective and explain its effect on reservoir dynamics.

Footnote 2: Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of W, where W = PDP^{-1}, P is the eigenvector matrix, and D is the diagonal matrix of eigenvalues (D_ii) of W. Since F(n) and D are diagonal, J(n+1) = F(n)W = F(n)(PDP^{-1}) = P(F(n)D)P^{-1} is the eigendecomposition of J(n+1). Here, each entry of F(n)D, \dot{f}(net_i(n))D_ii, is an eigenvalue of J. Therefore, |\dot{f}(net_i(n))D_ii| ≤ |D_ii| since \dot{f}(net_i) ≤ \dot{f}(0).

3 Average State Entropy as a Measure of the Richness of ESN Reservoir

Previous research was aware of the influence of diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs. Several metrics to quantify the diversity have been proposed (Jaeger, 2001; Maass et al., 2005). Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states.
Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous value of the ESN states. If the echo state's instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi's entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable X with a pdf f_X(x) is given by Renyi (1970):

H_\gamma(X) = \frac{1}{1-\gamma} \log E\left[ f_X^{\gamma-1}(X) \right].

Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

f_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i),

where K_σ is the kernel function with the kernel size σ. Then the Renyi's quadratic entropy can be estimated by (Principe et al., 2000)

H_2(X) = -\log \left( \frac{1}{N^2} \sum_j \sum_i K_\sigma(x_j - x_i) \right).    (3.1)

The instantaneous state entropy is estimated using equation 3.1 where the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T of an ESN with N internal PEs. Results will be shown with a gaussian kernel with kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter to quantify the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radius and even with the same spectral radius display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section built with three different spectral radii 0.2, 0.5, and 0.8 with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states as we would expect, since state entropy is dependent on the input signal that also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, echo state's instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states.
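A minimal sketch of the estimator in equation 3.1 and of ASE is given below, assuming a gaussian kernel and reading the kernel-size rule as 0.3 of the overall standard deviation of the state entries; the function names are illustrative.

```python
import numpy as np

def quadratic_entropy(x, sigma):
    """Renyi's quadratic entropy estimate of equation 3.1:
    H2(X) = -log( (1/N^2) * sum_j sum_i K_sigma(x_j - x_i) ), gaussian kernel."""
    diffs = x[:, None] - x[None, :]
    K = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(np.mean(K))             # np.mean gives the 1/N^2 double sum

def average_state_entropy(X):
    """ASE: the entropy of x(n) across the N PEs, averaged over time.
    X is the (T, N) matrix of echo states; kernel size is 0.3 of the state std."""
    sigma = 0.3 * np.std(X)
    return float(np.mean([quadratic_entropy(X[n], sigma) for n in range(len(X))]))
```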
- In practice, to quantify the overall representation ability over time, we will - use ASE, which takes values−0.735,−0.007, and 0.335 for the spectral - radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral - radius, several ASEs are possible. Figure 4C shows ASEs from 50 different - realizations of ESNs with the same spectral radius of 0.5, which means that - ASE is a finer descriptor of the dynamics of the reservoir. Although we - have presented an experiment with sinusoidal signal, similar results are - obtained for other inputs as long as the input dynamic range is properly - selected. - Maximizing ASE means that the diversity of the states over time is the - largest and should provide a basis set that is as uncorrelated as possible. - This condition is unfortunately not a guarantee that the ESN so designed - will perform the best, because the basis set in ESNs is created independent - of the desired response and the application may require a small spectral - radius. However, we maintain that when the desired response is not ac- - cessible for the design of the ESN bases or when the same reservoir is - to be used for a number of problems, the default strategy should be to - maximize the ASE of the state vector. The following section addresses - the design of ESNs with high ASE values and a simple mechanism to - adjust the reservoir dynamics without changing the recurrent connection - weights. - - 4 Designing Echo State Networks - - 4.1 Design of the Echo State Recurrent Connections.According to the - interpretation of ESNs as coupled linear systems, the design of the internal 124 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - connection matrix,W, will be based on the distribution of the poles of the - linearized system around zero state. Our proposal is to design the ESN - such that the linearized system has uniform pole distribution inside the - unit circle of thez-plane. With this design scenario, the system dynamics - will include uniform coverage of time constants arising from the uniform - distribution of the poles, which also decorrelates as much as possible the - basis functionals. This principle was chosen by analogy to the identification - oflinearsystemsusingKautzfilters(Kautz,1954),whichshowsthatthebest - approximation of a given transfer function by a linear system with finite - order is achieved when poles are placed in the neighborhood of the spectral - resonances. When no information is available about the desired response, - we should uniformly spread the poles to anticipate good approximation to - arbitrary mappings. - We again use a maximum entropy principle to distribute the poles inside - the unit circle uniformly. The constraints of a circle as boundary conditions - for discrete linear systems and complex conjugate locations are easy to - include for the pole distribution (Thogula, 2003). The poles are first initial- - ized at random locations; the quadratic Renyi’s entropy is calculated by - equation 3.1, and poles are moved such that the entropy of the new dis- - tribution is increased over iterations (Erdogmus & Principe, 2002). This - method is efficient to find uniform coverage of the unit circle with an arbi- - trary number of poles. The system with the uniform pole locations can be - interpreted using linear system theory. The poles that are close to the unit - circle correspond to many sharp bandpass filters specializing in different - frequency regions, whereas the inner poles realize filters of larger frequency - support. 
Moreover, different orientations (angles) of the poles create filters of different center frequencies.

Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and 0.8, from top to bottom, respectively. The diversity of echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with smaller spectral radius can generate only uneven representations, while for a spectral radius of 0.8, outputs of echo states almost uniformly distribute within their dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1. Information contained in the echo states is changing over time according to the input amplitude. Therefore, the richness of representation is controlled by the input amplitude. Moreover, the value of ASE increases with spectral radius. (C) ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir.

Now the problem is to construct an internal weight matrix from the pole locations (eigenvalues of W). In principle, we would like to create a sparse matrix, so we started with the sparsest matrix (with an inverse), which is the direct canonical structure given by (Kailath, 1980)

W = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.    (4.1)

The characteristic polynomial of W is

l(s) = \det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2) \cdots (s - p_N),    (4.2)

where p_i's are the eigenvalues and a_i's are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2, we can obtain the characteristic polynomial and construct the W matrix in the canonical form using equation 4.1. We will call the ESN constructed based on the uniform pole principle ASE-ESN. All other possible solutions with the same eigenvalues can be obtained by Q^{-1}WQ, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN designed with the recurrent weight matrix having the eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with random internal connection weight matrices. We will consider an ESN with 30 states and use our procedure to create the W matrix for ASE-ESN for different spectral radii between [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices with different sparseness constraints. This corresponds to a weight distribution having the values 0, c, and −c with probabilities p_1, (1 − p_1)/2, and (1 − p_1)/2, where p_1 defines the sparseness of W and c is a constant that takes a specific value depending on the spectral radius.
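The construction in equations 4.1 and 4.2 maps a prescribed pole set to a companion-form W through its characteristic polynomial; a minimal sketch follows. The uniform_disk_poles helper is only a crude stand-in for the entropy-maximization procedure described above (it draws area-uniform poles in complex-conjugate pairs so the polynomial coefficients stay real), and both function names are illustrative.

```python
import numpy as np

def companion_from_poles(poles):
    """Direct canonical form of equation 4.1 whose eigenvalues are the given poles.
    np.poly returns the characteristic-polynomial coefficients of equation 4.2."""
    a = np.real(np.poly(poles))[1:]        # [a1, a2, ..., aN]
    N = len(a)
    W = np.zeros((N, N))
    W[0, :] = -a                           # first row: -a1 ... -aN
    W[1:, :-1] = np.eye(N - 1)             # ones on the subdiagonal
    return W

def uniform_disk_poles(N, radius, rng):
    """Roughly uniform poles inside a disk, in complex-conjugate pairs."""
    r = radius * np.sqrt(rng.uniform(size=N // 2))    # area-uniform radii
    theta = rng.uniform(0.0, np.pi, size=N // 2)
    upper = r * np.exp(1j * theta)
    return np.concatenate([upper, np.conj(upper)])

rng = np.random.default_rng(1)
W = companion_from_poles(uniform_disk_poles(30, 0.9, rng))
print(np.max(np.abs(np.linalg.eigvals(W))))           # spectral radius, bounded by 0.9
```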
We also created - Wmatrices with values uniformly distributed between−1 and 1 (U-ESN) - and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then, - for differentWin matrices, we run the ASE-ESNs with the sinusoidal input - given in section 3 and calculate ASE. Figure 5 compares the ASE values - averaged over 1000 realizations. As observed from the figure, the ASE-ESN - with uniform pole distribution generates higher ASE on average for all - spectral radii compared to ESNs with sparse and uniform random connec- - tions. This approach is indeed conceptually similar to Jeffreys’ maximum - entropy prior (Jeffreys, 1946): it will provide a consistently good response - for the largest class of problems. Concentrating the poles of the linearized Analysis and Design of Echo State Networks 127 - - - 1 - ASEESN - 0.8 UESN - sparseness=0.2 - 0.6 sparseness=0.1 - sparseness=0.07 - 0.4 - - ASE 0.2 - - 0 - - - 0.2 - - - 0.40 0.2 0.4 0.6 0.8 1 - Spectral radius - - Figure 5: Comparison of ASE values obtained for ASE-ESN havingWwith - uniform eigenvalue distribution, ESNs with randomWmatrix, and U-ESN - with uniformly distributed weights between−1 and 1. Randomly generated - weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the - networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole - distribution generates a higher ASE on average for all spectral radii compared - to ESNs with random connections. - - - system in certain regions of the space provides good performance only if - the desired response has energy in this part of the space, as is well known - from the theory of Kautz filters (Kautz, 1954). - - 4.2 Design of the Adaptive Bias.In conventional ESNs, only the out- - put weights are trained, optimizing the projections of the desired response - onto the basis functions (echo states). Since the dynamical reservoir is fixed, - the basis functions are only input dependent. However, since function ap- - proximation is a problem in the joint space of the input and desired signals, - a penalty in performance will be incurred. From the linearization analysis - that shows the crucial importance of the operating point of the PE non- - linearity in defining the echo state dynamics, we propose to use a single - external adaptive bias to adjust the effective spectral radius of an ESN. No- - tice that according to linearization analysis, bias can reduce only spectral - radius. The information for adaptation of bias is the MSE in training, which - modulates the spectral radius of the system with the information derived - from the approximation error. With this simple mechanism, some informa- - tionfromtheinput-outputjointspaceisincorporatedinthedefinitionofthe - projection space of the ESN. The beauty of this method is that the spectral 128 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - radius can be adjusted by a single parameter that is external to the system - without changing reservoir weights. - The training of bias can be easily accomplished. Indeed, since the pa- - rameter space is only one-dimensional, a simple line search method can be - efficiently employed to optimize the bias. Among different line search al- - gorithms, we will use a search that uses Fibonacci numbers in the selection - of points to be evaluated (Wilde, 1964). The Fibonacci search method min- - imizes the maximum number of evaluations needed to reduce the interval - of uncertainty to within the prescribed length. 
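Because the bias is a single scalar, even a naive one-dimensional search works; the sketch below uses a plain grid search as a stand-in for the Fibonacci search described above, reuses the esn_states helper from the earlier sketch, and treats the MSE of the retrained readout as the objective. The function names and the candidate range are illustrative.

```python
import numpy as np

def bias_mse(b, u, d, Win, W):
    """Train the readout for a fixed bias b and return the training MSE.
    For a single input, Win(u + b) = Win u + Win b, i.e. the biased update."""
    X = esn_states(u + b, Win, W)                  # reservoir run with the bias
    Wout, *_ = np.linalg.lstsq(X, d, rcond=None)   # optimal readout (equation 2.4)
    return np.mean((X @ Wout - d) ** 2)

def search_bias(u, d, Win, W, candidates=np.linspace(0.0, 3.0, 31)):
    """One-dimensional search over the bias, with the readout MSE as objective."""
    errors = [bias_mse(b, u, d, Win, W) for b in candidates]
    return candidates[int(np.argmin(errors))]
```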
In our problem, a bias value is picked according to Fibonacci search. For each value of bias, training data are applied to the ESN, and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value.

Alternatively, gradient-based methods can be utilized to optimize the bias, due to simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

x(n+1) = f(Win u(n+1) + Win b + W x(n)).

The update equation for b is given by

\frac{\partial O(n+1)}{\partial b} = -e \cdot Wout \times \frac{\partial x(n+1)}{\partial b}    (4.3)
= -e \cdot Wout \times \left[ \dot{f}(net_{n+1}) \cdot \left( W \times \frac{\partial x(n)}{\partial b} + Win \right) \right].    (4.4)

Here, O is the MSE defined previously. This algorithm may suffer from similar problems observed in gradient-based methods in recurrent networks training. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of the two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.

5 Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n−k), for a given delay k. Denoting the optimal output signal y_k(n), the k-delay STM capacity of a network, MC_k, is defined as a squared correlation coefficient between u(n−k) and y_k(n) (Jaeger, 2002a). The STM capacity, MC, of the network is defined as \sum_{k=1}^{\infty} MC_k. STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input by an N unit RNN with linear output units is bounded by N.

We use ESNs with 20 PEs and a single input unit. ESNs are driven by an i.i.d. random input signal, u(n), that is uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, u(n−1), ..., u(n−40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47, −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or −0.1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0 (Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples corresponding to initial transient are eliminated. Then the output weight matrix is calculated using equation 2.4.
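For reference, the memory-capacity measure used in this experiment can be sketched as follows, reusing the esn_states helper; training and evaluation are done on the same sequence here for brevity, whereas the text evaluates MC_k on a separate test signal. The helper name is illustrative.

```python
import numpy as np

def memory_capacity(u, Win, W, max_delay=40, washout=100):
    """MC = sum_k MC_k, where MC_k is the squared correlation between u(n-k)
    and the output of a readout trained to reproduce u(n-k) (equation 2.4)."""
    X = esn_states(u[:, None], Win, W)[washout:]
    mc = 0.0
    for k in range(1, max_delay + 1):
        target = u[washout - k: len(u) - k]             # u(n - k), aligned with X
        Wout, *_ = np.linalg.lstsq(X, target, rcond=None)
        mc += np.corrcoef(X @ Wout, target)[0, 1] ** 2  # MC_k
    return mc

rng = np.random.default_rng(2)
u = rng.uniform(-0.5, 0.5, size=1000)                   # i.i.d. input as in the text
# memory_capacity(u, Win, W) can then be compared across the four reservoir designs.
```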
For the BASE-ESN, the bias is trained for each task. All networks are - run with a test input signal, and the corresponding output andMC k are - calculated. Figure 6 shows thek-delay STM capacity (averaged over 100 - trials) of each ESN for delays 1,...,40 for the test signal. The STM capac- - ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, - and 16.90, respectively. First, ESNs with uniform pole distribution (ASE- - ESN and BASE-ESN) haveMCs that are much longer than the randomly - generated ESN given in Jaeger (2002a) in spite of all having the same spec- - tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical - maximumvalueofN=20.AcloserlookatthefigureshowsthatR-ESNper- - forms slightly better than ASE-ESN for delays less than 9. In fact, for small - k, large ASE degrades the performance because the tasks do not need long - memory depth. However, the drawback of high ASE for smallkis recov- - ered in BASE-ESN, which reduces the ASE to the appropriate level required - for the task. Overall, the addition of the bias to the ASE-ESN increases the - STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly - better STM compared to R-ESN with only three different weight values, - although it has more distinct weight values compared to R-ESN. It is also - significant to note that theMCwill be very poor for an ESN with smaller - spectral radius even with an adaptive bias, since the problem requires large - ASE and bias can only reduce ASE. This experiment demonstrates the 130 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - 1 RESN - UESN - ASEESN0.8 BASEESN - - - - - - - Memory Capacity 0.6 - - - 0.4 - - - 0.2 - - - 0 - 0 10 20 30 40 - Delay - - Figure 6: Thek-delay STM capacity of each ESN for delays 1,...,40 computed - using the test signal. The results are averaged over 100 different realizations of - each ESN type with the specifications given in the text for differentWandWin - matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are - 13.09, 13.55, 16.70, and 16.90, respectively. - - - suitability of maximizing ASE in tasks that require a substantial memory - length. - - 5.2 Binary Parity Check.The effect of the adaptive bias was marginal - in the previous experiment since the nature of the problem required large - ASE values. However, there are tasks in which the optimal solutions re- - quire smaller ASE values and smaller spectral radius. Those are the tasks - where the adaptive bias becomes a crucial design parameter in our design - methodology. - Consider an ESN with 100 internal units and a single input unit. ESN is - drivenbyabinaryinputsignal,u(n),thatassumesthevalues0or1.Thegoal - is to train an ESN to generate them-bit parity corresponding to lastmbits - received, wheremis 3,...,8. Similar to the previous experiments, we used - the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly - connected ESN where the entries ofWmatrix are set to 0, 0.06,−0.06 - with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse - connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN - are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 - with equal probability, and direct connections from the input to the output - are allowed whereasWback is set to 0. 
The echo states are calculated using - equation 2.1 for 1000 samples of the input signal, and the first 100 samples - correspondingtotheinitialtransientareeliminated.Thentheoutputweight Analysis and Design of Echo State Networks 131 - - - 350 - - 300 - - 250 - - - - - - - Wrong Decisions 200 - - 150 - - 100 - ASEESN50 RESN - BASEESN0 - 3 4 5 6 7 8 - m - - Figure 7: The number of wrong decisions made by each ESN form=3,...,8 - in the binary parity check problem. The results are averaged over 100 differ- - ent realizations of R-ESN, ASE-ESN, and BASE-ESN for differentWandWin - matrices with the specifications given in the text. The total numbers of wrong - decisions form=3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and - 699. - - - - matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias - is trained for each task. The binary decision is made by a threshold detector - that compares the output of the ESN to 0.5. Figure 7 shows the number of - wrong decisions (averaged over 100 different realizations) made by each - ESN form=3,...,8. - The total numbers of wrong decisions form=3,...,8 of R-ESN, ASE- - ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs - poorly since the nature of the problem requires a short time constant for - fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the - R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. - BASE-ESN performs a lot better than ASE-ESN and slightly better than - the R-ESN since the adaptive bias reduces the spectral radius effectively. - Note that form=7 and 8, the ASE-ESN performs similar to the R-ESN, - since the task requires access to longer input history, which compromises - the need for fast response. Indeed, the bias in the BASE-ESN takes effect - when there are errors (m>4) and when the task benefits from smaller - spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and - 2.7 form=3, 4, 5, and 6, respectively. Form=7 or 8, there is a wide - range of bias values that result in similar MSE values (between 0 and 3). In 132 M. Ozturk, D. Xu, and J. Pr´ıncipe - - - summary, this experiment clearly demonstrates the power of the bias signal - to configure the ESN reservoir according to the mapping task. - - 5.3 System Identification.This section presents a function approxima- - tion task where the aim is to identify a nonlinear dynamical system. The - unknown system is defined by the difference equation - - y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n)), - - where - - f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu). - - The input to the system is chosen to be sin(2πn/25). - We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with - 30 internal units and a single input unit. TheWmatrix of each ESN is scaled - suchthatithasaspectralradiusof0.95.R-ESNisarandomlyconnectedESN - where the entries ofWmatrix are set to 0, 0.35,−0.35 with probabilities 0.8, - 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or−1 with - equal probability, and direct connections from the input to the output are - allowed,whereasWback issetto0.Theoptimaloutputweightsarecalculated - using equation 2.4. The MSE values (averaged over 100 realizations) for R- - ESN and ASE-ESN are 1.23x10 −5 and 1.83x10 −6 , respectively. The addition - of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10 −6 - to 3.27x10 −9 . 
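The plant in this section is easy to simulate, which makes the experiment straightforward to reproduce; the following is a small sketch of the data generation only (the ESN and readout follow the earlier sketches), with an illustrative function name.

```python
import numpy as np

def nonlinear_plant(T):
    """System of section 5.3: y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n)),
    f(u) = 0.6 sin(pi u) + 0.3 sin(3 pi u) + 0.1 sin(5 pi u),
    driven by u(n) = sin(2 pi n / 25)."""
    n = np.arange(T)
    u = np.sin(2 * np.pi * n / 25)
    y = np.zeros(T)
    for t in range(2, T):
        f_u = (0.6 * np.sin(np.pi * u[t - 1])
               + 0.3 * np.sin(3 * np.pi * u[t - 1])
               + 0.1 * np.sin(5 * np.pi * u[t - 1]))
        y[t] = 0.3 * y[t - 1] + 0.6 * y[t - 2] + f_u
    return u, y
```

The identification task is then to train the readout of equation 2.4 to map the echo states driven by u onto y.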
6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structure without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, with training limited to the linear output layer. However, the literature did not elucidate how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be as large as possible to achieve the smallest correlation among the bases and to be able to cope with arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method when applied to ESNs built from sigmoidal nonlinearities is that it allows fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that designing the ESN by maximizing ASE and adapting the spectral radius with the bias provides consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferable to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.

Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems "at the edge of chaos" (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with "critical" parameter values, which correlate with a phase transition between ordered and chaotic regimes.
Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior through the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.

Our emphasis here is mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specifying the bases that create the projection space. At the same time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed-weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of output feedback in the network is to bias the system to different regions of dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, these results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs, but we also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem.

There are many interesting issues to be researched in this exciting new area. Besides serving as an evaluation tool, ASE may also be utilized to train the ESN's representation layer in an unsupervised fashion. In fact, we can easily adapt, with the SIG (stochastic information gradient) described in Erdogmus, Hild, and Principe (2003), extra weights linking the outputs of the recurrent states so as to maximize output entropy. Output entropy maximization is a well-known metric to create independent components (Bell & Sejnowski, 1995), and here it means that the echo states will become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would fine-tune continuously, in an unsupervised manner, the parameters of the ESN among different inputs. However, it goes against the idea of a fixed ESN reservoir.

The reservoir of recurrent PEs can be thought of as a new form of time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs at the peaks of the input, which is very appealing for signal processing and seems to be used in biology. However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the ESN correlated basis is in the design of the readout.
Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of the modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1-norm penalty in the LMS (Rao et al., 2005) show great promise for solving this problem.

Finally, we would like to briefly comment on the implications of these models for neurobiology and computational neuroscience. The work by Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (the output of the biological system) needs to be generated, this simple computation to read out the neuronal activity is done. There is an intriguing similarity between the interpretation of the neuronal activity by Pouget and Sejnowski and our interpretation of echo states in ESNs. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally low-pass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESNs with sigmoid PEs.

Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.

References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in the time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89).
New - York: Van Nostrand Reinhold. - Wilde, D. J. (1964).Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall. - Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running - fully recurrent neural networks.Neural Computation, 1, 270–280. - - - Received December 28, 2004; accepted June 1, 2006. \ No newline at end of file diff --git a/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt b/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt deleted file mode 100644 index 430d70b..0000000 Binary files a/Corpus/Bayesian Compression for Deep Learning - Christos Louizos.txt and /dev/null differ diff --git a/Corpus/CORPUS.txt b/Corpus/CORPUS.txt index 9fde5e3..23fb525 100644 --- a/Corpus/CORPUS.txt +++ b/Corpus/CORPUS.txt @@ -1,3 +1,5 @@ +<> <> <> + Neural Ordinary Differential Equations Ricky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud @@ -738,6 +740,10 @@ an ODESolve model: <
> +<> <> <> + + +<> <> <> Learning differential equations that are easy to solve @@ -749,7 +755,7 @@ an ODESolve model: mattjj@google.com duvenaud@cs.toronto.edu - Abstract + Abstract Differential equations parameterized by neural networks become expensive to solve @@ -843,7 +849,7 @@ to time, integrated along the entire solution trajectory: where k·k2 is the squared `2 norm, and the dependence on the dynamics parameters θ is implicit through the solution z(t) integrating dz(t) - dt = f (z(t), t, θ). + <
>. During training, we weigh this regularization term by a hyperparameter λ and add it to our original loss to get our regularized objective: @@ -905,39 +911,39 @@ causes an unnecessary exponential slowdown, costing O(exp(K)). This is because e appear in lower derivatives also appear in higher derivatives, but the work to compute is not shared across orders. Taylor Mode Taylor-mode AD generalizes Function Taylor propagation rule -first-order forward mode to compute the first y = z + cw y[k] = z[k] + cw[k] -K derivatives exactly with a time cost of only Pk -O(K 2 ) or O(K log K), depending on the op- y =z∗w y[k] = h j=0 z[j] w[k−j] i - Pk−1 -erations involved. Instead of providing rules y = z/w y[k] = w10 zk − j=0 z[j] w[k−j] -for propagating perturbation vectors, one pro- Pk - y = exp(z) ỹ[k] = j=1 y[k−j] z̃[j] -vides rules for propagating truncated Taylor Pk -series. Some example rules are shown in ta- s = sin(z) s̃[k] = j=1 z̃[j] c[k−j] - Pk -ble 1. For more details see the Appendix and c = cos(z) c̃[k] = j=1 −z̃[j] s[k−j] +first-order forward mode to compute the first <> <> +K derivatives exactly with a time cost of only <> +O(K 2 ) or O(K log K), depending on the op- <> << y[k] = h j=0 z[j] w[k−j] i>> + <> +erations involved. Instead of providing rules <> <> +for propagating perturbation vectors, one pro- <> + <> <<ỹ[k] = j=1 y[k−j] z̃[j]>> +vides rules for propagating truncated Taylor <> +series. Some example rules are shown in ta- <> <> + <> +ble 1. For more details see the Appendix and <> <> Griewank & Walther (2008, Chapter 12). We provide an open source implementation of Table 1: Rules for propagating Taylor polynomial Taylor mode AD in the JAX Python library coefficients through standard functions. These rules (Bradbury et al., 2018). generalize standard first-order derivatives. Notation - z[i] = i!1 zi and ỹ[i] = i!i zi . + <> and <<ỹ[i] = i!i zi>>. 5 Experiments - 100 10 + We consider three different tasks in which continuous- - Training Error (%) -depth or continuous time models might have computa- λ = 3.02e-03 - Average NFE - 80 λ=0 + +depth or continuous time models might have computa- + + tional advantages over standard discrete-depth models: -supervised learning, continuous generative modeling of 5 -time-series (Rubanova et al., 2019), and density estima- 60 +supervised learning, continuous generative modeling of +time-series (Rubanova et al., 2019), and density estima- tion using continuous normalizing flows (Grathwohl et al., 2019). Unless specified otherwise, we use the standard - 40 + dopri5 Runge-Kutta 4(5) solver (Dormand & Prince, -1980; Shampine, 1986). 0 50 100 150 - Training Epoch +1980; Shampine, 1986). <
> + 5.1 Supervised Learning Figure 3: Number of function evalua- tions (NFE) and training error during We construct a model for MNIST classification: it takes in training. Speed regularization (solid) @@ -946,16 +952,15 @@ dynamics given by a simple MLP, then applies a linear without su classification layer. In fig. 3 we compare the NFE and ing error. training error of a model with and without regularizing R3 . - 5 5 + 5.2 Continuous Generative Time Series Models -As in Rubanova et al. (2019), we use the Latent ODE z20 0 z20 0 + +As in Rubanova et al. (2019), we use the Latent ODE architecture for modelling trajectories of ICU patients using the PhysioNet Challenge 2012 dataset (Silva -et al., 2012). This variational autoencoder architec- −5 −5 -ture uses an RNN recognition network, and models −5 0 5 −5 0 5 - z1 z1 +et al., 2012). This variational autoencoder architec- +ture uses an RNN recognition network, and models the state dynamics using an ODE in a latent space. - (a) Unregularized (b) Regularized In the supervised learning setting described in the previous section only the final state affects model pre- Figure 4: Regularizing dynamics in a la- dictions. In contrast, time-series models’ predictions tent ODE modeling PhysioNet clinical data. @@ -965,8 +970,9 @@ might expect speed regularization to be ineffective duce average NFE from 281 to due to these extra constraints on the dynamics. How- incurring an 8% increase in loss. ever, fig. 4 shows that, without changing their overall shape the latent dynamics can be adjusted to reduce their NFE by a factor of 3. - 4 + 5.3 Density Estimation with Continuous Normalizing Flows + Our third task is unsupervised density estimation, using a scalable variant of continuous normalizing flows called FFJORD (Grathwohl et al., 2019). We fit the MINIBOONE tabular dataset from Papamakarios et al. (2017) and the MNIST image dataset (LeCun et al., 2010). We use the respective @@ -976,14 +982,9 @@ become prohibitively expensive throughout training. Table 2 shows that we can re for only a 0.6% increase in log-likelihood measured in bits/dim. How to train your Neural ODE We compare against the approach of Finlay et al. (2020), who design two regularization terms specifically for stabilizing the dynamics of FFJORD models: - Z t1 - 2 - K(θ) = kf (z(t), t, θ)k2 dt (3) - t0 - Zt1 - 2 - B(θ) = k| ∇z f (z(t), t, θ)k2 dt,  ∼ N (0, I) (4) - t0 + + <> + The first term is designed to encourage straight-line paths, and the second, stochastic, term is designed to reduce overfitting. Finlay et al. (2020) used fixed-step solvers during training for some datasets. We compare these two regularization on training with each of adaptive and fixed-step solvers, and @@ -994,24 +995,15 @@ What does the trade off between accuracy and speed look like? Ideally, we could time a lot without substantially reducing model performance. Indeed, this is demonstrated in all three settings we explored. Figure 5 shows that generally, model performance starts getting substantially worse only after a 50% reduction in solver speed when controlling R2 . - ×10−3 103 - 17.0 - Unregularized Loss Regularization (λ) - 0.25 - 3.4 - 10−1 - 0.15 13.5 - 3.2 10−5 - 0.05 - 10.0 - 3.0 10−9 - 30 60 90 0 75 150 225 300 80 120 160 - Average NFE Average NFE Average NFE - (a) MNIST Classification (b) PhysioNet Time-Series (c) Miniboone Density Estimation + + <
> + Figure 5: Tuning the regularization of R2 trades off between training loss and solver speed in three different applications of neural ODEs. Horizontal axes show average number of function evaluations, and vertical axes show unregularized training loss, both at the end of training. + 6.2 Order of regularization vs. order of solver + Which order of total derivatives should we regularize for a particular solver? As mentioned earlier, we conjecture that the best choice would be to match the order of the solver being used. Regularizing too low an order might needlessly constrain the dynamics and make it harder to fit the data, while @@ -1026,7 +1018,6 @@ orders above K = 3 gave little benefit. <
> - <
> Figure 7 investigates the relationship between RK and the quantity it is meant to be a surrogate for: NFE. We observe a clear monotonic relationship between the two, for all orders of solver and regularization. @@ -1061,6 +1052,7 @@ Although the field of numerical ODE solvers is extremely mature, as far as we kn been almost no work specifically on tuning differential equations to be faster to solve. The closest <
> + Figure 8: Figure 8c We observe that the actual solver error is about equally well-calibrated for regularized dynamics as random dynamics, indicating that regularization does not make the solver overconfident. Figure 8b: There is negligible overfitting of solver speed. ??: Speed regularization @@ -1220,14 +1212,13 @@ known, unhelpfully, as Tensor coefficients. For a sufficiently smooth vector valued function f : Rn → Rm and the polynomial - x(t) = x[0] + x[1] t + x[2] t2 + x[3] t3 + · · · + x[d] td ∈ Rn (5) + << x(t) = x[0] + x[1] t + x[2] t2 + x[3] t3 + · · · + x[d] td ∈ Rn>> (5) we are interested in the d-truncated Taylor expansion - y(t) = f (x(t)) + O(td+1 ) (6) - + <> (6) - ≡ y[0] + y[1] t + y[2] t + y[3] t + · · · + y[d] t ∈ R (7) + <<≡ y[0] + y[1] t + y[2] t + y[3] t + · · · + y[d] t ∈ R >> (7) with the notation that <> is the Taylor coefficient, which is the normalized derivative coefficient. @@ -1280,11 +1271,11 @@ eqs. (16) to (18) and (20), involve terms previously computed for lower order te In general, it will be useful to consider that the yk derivative coefficients is a function of all lower order input derivatives - yk = yk (x0 , . . . , xk ). (22) + <>. (22) We provide the API to compute this in JAX by indexing the k-output of jet - yk = jet(f, x0 , (x1 , . . . , xk ))[k]. + <>. A.2 Relationship with Differential Equations @@ -1311,6 +1302,7 @@ can use jet and the relationship xk+1 = yk to recursively compute the coefficien polynomial. Algorithm 1 Taylor Coefficients for ODE Solution by Recursive Jet + <> A.3 Regularizing Taylor Terms @@ -1341,6 +1333,7 @@ optimize our model we only need to compute the gradient of the regularization te method gives the gradient of the ODE solution as a solution to an augmented ODE. <
> + Figure 9: Left: The dynamics and a trajectory of a neural ODE trained on a toy supervised learning problem. The dynamics are poorly approximated by a 6th-order local Taylor series, and requires 92 NFE by a solve by a 5th-order Runge-Kutta solver. Right: Regularizing the 6th-order derivatives of @@ -1350,14 +1343,14 @@ trajectories gives dynamics that are easier to solve numerically, requiring only The dynamics function f : Rd × R → Rd is given by an MLP as follows - z1 = σ(x) - h1 = W1 [z1 ; t] + b1 - z2 = σ(h1 ) - y = W2 [z2 ; t] + b2 + <> + <

> + <> + <> -Where [·; ·] denotes concatenation of a scalar onto a column vector. The parameters are W1 ∈ -Rh×d , b1 ∈ Rh and W2 ∈ Rd×h , b2 ∈ Rd . Here we use 100 hidden units, i.e. h = 100. We have -d = 784, the dimension of an MNIST image. +Where <<[·; ·]>> denotes concatenation of a scalar onto a column vector. The parameters are <>, <> and <> , <> . Here we use 100 hidden units, i.e.<< h = 100>>. We have +<>, the dimension of an MNIST image. We train with a batch size of 100 for 160 epochs. We use the standard training set of 60,000 images, and the standard test set of 10,000 images as a validation/test set. We optimize our model using SGD with momentum with β = 0.9. Our learning rate schedule is 1e-1 for the first 60 epochs, 1e-2 until @@ -1408,6 +1401,7 @@ FFJORD were trained with double precision for purposes of reproducibility. <
> + Figure 10: The difference in NFE is tracked by the variance of NFE. In fig. 10 we note that there is a striking correspondence in the variance of NFE across individual @@ -1427,9 +1421,11 @@ similarly on the time-series modelling task we see that we get a similar pareto compared to IWAE loss. The pareto curves are plotted for R3 , R2 respectively. <
> + Figure 11: MNIST Classification <
> + Figure 12: Physionet Time-Series C.3 Wall-clock Time @@ -1448,6 +1444,7 @@ and an estimate of <> Table 3: Classification on MNIST + <> These are combined with a weighted average and integrated along the solution trajectory. @@ -1471,8 +1468,14 @@ while Finlay et al. (2020) penalizes the respective norms of the matrix ∇z f ( f (z(t), t) separately. Table 4: Density Estimation on Tabular Data (MINIBOONE) + <
> +<> <> <> + + +< <> <> + How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization @@ -1501,7 +1504,8 @@ f (z(t), t) separately. brings neural ODEs closer to practical relevance in large-scale applications. - <
> + <
> + Figure 1. Optimal transport map and a generic normalizing flow. Indeed, it was observed that there is a striking similarity @@ -1569,19 +1573,19 @@ by formulating the forward pass of a deep network as the Suppose we are g solution of a ordinary differential equation. Initial work such as the normal distribution. Change of variables tells us along these lines was motivated by the similarity of the eval- that the distribution pθ (x) may be evaluated through uation of one layer of a ResNet and the Euler discretization -of an ODE. Suppose the block in the t-th layer of a ResNet log pθ (x) = log q (z(x, T )) + log det | ∇ z(x, T )| (2) +of an ODE. Suppose the block in the t-th layer of a ResNet <> (2) is given by the function f (x, t; θ), where θ are the block’s parameters. Then the evaluation of this layer of the ResNet Evaluating the log determinant of the Jacobian is difficult. is simply xt+1 = xt + f (xt , t; θ). Now, instead consider Grathwohl et al. (2019) exploit the following identity from the following ODE fluid mechanics (Villani, 2003, p 114) - <> (ODE) log det | ∇ z(x, t)| = div (f ) (z(x, t), t)) (3) + <> (ODE) <> (3) -The Euler discretization of this ODE with step-size τ is where div (·) is the divergence operator, div (f ) (x) = -zt+1 = zt + τ f (zt , t; θ), which is nearly identical to the i ∂xi fi (x). By the fundamental theorem of calculus, we +The Euler discretization of this ODE with step-size <<τ>> is where <> is the divergence operator, <>, which is nearly identical to the i ∂xi fi (x)>>. By the fundamental theorem of calculus, we forward evaluation of the ResNet’s layer (setting step-size 1 In the normalizing flow literature divergence is typically writ- -τ = 1 gives equality). Armed with this insight, Chen et al. ten explicitly as the trace of the Jacobian, however we use div (·) +<<τ = 1>> gives equality). Armed with this insight, Chen et al. ten explicitly as the trace of the Jacobian, however we use div (·) (2018) suggested a method for training neural networks which is more common elsewhere. <
> @@ -1590,31 +1594,26 @@ Figure 2. Log-likelihood (measured in bits/dim) on the validation set as a funct may then rewrite (2) in integral form From this simple motivating example, the need for regular- ity of the vector field is apparent. Without placing demands - Z T on the vector field f , it is entirely possible that the learned - log pθ (x) = log q (z(x, T )) + div (f ) (z(x, s), s) ds - 0 + <> dynamics will be poorly conditioned. This is not just a theo- (4) retical exercise: because the dynamics must be solved with Remark 2.1 (Divergence trace estimate). In (Grathwohl a numerical integrator, poorly conditioned dynamics will et al., 2019), the divergence is estimated using an unbiased lead to difficulties during numerical integration of (ODE). Monte-Carlo trace estimate (Hutchinson, 1990; Avron & Indeed, later we present results demonstrating a clear corre- Toledo, 2011), lation between the number of time steps an adaptive solver - takes to solve (ODE), and the regularity of f . -  T  - div (f ) (x) = E  ∇ f (x) (5) How can the regularity of the vector field be measured? One - ∼N (0,1) + takes to solve (ODE), and the regularity of f .  + <> (5) How can the regularity of the vector field be measured? One motivating approach is to measure the force experienced by a particle z(t) under the dynamics generated by the vector By using the substitution (4), the task of maximizing log- field f , which is given by the total derivative of f with likelihood shifts from choosing pθ to minimize (1), to learn- respect to time ing the flow generated by a vector field f . This results in a -normalizing flow with a free-form Jacobian and reversible df (z, t) ∂f (z, t) -dynamics, and was named FFJORD by Grathwohl et al.. = ∇ f (z, t) · ż + (6) - dt ∂t - ∂f (z, t) -2.2. The need for regularity = ∇ f (z, t) · f (z, t) + (7) - ∂t +normalizing flow with a free-form Jacobian and reversible +dynamics, and was named FFJORD by Grathwohl et al.. <> (6) + +2.2. The need for regularity <> (7) + The vector field learned through FFJORD that maximizes Well conditioned flows will place constant, or nearly con- the log-likelihood is not unique, and raises troubling prob- stant, force on particles as they travel. Thus, in this work we lems related to the regularity of the flow. For a simple propose regularizing the dynamics with two penalty terms, @@ -1666,8 +1665,8 @@ minimizing tary materials. and continuous normalizing flows is apparent: the optimal transport problem (11) is a regularized form of the continu- <> (9b) ous normalizing flow optimization problem (1). We there- - ρ0 (x) = p, (9c) fore expect that adding a kinetic energy regularization term - ρT (z) = q. (9d) to FFJORD will encourage solution trajectories to prefer + <<ρ0 (x) = p>>, (9c) fore expect that adding a kinetic energy regularization term + <<ρT (z) = q>>. (9d) to FFJORD will encourage solution trajectories to prefer straight lines with constant speed. The objective function (18a) is a measure of the kinetic energy of the flow. The constraint (18b) ensures probability @@ -1694,6 +1693,7 @@ FFJORD <> <
> + Figure 3. Number of function evaluations vs Jacobian Frobenius norm of flows on CIFAR10 during training with vanilla FFJORD, using an adaptive ODE solver. @@ -1722,18 +1722,19 @@ For these reasons, we also propose regularizing the Jacobian Conveniently, i through its Frobenius norm. The Frobenius norm k · kF of a <> must be computed during the estimate of the prob- real matrix A can be thought of as the `2 norm of the matrix ability distribution under the flow, in the Monte-Carlo esti- A vectorized mate of the divergence term (5). Thus Jacobian Frobenius - kAkF = a2ij (12) norm regularization is available with essentially no extra + <> (12) norm regularization is available with essentially no extra computational cost. Equivalently it may be computed as 5. Algorithm description - kAkF = tr(AAT ) (13) All together, we propose modifying the objective function + <> (13) All together, we propose modifying the objective function of the FFJORD continuous normalizing flow (Grathwohl and is the Euclidean norm of the singular values of a matrix. et al., 2019) with the two regularization penalties of Sec- In trace form, the Frobenius norm lends itself to estimation tions 3 & 4. The proposed method is called RNODE, short using a Monte-Carlo trace estimator (Hutchinson, 1990; for regularized neural ODE. Pseudo-code of the method is <
> + Table 1. Log-likelihood (in bits/dim) and training time (in hours) on validation images with uniform dequantization. Results on clean images are found in the supplemental materials. For comparison we report both the results of the original FFJORD paper (Grathwohl et al., 2019) and our own independent run of FFJORD (“vanilla”) on CIFAR10 and MNIST. Vanilla FFJORD did not train on ImageNet64 @@ -1742,6 +1743,7 @@ comparable log-likelihood as FFJORD but is significantly faster. <
> + Figure 4. Quality of generated samples samples on 5bit CelebA-HQ64 with RNODE. Here temperature annealing (Kingma & Dhariwal, 2018) with T = 0.7 was used to generate visually appealing images. For full sized CelebA-HQ256 samples, consult the supplementary materials. @@ -1774,16 +1776,15 @@ solved by the numerical integrator is use f <> (RNODE) architecture to that of Grathwohl et al. (2019). The dynamics (Kingma & Dhariwal, 2018) trained with 40 GPUs for a week; in contrast we train with four GPUs in just under a week. - - - + <
> + Figure 5. Ablation study of the effect of the two regularizers, comparing two measures of flow regularity during training with a fixed step-size ODE solver. Figure 5a: mean Jacobian Frobenius norm as a function of training epoch. Figure 5b: mean kinetic energy of the flow as a function of training epoch. Figure 5c: number of function evaluations. -are defined by a neural network f (z, t; θ(t)) : Rd × R+ 7→ step size by a factor of two until the discrete dynamics were -Rd where θ(t) is piecewise constant in time. On MNIST we stable and achieved good performance. The Runge-Kutta +are defined by a neural network <> where <<θ(t)>> is piecewise constant in time. On MNIST we stable and achieved good performance. The Runge-Kutta use 10 pieces; CIFAR10 uses 14; downsampled ImageNet 4(5) adaptive solver was used on the two larger datasets. We uses 18; and CelebA-HQ uses 26 pieces. Each piece is a have also observed that RNODE improves the training time 4-layer deep convolutional network comprised of 3x3 ker- of the adaptive solvers as well, requiring many fewer func- @@ -1792,7 +1793,7 @@ have 64 hidden dimensions, and time t is concatenated to fixed gri the spatial input z. The integration time of each piece is of function evaluations. At test time RNODE uses the same [0, 1]. Weight matrices are chosen to imitate the multi-scale adaptive solver as FFJORD. architecture of Real NVP (Dinh et al., 2017), in that im- - We always initialize RNODE so that f (z, t) = 0; thus train- + We always initialize RNODE so that <>; thus train- ages are ‘squeezed’ via a permutation to halve image height ing begins with an initial identity map. This is done by zero- and width but quadruple the number of channels. Diver- @@ -1822,11 +1823,10 @@ fixed-grid four stage Runge-Kutta solver suffices for RN- FFJORD. T ODE during training on MNIST and CIFAR10, using a bits per dimension ( − d1 log2 p(x), a normalized measure step size of 0.25. The step size was determined based on of log-likelihood) on the validation set as a function of a simple heuristic of starting with 0.5 and decreasing the training epoch, for both datasets. Visual inspection of the - 3 - sample quality reveals no qualitative difference between - GeForce RTX 2080 Ti +sample quality reveals no qualitative difference between <
> + Figure 6. Quality of generated samples samples with and without regularization on MNIST, left, and CIFAR10, right. regularized and unregularized approaches; refer to Figure 6. encourages flows to travel a minimal distance. In addition, @@ -2047,9 +2047,9 @@ tion (OT) problem in Eulerian coordinates is Finally, if we assume that {xi }N i=1 are iid sampled from p, <> (18b) we obtain the empirical objective function - ρ0 (x) = p, (18c) + <<ρ0 (x) = p>>, (18c) - ρT (z) = q. (18d) <> (22) + <<ρT (z) = q>>. (18d) <> (22) The connection between continuous normalizing flows (CNF) and OT becomes transparent once we rewrite (18) in @@ -2063,17 +2063,17 @@ fields f one has that the solution of the continuity equation Here we prese The relation ρt = z(·, t)]p means that for arbitrary test function φ we have that - φ(x)ρt (x, t)dx = φ(z(x, t))p(x)dx + <<φ(x)ρt (x, t)dx = φ(z(x, t))p(x)dx>> Therefore (18) can be rewritten as - min kf (z(x, t), t)k2 p(x) dxdt (19a) + <> (19a) - subject to ż(x, t) = f (z(x, t), t), (19b) + <>, (19b) - z(x, 0) = x, (19c) + <>, (19c) - z(·, T )]p = q. (19d) + <>. (19d) Note that ρt is eliminated in this formulation. The terminal condition (18d) is trivial to implement in Eulerian coordi- @@ -2092,9 +2092,11 @@ therefore <> <
> + Figure 7. Quality of FFJORD RNODE generated images on ImageNet-64. <
> + Figure 8. Quality of FFJORD RNODE generated images on CelebA-HQ. We use temperature annealing, as described in (Kingma & Dhariwal, 2018), to generate visually appealing images, with T = 0.5, . . . , 1. @@ -2102,3 +2104,4498 @@ Table 2. Additional results and model statistics of FFJORD RNODE. Here we report validation images with uniform variational dequantization (ie perturbed by uniform noise). We also report number of trainable model parameters. <
> + +<> <> <> + + +<> <> <> + + A guide to convolution arithmetic for deep + learning + + The authors of this guide would like to thank David Warde-Farley, + Guillaume Alain and Caglar Gulcehre for their valuable feedback. We + are likewise grateful to all those who helped improve this tutorial with + helpful comments, constructive criticisms and code contributions. Keep + them coming! + Special thanks to Ethan Schoonover, creator of the Solarized color + scheme, 1 whose colors were used for the figures. + + Feedback + Your feedback is welcomed! We did our best to be as precise, infor- + mative and up to the point as possible, but should there be any thing you + feel might be an error or could be rephrased to be more precise or com- + prehensible, please don’t refrain from contacting us. Likewise, drop us a + line if you think there is something that might fit this technical report + and you would like us to discuss – we will make our best effort to update + this document. + + Source code and animations + The code used to generate this guide along with its figures is available + on GitHub. 2 There the reader can also find an animated version of the + figures. + + + 1 Introduction 5 + 1.1 Discrete convolutions . . . . . . . . . . . . . . . . . . . . . . . . .6 + 1.2 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10 + + 2 Convolution arithmetic 12 + 2.1 No zero padding, unit strides . . . . . . . . . . . . . . . . . . . .12 + 2.2 Zero padding, unit strides . . . . . . . . . . . . . . . . . . . . . .13 + 2.2.1 Half (same) padding . . . . . . . . . . . . . . . . . . . . .13 + 2.2.2 Full padding . . . . . . . . . . . . . . . . . . . . . . . . .13 + 2.3 No zero padding, non-unit strides . . . . . . . . . . . . . . . . . .15 + 2.4 Zero padding, non-unit strides . . . . . . . . . . . . . . . . . . . .15 + + 3 Pooling arithmetic 18 + + 4 Transposed convolution arithmetic 19 + 4.1 Convolution as a matrix operation . . . . . . . . . . . . . . . . .20 + 4.2 Transposed convolution . . . . . . . . . . . . . . . . . . . . . . .20 + 4.3 No zero padding, unit strides, transposed . . . . . . . . . . . . .21 + 4.4 Zero padding, unit strides, transposed . . . . . . . . . . . . . . .22 + 4.4.1 Half (same) padding, transposed . . . . . . . . . . . . . .22 + 4.4.2 Full padding, transposed . . . . . . . . . . . . . . . . . . .22 + 4.5 No zero padding, non-unit strides, transposed . . . . . . . . . . .24 + 4.6 Zero padding, non-unit strides, transposed . . . . . . . . . . . . .24 + + 5 Miscellaneous convolutions 28 + 5.1 Dilated convolutions . . . . . . . . . . . . . . . . . . . . . . . . .28 + + + Chapter 1 + + + Introduction + + + Deep convolutional neural networks (CNNs) have been at the heart of spectac- + ular advances in deep learning. Although CNNs have been used as early as the + nineties to solve character recognition tasks (Le Cunet al., 1997), their current + widespread application is due to much more recent work, when a deep CNN + was used to beat state-of-the-art in the ImageNet image classification challenge + (Krizhevskyet al., 2012). + Convolutional neural networks therefor e constitute a very useful tool for ma- + chine learning practitioners. However, learning to use CNNs for the first time + is generally an intimidating experience. A convolutional layer’s output shape + is affected by the shape of its input as well as the choice of kernel shape, zero + padding and strides, and the relationship between these properties is not triv- + ial to infer. 
This contrasts with fully-connected layers, whose output size is independent of the input size. Additionally, CNNs also usually feature a pooling stage, adding yet another level of complexity with respect to fully-connected networks. Finally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers) have been employed in more and more work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015; Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with convolutional layers has been explained with various degrees of clarity.
This guide's objective is twofold:

1. Explain the relationship between convolutional layers and transposed convolutional layers.
2. Provide an intuitive understanding of the relationship between input shape, kernel shape, zero padding, strides and output shape in convolutional, pooling and transposed convolutional layers.

In order to remain broadly applicable, the results shown in this guide are independent of implementation details and apply to all commonly used machine learning frameworks, such as Theano (Bergstra et al., 2010; Bastien et al., 2012), Torch (Collobert et al., 2011), Tensorflow (Abadi et al., 2015) and Caffe (Jia et al., 2014).

This chapter briefly reviews the main building blocks of CNNs, namely discrete convolutions and pooling. For an in-depth treatment of the subject, see Chapter 9 of the Deep Learning textbook (Goodfellow et al., 2016).


1.1 Discrete convolutions

The bread and butter of neural networks is affine transformations: a vector is received as input and is multiplied with a matrix to produce an output (to which a bias vector is usually added before passing the result through a non-linearity). This is applicable to any type of input, be it an image, a sound clip or an unordered collection of features: whatever their dimensionality, their representation can always be flattened into a vector before the transformation.
Images, sound clips and many other similar kinds of data have an intrinsic structure. More formally, they share these important properties:

They are stored as multi-dimensional arrays.
They feature one or more axes for which ordering matters (e.g., width and height axes for an image, time axis for a sound clip).
One axis, called the channel axis, is used to access different views of the data (e.g., the red, green and blue channels of a color image, or the left and right channels of a stereo audio track).

These properties are not exploited when an affine transformation is applied; in fact, all the axes are treated in the same way and the topological information is not taken into account. Still, taking advantage of the implicit structure of the data may prove very handy in solving some tasks, like computer vision and speech recognition, and in these cases it would be best to preserve it. This is where discrete convolutions come into play.
A discrete convolution is a linear transformation that preserves this notion of ordering. It is sparse (only a few input units contribute to a given output unit) and reuses parameters (the same weights are applied to multiple locations in the input).
Figure 1.1 provides an example of a discrete convolution. The light blue grid is called the input feature map.
To keep the drawing simple, a single input + feature map is represented, but it is not uncommon to have multiple feature + maps stacked one onto another. 1 A kernel(shaded area) of value + + <
> + + Figure 1.1: Computing the output values of a discrete convolution. + + + + <
>

Figure 1.2: Computing the output values of a discrete convolution for N = 2, i1 = i2 = 5, k1 = k2 = 3, s1 = s2 = 2, and p1 = p2 = 1.

slides across the input feature map. At each location, the product between each element of the kernel and the input element it overlaps is computed and the results are summed up to obtain the output in the current location. The procedure can be repeated using different kernels to form as many output feature maps as desired (Figure 1.3). The final outputs of this procedure are called output feature maps.2 If there are multiple input feature maps, the kernel will have to be 3-dimensional – or, equivalently, each one of the feature maps will be convolved with a distinct kernel – and the resulting feature maps will be summed up elementwise to produce the output feature map.
The convolution depicted in Figure 1.1 is an instance of a 2-D convolution, but it can be generalized to N-D convolutions. For instance, in a 3-D convolution, the kernel would be a cuboid and would slide across the height, width and depth of the input feature map.
The collection of kernels defining a discrete convolution has a shape corresponding to some permutation of (n, m, k1, ..., kN), where

<>

The following properties affect the output size oj of a convolutional layer along axis j:

<>

For instance, Figure 1.2 shows a 3x3 kernel applied to a 5x5 input padded with a 1x1 border of zeros using 2x2 strides.
Note that strides constitute a form of subsampling. As an alternative to being interpreted as a measure of how much the kernel is translated, strides can also be viewed as how much of the output is retained. For instance, moving the kernel by hops of two is equivalent to moving the kernel by hops of one but retaining only odd output elements (Figure 1.4).

1 An example of this is what was referred to earlier as channels for images and sound clips.
2 While there is a distinction between convolution and cross-correlation from a signal processing perspective, the two become interchangeable when the kernel is learned. For the sake of simplicity and to stay consistent with most of the machine learning literature, the term convolution will be used in this guide.

<
>

Figure 1.3: A convolution mapping from two input feature maps to three output feature maps using a 3x2 collection of 3x3 kernels w. In the left pathway, input feature map 1 is convolved with kernel w1,1 and input feature map 2 is convolved with kernel w1,2, and the results are summed together elementwise to form the first output feature map. The same is repeated for the middle and right pathways to form the second and third feature maps, and all three output feature maps are grouped together to form the output.

<
> + + Figure 1.4: An alternative way of viewing strides. Instead of translating the + 3x3 kernel by increments ofs= 2(left), the kernel is translated by increments + of1 and only one ins= 2output elements is retained (right). + + + 1.2 Pooling + + In addition to discrete convolutions themselves,pooling operations make up + another important building block in CNNs. Pooling operations reduce the size + of feature maps by using some function to summarize subregions, such as taking + the average or the maximum value. + Pooling works by sliding a window across the input and feeding the content + of the window to a pooling function. In some sense, pooling works very much + like a discrete convolution, but replaces the linear combination described by the + kernel with some other function. Figure 1.5 provides an example for average + pooling, and Figure 1.6 does the same for max pooling. + The following properties affect the output size j of a pooling layer along + axisj: + + <> + + + + <
> + + + Figure 1.5: Computing the output values of a 3x3 average pooling operation on a 5x5 input using 1x1 strides. + + <
>

Figure 1.6: Computing the output values of a 3x3 max pooling operation on a 5x5 input using 1x1 strides.


Chapter 2

Convolution arithmetic

The analysis of the relationship between convolutional layer properties is eased by the fact that they don't interact across axes, i.e., the choice of kernel size, stride and zero padding along axis j only affects the output size of axis j. Because of that, this chapter will focus on the following simplified setting:

2-D discrete convolutions (N = 2),
square inputs (i1 = i2 = i),
square kernel size (k1 = k2 = k),
same strides along both axes (s1 = s2 = s),
same zero padding along both axes (p1 = p2 = p).

This facilitates the analysis and the visualization, but keep in mind that the results outlined here also generalize to the N-D and non-square cases.


2.1 No zero padding, unit strides

The simplest case to analyze is when the kernel just slides across every position of the input (i.e., s = 1 and p = 0). Figure 2.1 provides an example for i = 4 and k = 3.
One way of defining the output size in this case is by the number of possible placements of the kernel on the input. Let's consider the width axis: the kernel starts on the leftmost part of the input feature map and slides by steps of one until it touches the right side of the input. The size of the output will be equal to the number of steps made, plus one, accounting for the initial position of the kernel (Figure 2.8a). The same logic applies for the height axis.
More formally, the following relationship can be inferred:

Relationship 1. For any i, k and p, and for s = 1,

<>


2.2 Zero padding, unit strides

To factor in zero padding (i.e., only restricting to s = 1), let's consider its effect on the effective input size: padding with p zeros changes the effective input size from i to i + 2p. In the general case, Relationship 1 can then be used to infer the following relationship:

Relationship 2. For any i, k and p, and for s = 1,

<>

Figure 2.2 provides an example for i = 5, k = 4 and p = 2.
In practice, two specific instances of zero padding are used quite extensively because of their respective properties. Let's discuss them in more detail.

2.2.1 Half (same) padding

Having the output size be the same as the input size (i.e., o = i) can be a desirable property:

Relationship 3. For any i and for k odd (k = 2n + 1, n ∈ N), s = 1 and p = ⌊k/2⌋ = n,

<>

This is sometimes referred to as half (or same) padding. Figure 2.3 provides an example for i = 5, k = 3 and (therefore) p = 1.

2.2.2 Full padding

While convolving a kernel generally decreases the output size with respect to the input size, sometimes the opposite is required. This can be achieved with proper zero padding:

Relationship 4. For any i and k, and for p = k - 1 and s = 1,

<>

<
> + + Figure 2.1: (No padding, unit strides) Convolving a 3x3 kernel over a 4x4 + input using unit strides (i.e.,i= 4,k= 3,s= 1 and p= 0). + + + <
> + + Figure 2.2: (Arbitrary padding, unit strides) Convolving a 4x4 kernel over a + 5x5 input padded with a 2x2 border of zeros using unit strides (i.e.,i= 5, + k= 4,s= 1 and p= 2). + + + <
> + + + Figure 2.3: (Half padding, unit strides) Convolving a 3x3 kernel over a 5x5 + input using half padding and unit strides (i.e.,i= 5,k= 3,s= 1 and p= 1). + + + <
> + + + Figure 2.4: (Full padding, unit strides) Convolving a 3x3 kernel over a 5x5 + input using full padding and unit strides (i.e.,i= 5,k= 3,s= 1 and p= 2). + + + This is sometimes referred to as full padding, because in this setting every + possible partial or complete superimposition of the kernel on the input feature + map is taken into account. Figure 2.4 provides an example for i= 5,k= 3 and + (therefore) p= 2. + + + 2.3 No zero padding, non-unit strides + + All relationships derived so far only apply for unit-strided convolutions. Incorporating + non unitary strides requires another inference leap. To facilitate + the analysis, let’s momentarily ignore zero padding (i.e.,s >1 and p= 0). + Figure 2.5 provides an example for i= 5,k= 3 and s= 2. + Once again, the output size can be defined in terms of the number of possible + placements of the kernel on the input. Let’s consider the width axis: the kernel + starts as usual on the leftmost part of the input, but this time it slides by steps + of sizes until it touches the right side of the input. The size of the output is + again equal to the number of steps made, plus one, accounting for the initial + position of the kernel (Figure 2.8b). The same logic applies for the height axis. + From this, the following relationship can be inferred: + + Relationship 5.for any i,k and s, and for p= 0, + + <> + + The floor function accounts for the fact that sometimes the last possible step + does not coincide with the kernel reaching the end of the input, i.e., some input + units are left out (see Figure 2.7 for an example of such a case). + + + 2.4 Zero padding, non-unit strides + + The most general case (convolving over a zero padded input using non-unit + strides) can be derived by applying Relationship 5 on an effective input of size + i+ 2p, in analogy to what was done for Relationship 2: + + Relationship 6.for any i,k,p and s, + + <> + + As before, the floor function means that in some cases a convolution will produce + the same output size for multiple input sizes. More specifically, ifi+ 2pkis + a multiple ofs, then any input size j=i+a; a2 f0;:::; sx1 g will produce + the same output size. Note that this ambiguity applies only for s >1. + + <
Figure 2.6 shows an example with i = 5, k = 3, s = 2 and p = 1, while
Figure 2.7 provides an example for i = 6, k = 3, s = 2 and p = 1. Interestingly,
despite having different input sizes these convolutions share the same output
size. While this doesn't affect the analysis for convolutions, this will complicate
the analysis in the case of transposed convolutions.
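Relationship 6 can be wrapped in a small helper to see this ambiguity concretely.
In this minimal Python sketch (the function name conv_out_size is ours), the
two settings just mentioned produce the same output size:

    import math

    def conv_out_size(i, k, s=1, p=0):
        """General convolution output size along one axis (Relationship 6)."""
        return math.floor((i + 2 * p - k) / s) + 1

    print(conv_out_size(5, k=3, s=2, p=1))  # 3
    print(conv_out_size(6, k=3, s=2, p=1))  # 3, same output size as for i = 5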
Figure 2.5: (No zero padding, arbitrary strides) Convolving a 3x3 kernel over
a 5x5 input using 2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0).
Figure 2.6: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
5x5 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 5,
k = 3, s = 2 and p = 1).
Figure 2.7: (Arbitrary padding and strides) Convolving a 3x3 kernel over a
6x6 input padded with a 1x1 border of zeros using 2x2 strides (i.e., i = 6,
k = 3, s = 2 and p = 1). In this case, the bottom row and right column of the
zero padded input are not covered by the kernel.

(a) The kernel has to slide two steps to the right to touch the right side of
the input (and equivalently downwards). Adding one to account for the initial
kernel position, the output size is 3x3.

(b) The kernel has to slide one step of size two to the right to touch the right
side of the input (and equivalently downwards). Adding one to account for the
initial kernel position, the output size is 2x2.
Figure 2.8: Counting kernel positions.


Chapter 3

Pooling arithmetic


In a neural network, pooling layers provide invariance to small translations of
the input. The most common kind of pooling is max pooling, which consists
in splitting the input in (usually non-overlapping) patches and outputting the
maximum value of each patch. Other kinds of pooling exist, e.g., mean or
average pooling, which all share the same idea of aggregating the input locally
by applying a non-linearity to the content of some patches (Boureau et al.,
2010a,b, 2011; Saxe et al., 2011).
Some readers may have noticed that the treatment of convolution arithmetic
only relies on the assumption that some function is repeatedly applied onto
subsets of the input. This means that the relationships derived in the previous
chapter can be reused in the case of pooling arithmetic. Since pooling does not
involve zero padding, the relationship describing the general case is as follows:

Relationship 7. For any i, k and s,

    o = floor((i - k) / s) + 1.

This relationship holds for any type of pooling.


Chapter 4

Transposed convolution arithmetic


The need for transposed convolutions generally arises from the desire to use a
transformation going in the opposite direction of a normal convolution, i.e., from
something that has the shape of the output of some convolution to something
that has the shape of its input, while maintaining a connectivity pattern that
is compatible with said convolution. For instance, one might use such a trans-
formation as the decoding layer of a convolutional autoencoder or to project
feature maps to a higher-dimensional space.
Once again, the convolutional case is considerably more complex than the
fully-connected case, which only requires using a weight matrix whose shape has
been transposed. However, since every convolution boils down to an efficient
implementation of a matrix operation, the insights gained from the fully-connected
case are useful in solving the convolutional case.
Like for convolution arithmetic, the discussion of transposed convolution
arithmetic is simplified by the fact that transposed convolution properties
don't interact across axes.
The chapter will focus on the following setting:

  2-D transposed convolutions (N = 2),
  square inputs (i1 = i2 = i),
  square kernel size (k1 = k2 = k),
  same strides along both axes (s1 = s2 = s),
  same zero padding along both axes (p1 = p2 = p).

Once again, the results outlined generalize to the N-D and non-square cases.


4.1 Convolution as a matrix operation

Take for example the convolution represented in Figure 2.1. If the input and
output were to be unrolled into vectors from left to right, top to bottom, the
convolution could be represented as a sparse matrix C where the non-zero
elements are the elements w_i,j of the kernel (with i and j being the row and
column of the kernel respectively):

    C = [ w0,0 w0,1 w0,2 0    w1,0 w1,1 w1,2 0    w2,0 w2,1 w2,2 0    0    0    0    0
          0    w0,0 w0,1 w0,2 0    w1,0 w1,1 w1,2 0    w2,0 w2,1 w2,2 0    0    0    0
          0    0    0    0    w0,0 w0,1 w0,2 0    w1,0 w1,1 w1,2 0    w2,0 w2,1 w2,2 0
          0    0    0    0    0    w0,0 w0,1 w0,2 0    w1,0 w1,1 w1,2 0    w2,0 w2,1 w2,2 ]

This linear operation takes the input matrix flattened as a 16-dimensional
vector and produces a 4-dimensional vector that is later reshaped as the 2x2
output matrix.
Using this representation, the backward pass is easily obtained by trans-
posing C; in other words, the error is backpropagated by multiplying the loss
with C^T. This operation takes a 4-dimensional vector as input and produces
a 16-dimensional vector as output, and its connectivity pattern is compatible
with C by construction.
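To make the matrix view concrete, here is a small NumPy sketch (the helper name
conv_matrix is ours) that builds C for the Figure 2.1 setting and checks it against
a direct computation. Like most deep learning libraries, it implements a
cross-correlation rather than a flipped-kernel convolution:

    import numpy as np

    def conv_matrix(w, i=4):
        """Build the sparse matrix C implementing a valid (p = 0, s = 1)
        2-D convolution of a k x k kernel w over an i x i input."""
        k = w.shape[0]
        o = i - k + 1
        C = np.zeros((o * o, i * i))
        for r in range(o):           # output row
            for c in range(o):       # output column
                for u in range(k):
                    for v in range(k):
                        C[r * o + c, (r + u) * i + (c + v)] = w[u, v]
        return C

    rng = np.random.default_rng(0)
    w = rng.standard_normal((3, 3))
    x = rng.standard_normal((4, 4))

    C = conv_matrix(w)                           # shape (4, 16)
    out_matmul = (C @ x.flatten()).reshape(2, 2)

    # Direct valid convolution for comparison.
    out_direct = np.array([[np.sum(w * x[r:r+3, c:c+3]) for c in range(2)]
                           for r in range(2)])
    assert np.allclose(out_matmul, out_direct)

    # The transposed operation maps a 4-dimensional vector back to a
    # 16-dimensional one with the same connectivity pattern.
    grad_input = C.T @ out_matmul.flatten()      # shape (16,)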
Notably, the kernel w defines both the matrices C and C^T used for the
forward and backward passes.


4.2 Transposed convolution

Let's now consider what would be required to go the other way around, i.e.,
map from a 4-dimensional space to a 16-dimensional space, while keeping the
connectivity pattern of the convolution depicted in Figure 2.1. This operation
is known as a transposed convolution.
Transposed convolutions – also called fractionally strided convolutions or
deconvolutions 1 – work by swapping the forward and backward passes of a
convolution. One way to put it is to note that the kernel defines a convolution, but
whether it's a direct convolution or a transposed convolution is determined by
how the forward and backward passes are computed.
For instance, although the kernel w defines a convolution whose forward and
backward passes are computed by multiplying with C and C^T respectively, it
also defines a transposed convolution whose forward and backward passes are
computed by multiplying with C^T and (C^T)^T = C respectively. 2
Finally, note that it is always possible to emulate a transposed convolution
with a direct convolution. The disadvantage is that it usually involves adding
many columns and rows of zeros to the input, resulting in a much less efficient
implementation.

1 The term "deconvolution" is sometimes used in the literature, but we advocate against it
on the grounds that a deconvolution is mathematically defined as the inverse of a convolution,
which is different from a transposed convolution.
2 The transposed convolution operation can be thought of as the gradient of some convolution
with respect to its input, which is usually how transposed convolutions are implemented
in practice.

Building on what has been introduced so far, this chapter will proceed some-
what backwards with respect to the convolution arithmetic chapter, deriving the
properties of each transposed convolution by referring to the direct convolution
with which it shares the kernel, and defining the equivalent direct convolution.


4.3 No zero padding, unit strides, transposed

The simplest way to think about a transposed convolution on a given input is
to imagine such an input as being the result of a direct convolution applied on
some initial feature map. The transposed convolution can then be considered as
the operation that allows one to recover the shape 3 of this initial feature map.
Let's consider the convolution of a 3x3 kernel on a 4x4 input with unitary
stride and no padding (i.e., i = 4, k = 3, s = 1 and p = 0). As depicted in
Figure 2.1, this produces a 2x2 output. The transpose of this convolution will
then have an output of shape 4x4 when applied on a 2x2 input.
Another way to obtain the result of a transposed convolution is to apply an
equivalent – but much less efficient – direct convolution. The example described
so far could be tackled by convolving a 3x3 kernel over a 2x2 input padded
with a 2x2 border of zeros using unit strides (i.e., i' = 2, k' = k, s' = 1 and
p' = 2), as shown in Figure 4.1. Notably, the kernel's and stride's sizes remain
the same, but the input of the transposed convolution is now zero padded. 4
One way to understand the logic behind zero padding is to consider the
connectivity pattern of the transposed convolution and use it to guide the design
of the equivalent convolution.
For example, the top left pixel of the input of the
direct convolution only contributes to the top left pixel of the output, the top
right pixel is only connected to the top right output pixel, and so on.
To maintain the same connectivity pattern in the equivalent convolution it is
necessary to zero pad the input in such a way that the first (top-left) application
of the kernel only touches the top-left pixel, i.e., the padding has to be equal to
the size of the kernel minus one.
Proceeding in the same fashion it is possible to make similar observations
for the other elements of the image, giving rise to the following relationship:

Relationship 8. A convolution described by s = 1, p = 0 and k
has an associated transposed convolution described by k' = k, s' = s
and p' = k - 1, and its output size is

    o' = i' + (k - 1).

Interestingly, this corresponds to a fully padded convolution with unit strides.


4.4 Zero padding, unit strides, transposed

Knowing that the transpose of a non-padded convolution is equivalent to con-
volving a zero padded input, it would be reasonable to suppose that the trans-
pose of a zero padded convolution is equivalent to convolving an input padded
with fewer zeros.
It is indeed the case, as shown in Figure 4.2 for i = 5, k = 4 and p = 2.
Formally, the following relationship applies for zero padded convolutions:

Relationship 9. A convolution described by s = 1, k and p has an
associated transposed convolution described by k' = k, s' = s and
p' = k - p - 1, and its output size is

    o' = i' + (k - 1) - 2p.

4.4.1 Half (same) padding, transposed

By applying the same inductive reasoning as before, it is reasonable to expect
that the equivalent convolution of the transpose of a half padded convolution
is itself a half padded convolution, given that the output size of a half padded
convolution is the same as its input size. Thus the following relation applies:

Relationship 10. A convolution described by k = 2n + 1, n ∈ N,
s = 1 and p = floor(k/2) = n has an associated transposed convolution
described by k' = k, s' = s and p' = p, and its output size is

    o' = i'.

3 Note that the transposed convolution does not guarantee to recover the input itself, as it
is not defined as the inverse of the convolution, but rather just returns a feature map that has
the same width and height.
4 Note that although equivalent to applying the transposed matrix, this visualization adds
a lot of zero multiplications in the form of zero padding. This is done here for illustration
purposes, but it is inefficient, and software implementations will normally not perform the
useless zero multiplications.
Figure 4.3 provides an example for i = 5, k = 3 and (therefore) p = 1.

4.4.2 Full padding, transposed

Knowing that the equivalent convolution of the transpose of a non-padded con-
volution involves full padding, it is unsurprising that the equivalent of the trans-
pose of a fully padded convolution is a non-padded convolution:
Figure 4.1: The transpose of convolving a 3x3 kernel over a 4x4 input using
unit strides (i.e., i = 4, k = 3, s = 1 and p = 0). It is equivalent to convolving
a 3x3 kernel over a 2x2 input padded with a 2x2 border of zeros using unit
strides (i.e., i' = 2, k' = k, s' = 1 and p' = 2).
Figure 4.2: The transpose of convolving a 4x4 kernel over a 5x5 input padded
with a 2x2 border of zeros using unit strides (i.e., i = 5, k = 4, s = 1 and
p = 2). It is equivalent to convolving a 4x4 kernel over a 6x6 input padded
with a 1x1 border of zeros using unit strides (i.e., i' = 6, k' = k, s' = 1 and
p' = 1).
Figure 4.3: The transpose of convolving a 3x3 kernel over a 5x5 input using
half padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 1). It is
equivalent to convolving a 3x3 kernel over a 5x5 input using half padding
and unit strides (i.e., i' = 5, k' = k, s' = 1 and p' = 1).


Relationship 11. A convolution described by s = 1, k and p = k - 1
has an associated transposed convolution described by k' = k, s' = s
and p' = 0, and its output size is

    o' = i' - (k - 1).
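If PyTorch is available, the unit-stride transposed relationships (8-11) can be
cross-checked against nn.ConvTranspose2d, whose padding argument corresponds
to the padding p of the associated direct convolution. This is only a sketch; the
helper name transposed_out_size is ours:

    import torch
    import torch.nn as nn

    def transposed_out_size(i_prime, k, p):
        """Unit-stride transposed convolution output size (Relationships 8-11):
        o' = i' + (k - 1) - 2p, where p is the direct convolution's padding."""
        return i_prime + (k - 1) - 2 * p

    # Settings of Figures 4.1-4.4 (channel dimensions are arbitrary here).
    for i_prime, k, p in [(2, 3, 0), (6, 4, 2), (5, 3, 1), (7, 3, 2)]:
        x = torch.zeros(1, 1, i_prime, i_prime)
        layer = nn.ConvTranspose2d(1, 1, kernel_size=k, stride=1, padding=p)
        assert layer(x).shape[-1] == transposed_out_size(i_prime, k, p)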
Figure 4.4 provides an example for i = 5, k = 3 and (therefore) p = 2.


4.5 No zero padding, non-unit strides, transposed

Using the same kind of inductive logic as for zero padded convolutions, one
might expect that the transpose of a convolution with s > 1 involves an equiv-
alent convolution with s < 1. As will be explained, this is a valid intuition,
which is why transposed convolutions are sometimes called fractionally strided
convolutions.
Figure 4.5 provides an example for i = 5, k = 3 and s = 2 which helps
understand what fractional strides involve: zeros are inserted between input
units, which makes the kernel move around at a slower pace than with unit
strides. 5

5 Doing so is inefficient and real-world implementations avoid useless multiplications by
zero, but conceptually it is how the transpose of a strided convolution can be thought of.

For the moment, it will be assumed that the convolution is non-padded
(p = 0) and that its input size i is such that i - k is a multiple of s. In that
case, the following relationship holds:

Relationship 12. A convolution described by p = 0, k and s, and
whose input size is such that i - k is a multiple of s, has an associated
transposed convolution described by ĩ', k' = k, s' = 1 and p' = k - 1,
where ĩ' is the size of the stretched input obtained by adding s - 1
zeros between each input unit, and its output size is

    o' = s (i' - 1) + k.


4.6 Zero padding, non-unit strides, transposed

When the convolution's input size i is such that i + 2p - k is a multiple of s,
the analysis can be extended to the zero padded case by combining Relationship 9
and Relationship 12:
Figure 4.4: The transpose of convolving a 3x3 kernel over a 5x5 input using
full padding and unit strides (i.e., i = 5, k = 3, s = 1 and p = 2). It is equivalent
to convolving a 3x3 kernel over a 7x7 input using unit strides (i.e., i' = 7,
k' = k, s' = 1 and p' = 0).
Figure 4.5: The transpose of convolving a 3x3 kernel over a 5x5 input using
2x2 strides (i.e., i = 5, k = 3, s = 2 and p = 0). It is equivalent to convolving
a 3x3 kernel over a 2x2 input (with 1 zero inserted between inputs) padded
with a 2x2 border of zeros using unit strides (i.e., i' = 2, ĩ' = 3, k' = k, s' = 1
and p' = 2).
Figure 4.6: The transpose of convolving a 3x3 kernel over a 5x5 input padded
with a 1x1 border of zeros using 2x2 strides (i.e., i = 5, k = 3, s = 2 and
p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
1 zero inserted between inputs) padded with a 1x1 border of zeros using unit
strides (i.e., i' = 3, ĩ' = 5, k' = k, s' = 1 and p' = 1).


Relationship 13. A convolution described by k, s and p, and whose
input size i is such that i + 2p - k is a multiple of s, has an associated
transposed convolution described by ĩ', k' = k, s' = 1 and
p' = k - p - 1, where ĩ' is the size of the stretched input obtained by
adding s - 1 zeros between each input unit, and its output size is

    o' = s (i' - 1) + k - 2p.
Figure 4.6 provides an example for i = 5, k = 3, s = 2 and p = 1.
The constraint on the size of the input i can be relaxed by introducing
another parameter a ∈ {0, ..., s - 1} that allows to distinguish between the s
different cases that all lead to the same i':

Relationship 14. A convolution described by k, s and p has an
associated transposed convolution described by a, ĩ', k' = k, s' = 1
and p' = k - p - 1, where ĩ' is the size of the stretched input obtained
by adding s - 1 zeros between each input unit, and a = (i + 2p - k)
mod s represents the number of zeros added to the bottom and right
edges of the input, and its output size is

    o' = s (i' - 1) + a + k - 2p.
Figure 4.7 provides an example for i = 6, k = 3, s = 2 and p = 1.
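Relationship 14 is easy to evaluate directly. A minimal Python sketch (the
helper name transposed_conv_out_size is ours) reproduces the settings of
Figures 4.6 and 4.7:

    def transposed_conv_out_size(i_prime, k, s, p, a=0):
        """Relationship 14: output size of the transposed convolution associated
        with a direct convolution of kernel size k, stride s and padding p.
        a = (i + 2p - k) mod s disambiguates the direct convolution's input size."""
        return s * (i_prime - 1) + a + k - 2 * p

    # Figure 4.6: the direct convolution has i = 5, k = 3, s = 2, p = 1, so i' = 3 and a = 0.
    print(transposed_conv_out_size(3, k=3, s=2, p=1, a=0))  # 5
    # Figure 4.7: the direct convolution has i = 6, k = 3, s = 2, p = 1, so i' = 3 and a = 1.
    print(transposed_conv_out_size(3, k=3, s=2, p=1, a=1))  # 6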
Figure 4.7: The transpose of convolving a 3x3 kernel over a 6x6 input padded
with a 1x1 border of zeros using 2x2 strides (i.e., i = 6, k = 3, s = 2 and
p = 1). It is equivalent to convolving a 3x3 kernel over a 3x3 input (with
1 zero inserted between inputs) padded with a 1x1 border of zeros (with an
additional border of size 1 added to the bottom and right edges) using unit
strides (i.e., i' = 3, ĩ' = 5, a = 1, k' = k, s' = 1 and p' = 1).


Chapter 5

Miscellaneous convolutions

5.1 Dilated convolutions

Readers familiar with the deep learning literature may have noticed the term
"dilated convolutions" (or "atrous convolutions", from the French expression
convolutions à trous) appear in recent papers. Here we attempt to provide an
intuitive understanding of dilated convolutions. For a more in-depth description
and to understand in what contexts they are applied, see Chen et al. (2014); Yu
and Koltun (2015).
Dilated convolutions "inflate" the kernel by inserting spaces between the ker-
nel elements. The dilation "rate" is controlled by an additional hyperparameter
d. Implementations may vary, but there are usually d - 1 spaces inserted between
kernel elements such that d = 1 corresponds to a regular convolution.
Dilated convolutions are used to cheaply increase the receptive field of output
units without increasing the kernel size, which is especially effective when multi-
ple dilated convolutions are stacked one after another. For a concrete example,
see Oord et al. (2016), in which the proposed WaveNet model implements an
autoregressive generative model for raw audio which uses dilated convolutions
to condition new audio frames on a large context of past audio frames.
To understand the relationship tying the dilation rate d and the output size
o, it is useful to think of the impact of d on the effective kernel size. A kernel
of size k dilated by a factor d has an effective size

    k_eff = k + (k - 1)(d - 1).

This can be combined with Relationship 6 to form the following relationship for
dilated convolutions:

Relationship 15. For any i, k, p and s, and for a dilation rate d,

    o = floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1.
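As a quick check of Relationship 15, a minimal Python sketch (the helper name
dilated_conv_out_size is ours) reproduces the setting of Figure 5.1 below:

    def dilated_conv_out_size(i, k, s, p, d):
        """Relationship 15: Relationship 6 applied to the effective kernel size
        k + (k - 1)(d - 1) of a kernel dilated by a factor d."""
        k_eff = k + (k - 1) * (d - 1)
        return (i + 2 * p - k_eff) // s + 1

    # Figure 5.1: i = 7, k = 3, d = 2, s = 1, p = 0 -> effective kernel 5, output 3.
    print(dilated_conv_out_size(7, k=3, s=1, p=0, d=2))  # 3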
Figure 5.1: (Dilated convolution) Convolving a 3x3 kernel over a 7x7 input
with a dilation factor of 2 (i.e., i = 7, k = 3, d = 2, s = 1 and p = 0).


Figure 5.1 provides an example for i = 7, k = 3 and d = 2.



Bibliography


Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
G. S., Davis, A., Dean, J., Devin, M., et al. (2015). Tensorflow: Large-
scale machine learning on heterogeneous systems. Software available from
tensorflow.org.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron,
A., Bouchard, N., Warde-Farley, D., and Bengio, Y. (2012). Theano: new
features and speed improvements. arXiv preprint arXiv:1211.5590.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins,
G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: A cpu and
gpu math compiler in python. In Proc. 9th Python in Science Conf, pages
1–7.
Boureau, Y., Bach, F., LeCun, Y., and Ponce, J. (2010a). Learning mid-level
features for recognition. In Proc. International Conference on Computer
Vision and Pattern Recognition (CVPR'10). IEEE.
Boureau, Y., Ponce, J., and LeCun, Y. (2010b). A theoretical analysis of feature
pooling in vision algorithms. In Proc. International Conference on Machine
Learning (ICML'10).
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the
locals: multi-way local pooling for image recognition. In Proc. International
Conference on Computer Vision (ICCV'11). IEEE.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014).
Semantic image segmentation with deep convolutional nets and fully con-
nected crfs. arXiv preprint arXiv:1412.7062.
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A matlab-like
environment for machine learning. In BigLearn, NIPS Workshop, number
EPFL-CONF-192376.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. Book in
preparation for MIT Press.
Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. (2016). Generating images
with recurrent adversarial networks. arXiv preprint arXiv:1602.05110.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
rama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast
feature embedding. In Proceedings of the ACM International Conference on
Multimedia, pages 675–678. ACM.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification
with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097–1105.
Le Cun, Y., Bottou, L., and Bengio, Y. (1997). Reading checks with multilayer
graph transformer networks. In Acoustics, Speech, and Signal Processing,
1997. ICASSP-97., 1997 IEEE International Conference on, volume 1, pages
151–154. IEEE.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for
semantic segmentation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3431–3440.
Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A.,
Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A
generative model for raw audio. arXiv preprint arXiv:1609.03499.
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representa-
tion learning with deep convolutional generative adversarial networks. arXiv
preprint arXiv:1511.06434.
Saxe, A., Koh, P.
W., Chen, Z., Bh and , M., Suresh, B., and Ng, A. (2011). + On r and om weights and unsupervised feature learning. In L. Getoor and + T. Scheffer, editors,Proceedings of the 28th International Conference on Ma- + chine Learning (ICML-11), ICML ’11, pages 1089–1096, New York, NY, USA. + ACM. + Visin, F., Kastner, K., Courville, A. C., Bengio, Y., Matteucci, M., and Cho, + K. (2015). Reseg: A recurrent neural network for object segmentation. + Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated con- + volutions.arXiv preprint arXiv:1511.07122. + Zeiler, M. D. and Fergus, R. (2014). Visualizing and underst and ing convolu- + tional networks. InComputer vision–ECCV 2014, pages 818–833. Springer. + Zeiler, M. D., Taylor, G. W., and Fergus, R. (2011). Adaptive deconvolutional + networks for mid and high level feature learning. InComputer Vision (ICCV), + 2011 IEEE International Conference on, pages 2018–2025. IEEE. + +<> <> <> + + +<> <> <> + + A Survey of Model Compression and Acceleration for Deep Neural Networks + + Yu Cheng, Duo Wang, Pan Zhou Member IEEE, and Tao Zhang Senior Member IEEE + + Abstract—Deep convolutional neural networks (CNNs) have [2], [3]. It is also very time-consuming to train such a model + recently achieved great success in many visual recognition tasks. to get reasonable performance. In architectures that rely only However, existing deep neural network models are computation- on fully-connected layers, the number of parameters can grow ally expensive and memory intensive, hindering their deployment + in devices with low memory resources or in applications with to billions [4]. + + strict latency requirements. Therefore, a natural thought is to As larger neural networks with more layers and nodes + + without significantly decreasing the model performance. During becomes critical, especially for some real-time applications the past few years, tremendous progress has been made in such as online learning and incremental learning. In addi- this area. In this paper, we survey the recent advanced tech- + niques for compacting and accelerating CNNs model developed. tion, recent years witnessed significant progress in virtual + These techniques are roughly categorized into four schemes: reality, augmented reality, and smart wearable devices, cre- + parameter pruning and sharing, low-rank factorization, trans- ating unprecedented opportunities for researchers to tackle + ferred/compact convolutional filters, and knowledge distillation. fundamental challenges in deploying deep learning systems to Methods of parameter pruning and sharing will be described at portable devices with limited resources (e.g. memory, CPU, the beginning, after that the other techniques will be introduced. + For each scheme, we provide insightful analysis regarding the energy, bandwidth). Efficient deep learning methods can have + performance, related applications, advantages, and drawbacks significant impacts on distributed systems, embedded devices, + etc. Then we will go through a few very recent additional and FPGA for Artificial Intelligence. For example, the ResNet- + successful methods, for example, dynamic capacity networks and 50 [5] with 50 convolutional layers needs over 95MB memory stochastic depths networks. After that, we survey the evaluation for storage and over 3.8 billion floating number multiplications matrix, the main datasets used for evaluating the model per- + formance and recent benchmarking efforts. Finally, we conclude when processing an image. 
After discarding some redundant + this paper, discuss remaining challenges and possible directions weights, the network still works as usual but saves more than + on this topic. 75% of parameters and 50% computational time. For devices + Index Terms—Deep Learning, Convolutional Neural Networks, like cell phones and FPGAs with only several megabyte + Model Compression and Acceleration, resources, how to compact the models used on them is also + important. + Achieving these goal calls for joint solutions from many + + I. INTRODUCTION + + disciplines, including but not limited to machine learning, op- + In recent years, deep neural networks have recently received timization, computer architecture, data compression, indexing, + lots of attention, been applied to different applications and and hardware design. In this paper, we review recent works + achieved dramatic accuracy improvements in many tasks. on compressing and accelerating deep neural networks, which + These works rely on deep networks with millions or even attracted a lot of attention from the deep learning community + billions of parameters, and the availability of GPUs with and already achieved lots of progress in the past years. + very high computation capability plays a key role in their We classify these approaches into four categories: pa- + success. For example, the work by Krizhevskyet al.[1] rameter pruning and sharing, low-rank factorization, trans- + achieved breakthrough results in the 2012 ImageNet Challenge ferred/compact convolutional filters, and knowledge distil- + using a network containing 60 million parameters with five lation. The parameter pruning and sharing based methods + convolutional layers and three fully-connected layers. Usually, explore the redundancy in the model parameters and try to + it takes two to three days to train the whole model on remove the redundant and uncritical ones. Low-rank factor- + ImagetNet dataset with a NVIDIA K40 machine. Another ization based techniques use matrix/tensor decomposition to + example is the top face verification results on the Labeled estimate the informative parameters of the deep CNNs. The + Faces in the Wild (LFW) dataset were obtained with networks approaches based on transferred/compact convolutional filters + containing hundreds of millions of parameters, using a mix design special structural convolutional filters to reduce the + of convolutional, locally-connected, and fully-connected layers parameter space and save storage/computation. The knowledge + distillation methods learn a distilled model and train a more Yu Cheng is a Researcher from Microsoft AI & Research, One Microsoft + Way, Redmond, WA 98052, USA. compact neural network to reproduce the output of a larger + Duo Wang and Tao Zhang are with the Department of Automation, network. + Tsinghua University, Beijing 100084, China. In Table I, we briefly summarize these four types of Pan Zhou is with the School of Electronic Information and Communi- methods. Generally, the parameter pruning & sharing, low- cations, Huazhong University of Science and Technology, Wuhan 430074, + China. rank factorization and knowledge distillation approaches can IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 2 + + + TABLE I + + <
> + + be used in DNN models with fully connected layers and + convolutional layers, achieving comparable performances. On + the other hand, methods using transferred/compact filters are + designed for models with convolutional layers only. Low-rank + factorization and transfered/compact filters based approaches + provide an end-to-end pipeline and can be easily implemented + in CPU/GPU environment, which is straightforward. while + parameter pruning & sharing use different methods such as + vector quantization, binary coding and sparse constraints to + perform the task. Generally it will take several steps to achieve + the goal. + + <
> + Fig. 1. The three-stage compression method proposed in [10]: pruning, Regarding the training protocols, models based on param- quantization and encoding. The input is the original model and the output + + eter pruning/sharing low-rank factorization can be extracted is the compression model. + from pre-trained ones or trained from scratch. While the + transferred/compact filter and knowledge distillation models + can only support train from scratch. These methods are inde- memory usage and float point operations with little loss in + pendently designed and complement each other. For example, classification accuracy. + transferred layers and parameter pruning & sharing can be The method proposed in [10] quantized the link weights + used together, and model quantization & binarization can be using weight sharing and then applied Huffman coding to the + used together with low-rank approximations to achieve further quantized weights as well as the codebook to further reduce + speedup. We will describe the details of each theme, their the rate. As shown in Figure 1, it started by learning the con- + properties, strengths and drawbacks in the following sections. nectivity via normal network training, followed by pruning the + small-weight connections. Finally, the network was retrained + to learn the final weights for the remaining sparse connections. + + II. PARAMETER PRUNING AND SHARING + + This work achieved the state-of-art performance among allEarly works showed that network pruning is effective in parameter quantization based methods. It was shown in [11] reducing the network complexity and addressing the over- that Hessian weight could be used to measure the importancefitting problem [6]. After that researcher found pruning orig- of network parameters, and proposed to minimize Hessian-inally introduced to reduce the structure in neural networks weighted quantization errors in average for clustering networkand hence improve generalization, it has been widely studied parameters.to compress DNN models, trying to remove parameters which In the extreme case of the 1-bit representation of eachare not crucial to the model performance. These techniques can weight, that is binary weight neural networks. There arebe further classified into three sub-categories: quantization and many works that directly train CNNs with binary weights, forbinarization, parameter sharing, and structural matrix. instance, BinaryConnect [12], BinaryNet [13] and XNORNet- + works [14]. The main idea is to directly learn binary weights orA. Quantization and Binarization activation during the model training. The systematic study in + Network quantization compresses the original network by [15] showed that networks trained with back propagation could + reducing the number of bits required to represent each weight. be resilient to specific weight distortions, including binary + Gonget al.[6] and Wu et al. [7] appliedk-means scalar weights. + quantization to the parameter values. Vanhouckeet al.[8] Drawbacks: the accuracy of the binary nets is significantly + showed that 8-bit quantization of the parameters can result lowered when dealing with large CNNs such as GoogleNet. + in significant speed-up with minimal loss of accuracy. 
The Another drawback of such binary nets is that existing bina- + work in [9] used 16-bit fixed-point representation in stochastic rization schemes are based on simple matrix approximations + rounding based CNN training, which significantly reduced and ignore the effect of binarization on the accuracy loss. IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 3 + + + To address this issue, the work in [16] proposed a proximal connected layers, which is often the bottleneck in terms of + Newton algorithm with diagonal Hessian approximation that memory consumption. These network layers use the nonlinear + directly minimizes the loss with respect to the binary weights. transformsf(x;M) =(Mx), where()is an element-wise + The work in [17] reduced the time on float point multiplication nonlinear operator,xis the input vector, andMis themn + in the training stage by stochastically binarizing weights and matrix of parameters [29]. WhenMis a large general dense + converting multiplications in the hidden state computation to matrix, the cost of storingmnparameters and computing + significant changes. matrix-vector products inO(mn)time. Thus, an intuitive + way to prune parameters is to imposexas a parameterizedB. Pruning and Sharing structural matrix. Anmn matrix that can be described + Network pruning and sharing has been used both to reduce using much fewer parameters thanmnis called a structured + network complexity and to address the over-fitting issue. An matrix. Typically, the structure should not only reduce the + early approach to pruning was the Biased Weight Decay memory cost, but also dramatically accelerate the inference + [18]. The Optimal Brain Damage [19] and the Optimal Brain and training stage via fast matrix-vector multiplication and + Surgeon [20] methods reduced the number of connections gradient computations. + based on the Hessian of the loss function, and their work sug- Following this direction, the work in [30], [31] proposed a + gested that such pruning gave higher accuracy than magnitude- simple and efficient approach based on circulant projections, + while maintaining competitive error rates. Given a vectorr=based pruning such as the weight decay method. The training procedure of those methods followed the way training from <>, a circulant matrix R^2 R^dxd is defined + as: <> + scratch manner. A recent trend in this direction is to prune redundant, <> non-informative weights in a pre-trained CNN model. For <> + example, Srinivas and Babu [21] explored the redundancy <> among neurons, and proposed a data-free pruning method to + remove redundant neurons. Hanet al.[22] proposed to reduce <> + the total number of parameters and operations in the entire thus the memory cost becomesO(d)instead of O(d^2) network. Chenet al.[23] proposed a HashedNets model that This circulant structure also enables the use of Fast Fourier used a low-cost hash function to group weights into hash Transform (FFT) to speed up the computation. Given ad-buckets for parameter sharing. The deep compression method dimensional vectorr, the above 1-layer circulant neural net-in [10] removed the redundant connections and quantized the work in Eq. 1 has time complexity ofO(dlogd).weights, and then used Huffman coding to encode the quan- In [32], a novel Adaptive Fastfood transform was introducedtized weights. 
In [24], a simple regularization method based to reparameterize the matrix-vector multiplication of fully on soft weight-sharing was proposed, which included both connected layers. The Adaptive Fast food transform matrix quantization and pruning in one simple (re-)training procedure. R2Rnd was defined as:The above pruning schemes typically produce connections + pruning in CNNs. <> (2) + There is also growing interest in training compact CNNs whereS,GandBare random diagonal matrices. 2 + with sparsity constraints. Those sparsity constraints are typ- <> is a random permutation matrix, and H denotes + ically introduced in the optimization problem asl0 orl1 - the Walsh-Hadamard matrix. Reparameterizing a fully con- + norm regularizers. The work in [25] imposed group sparsity nected layer with d inputs and n outputs using the Adaptive + constraint on the convolutional filters to achieve structured Fast food transform reduces the storage and the computational + brain Damage, i.e., pruning entries of the convolution kernels costs from O(n^d) to O(n) and from O(n^d) to O(n*log(d)), + in a group-wise fashion. In [26], a group-sparse regularizer respectively. + on neurons was introduced during the training stage to learn The work in [29] showed the effectiveness of the new + compact CNNs with reduced filters. Wenet al.[27] added a notion of parsimony in the theory of structured matrices. Their + structured sparsity regularizer on each layer to reduce trivial proposed method can be extended to various other structured + filters, channels or even layers. In the filter-level pruning, all matrix classes, including block and multi-level Toeplitz-like + the above works used l2-norm regularizers. The work in [28] [33] matrices related to multi-dimensional convolution [34]. + usedl1 -norm to select and prune unimportant filters. Following this idea, [35] proposed a general structured effi- + Drawbacks: there are some potential issues of the pruning cient linear layer for CNNs. + and sharing. First, pruning with l1 or l2 regularization requires Drawbacks: one problem of this kind of approaches is that + more iterations to converge than general. In addition, all the structural constraint will hurt the performance since the + pruning criteria require manual setup of sensitivity for layers, constraint might bring bias to the model. On the other hand, + which demands fine-tuning of the parameters and could be how to find a proper structural matrix is difficult. There is no + cumbersome for some applications. theoretical way to derive it out. + + C. Designing Structural Matrix + + III. LOW-RANK FACTORIZATION AND SPARSITY + + + In architectures that contain fully-connected layers, it is Convolution operations contribute the bulk of most com- + critical to explore this redundancy of parameters in fully- putations in deep CNNs, thus reducing the convolution layer IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 4 + + + TABLE II + COMPARISONS BETWEEN THE LOW -RANK MODELS AND THEIR BASELINES + ON ILSVRC-2012. + + <
> + + Fig. 2. A typical framework of the low-rank regularization method. The left + is the original convolutional layer and the right is the low-rank constraint + convolutional layer with rank-K. + + would improve the compression rate as well as the overall + speedup. For the convolution kernels, it can be viewed as a + 4D tensor. Ideas based on tensor decomposition is derived by For instance, Mishaet al.[41] reduced the number of dynamic + the intuition that there is a significant amount of redundancy parameters in deep models using the low-rank method. [42] + in the 4D tensor, which is a particularly promising way to explored a low-rank matrix factorization of the final weight + remove the redundancy. Regarding the fully-connected layer, layer in a DNN for acoustic modeling. In [3], Luet al.adopted + it can be view as a 2D matrix and the low-rankness can also truncated SVD (singular value decomposition) to decompsite + help. the fully connected layer for designing compact multi-task + It has been a long time for using low-rank filters to acceler- deep learning architectures. + ate convolution, for example, high dimensional DCT (discrete Drawbacks: low-rank approaches are straightforward for + cosine transform) and wavelet systems using tensor products model compression and acceleration. The idea complements + to be constructed from 1D DCT transform and 1D wavelets recent advances in deep learning, such as dropout, recti- + respectively. Learning separable 1D filters was introduced fied units and maxout. However, the implementation is not + by Rigamontiet al.[36], following the dictionary learning that easy since it involves decomposition operation, which + idea. Regarding some simple DNN models, a few low-rank is computationally expensive. Another issue is that current + approximation and clustering schemes for the convolutional methods perform low-rank approximation layer by layer, and + kernels were proposed in [37]. They achieved 2speedup thus cannot perform global parameter compression, which + for a single convolutional layer with 1% drop in classification is important as different layers hold different information. + accuracy. The work in [38] proposed using different tensor Finally, factorization requires extensive model retraining to + decomposition schemes, reporting a 4.5speedup with 1% achieve convergence when compared to the original model. + drop in accuracy in text recognition. + The low-rank approximation was done layer by layer. The IV. T RANSFERRED /COMPACT CONVOLUTIONAL FILTERS + parameters of one layer were fixed after it was done, and the CNNs are parameter efficient due to exploring the trans-layers above were fine-tuned based on a reconstruction error lation invariant property of the representations to the input criterion. These are typical low-rank methods for compressing image, which is the key to the success of training very deep2D convolutional layers, which is described in Figure 2. Fol- models without severe over-fitting. Although a strong theory lowing this direction, Canonical Polyadic (CP) decomposition is currently missing, a large number of empirical evidenceof was proposed for the kernel tensors in [39]. Their work support the notion that both the translation invariant property used nonlinear least squares to compute the CP decomposition. and the convolutional weight sharing are important for good In [40], a new algorithm for computing the low-rank tensor predictive performance. 
The idea of using transferred convolu- decomposition for training low-rank constrained CNNs from tional filters to compress CNN models is motivated by recent scratch were proposed. It used Batch Normalization (BN) to works in [43], which introduced the equivariant group theory.transform the activation of the internal hidden units. In general, Letxbe an input,()be a network or layer and T() be the both the CP and the BN decomposition schemes in [40] (BN transform matrix. The concept of equivalence is defined as:Low-rank) can be used to train CNNs from scratch. However, + there are few differences between them. For example, finding <> (3) + the best low-rank approximation in CP decomposition is an ill- + posed problem, and the best rank-K (K is the rank number) indicating that transforming the input x by the transform T() + approximation may not exist sometimes. While for the BN and then passing it through the network or layer () should + scheme, the decomposition always exists. We perform a simple give the same result as first mapping x through the network + comparison of both methods shown in Table II. The actual and then transforming the representation. Note that in Eq. + speedup and the compression rates are used to measure their (10), the transforms <> and <> are not necessarily the + performances. same as they operate on different objects. According to this + As we mentioned before, the fully connected layers can theory, it is reasonable applying transform to layers or filters + be viewed as a 2D matrix and thus the above mentioned () to compress the whole network models. From empirical + methods can also be applied there. There are several classical observation, deep CNNs also benefit from using a large set of + works on exploiting low-rankness in fully connected layers. convolutional filters by applying certain transformT()to a IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 5 + + + small set of base filters since it acts as a regularizer for the TABLE III + model. A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND + Following this direction, there are many recent reworks + proposed to build a convolutional layer from a set of base <
> + filters [43]–[46]. What they have in common is that the + transform T() lies in the family of functions that only operate + in the spatial domain of the convolutional filters. For example, + the work in [45] found that the lower convolution layers of + CNNs learned redundant filters to extract both positive and + negative phase information of an input signal, and definedT() Drawbacks: there are few issues to be addressed for ap-to be the simple negation function: proaches that apply transform constraints to convolutional fil- + + <> (4) + + ters. First, these methods can achieve competitive performance x for wide/flat architectures (like VGGNet) but not thin/deepwhereWx is the basis convolutional filter andW is the filter x ones (like GoogleNet, Residual Net). Secondly, the transferconsisting of the shifts whose activation is opposite to that assumptions sometimes are too strong to guide the learning,ofWx and selected after max-pooling operation. By doing making the results unstable in some cases.this, the work in [45] can easily achieve 2compression Using a compact filter for convolution can directly reducerate on all the convolutional layers. It is also shown that the the computation cost. The key idea is to replace the loosenegation transform acts as a strong regularizer to improve and over-parametric filters with compact blocks to improve the classification accuracy. The intuition is that the learning the speed, which significantly accelerate CNNs on severalalgorithm with pair-wise positive-negative constraint can lead benchmarks. Decomposing33convolution into two11to useful convolutional filters instead of redundant ones. convolutions was used in [48], which achieved significantIn [46], it was observed that magnitudes of the responses acceleration on object recognition. SqueezeNet [49] was pro-from convolutional kernels had a wide diversity of pattern posed to replace33convolution with11convolu-representations in the network, and it was not proper to discard tion, which created a compact neural network with about 50weaker signals with a single threshold. Thus a multi-bias non- fewer parameters and comparable accuracy when compared tolinearity activation function was proposed to generates more AlexNet.patterns in the feature space at low computational cost. The + transformT()was define as: + + <> (5) + + V. KNOWLEDGE DISTILLATION + + To the best of our knowledge, exploiting knowledge transfer + where were the multi-bias factors. The work in [47] con- (KT) to compress model was first proposed by Caruanaet + side red a combination of rotation by a multiple of 90 and al.[50]. They trained a compressed/ensemble model of strong + horizontal/vertical flipping with: classifiers with pseudo-data labeled, and reproduced the output + of the original larger network. But the work is limited to + + <> (6) + + shallow models. The idea has been recently adopted in [51] + whereWT was the transformation matrix which rotated the as knowledge distillation (KD) to compress deep and wide + original filters with angle2 f90;180;270g. In [43], the networks into shallower ones, where the compressed model + transform was generalized to any angle learned from data, and mimicked the function learned by the complex model. The + was directly obtained from data. Both works [47] and [43] main idea of KD based approaches is to shift knowledge from + can achieve good classification performance. 
a large teacher model into a small one by learning the class + The work in [44] definedT()as the set of translation distributions output via softmax. + functions applied to 2D filters: The work in [52] introduced a KD compression framework, + which eased the training of deep networks by following a + + <> (7) + + student-teacher paradigm, in which the student was penalized + whereT(;x;y)denoted the translation of the first operand by according to a softened version of the teacher’s output. The + (x;y)along its spatial dimensions, with proper zero padding framework compressed an ensemble of teacher networks into + at borders to maintain the shape. The proposed framework a student network of similar depth. The student was trained + can be used to 1) improve the classification accuracy as a to predict the output and the classification labels. Despite + regularized version of maxout networks, and 2) to achieve its simplicity, KD demonstrates promising results in various + parameter efficiency by flexibly varying their architectures to image classification tasks. The work in [53] aimed to address + compress networks. the network compression problem by taking advantage of + Table III briefly compares the performance of different depth neural networks. It proposed an approach to train thin + methods with transferred convolutional filters, using VGGNet but deep networks, called FitNets, to compress wide and + (16 layers) as the baseline model. The results are reported shallower (but still deep) networks. The method was extended + on CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is the idea to allow for thinner and deeper student models. In + observed that they can achieve reduction in parameters with order to learn from the intermediate representations of teacher + little or no drop in classification accuracy. network, FitNet made the student mimic the full feature maps + + + of the teacher. However, such assumptions are too strict since layer with global average pooling [44], [62]. Network architec- + the capacities of teacher and student may differ greatly. ture such as GoogleNet or Network in Network, can achieve + All the above approaches are validated on MNIST, CIFAR- state-of-the-art results on several benchmarks by adopting + 10, CIFAR-100, SVHN and AFLW benchmark datasets, and this idea. However, these architectures have not been fully + experimental results show that these methods match or outper- optimized the utilization of the computing resources inside + form the teacher’s performance, while requiring notably fewer the network. This problem was noted by Szegedyet al.[62] + parameters and multiplications. and motivated them to increase the depth and width of the + There are several extension along this direction of dis- network while keeping the computational budget constant. + tillation knowledge. The work in [54] trained a parametric The work in [63] targeted the Residual Network based + student model to approximate a Monte Carlo teacher. The model with a spatially varying computation time, called + proposed framework used online training, and used deep stochastic depth, which enabled the seemingly contradictory + neural networks for the student model. Different from previous setup to train short networks and used deep networks at test + works which represented the knowledge using the soften label time. 
It started with very deep networks, while during training, + probabilities, [55] represented the knowledge by using the for each mini-batch, randomly dropped a subset of layers + neurons in the higher hidden layer, which preserved as much and bypassed them with the identity function. Following this + information as the label probabilities, but are more compact. direction, thew work in [64] proposed a pyramidal residual + The work in [56] accelerated the experimentation process by networks with stochastic depth. In [65], Wuet al.proposed + instantaneously transferring the knowledge from a previous an approach that learns to dynamically choose which layers + network to each new deeper or wider network. The techniques of a deep network to execute during inference so as to best + are based on the concept of function-preserving transfor- reduce total computation. Veitet al.exploited convolutional + mations between neural network specifications. Zagoruyko networks with adaptive inference graphs to adaptively define + et al.[57] proposed Attention Transfer (AT) to relax the their network topology conditioned on the input image [66]. + assumption of FitNet. They transferred the attention maps that Other approaches to reduce the convolutional overheads in-are summaries of the full activations. clude using FFT based convolutions [67] and fast convolutionDrawbacks: KD-based approaches can make deeper models using the Winograd algorithm [68]. Zhaiet al.[69] proposed athinner and help significantly reduce the computational cost. strategy call stochastic spatial sampling pooling, which speed-However, there are a few disadvantages. One of those is that up the pooling operations by a more general stochastic version.KD can only be applied to classification tasks with softmax Saeedanet al.presented a novel pooling layer for convolu-loss function, which hinders its usage. Another drawback is tional neural networks termed detail-preserving pooling (DPP),the model assumptions sometimes are too strict to make the based on the idea of inverse bilateral filters [70]. Those worksperformance competitive with other type of approaches. only aim to speed up the computation but not reduce the + memory storage. + + VI. OTHER TYPES OF APPROACHES + + We first summarize the works utilizing attention-based + methods. Note that attention-based mechanism [58] can reduce + + VII. BENCHMARKS , EVALUATION AND DATABASES + computations significantly by learning to selectively focus or In the past five years the deep learning community had“attend” to a few, task-relevant input regions. The work in made great efforts in benchmark models. One of the most[59] introduced the dynamic capacity network (DCN) that well-known model used in compression and acceleration forcombined two types of modules: the small sub-networks with CNNs is Alexnet [1], which has been occasionally usedlow capacity, and the large ones with high capacity. The low- for assessing the performance of compression. Other popularcapacity sub-networks were active on the whole input to first standard models include LeNets [71], All-CNN-nets [72] andfind the task-relevant areas, and then the attention mechanism many others. LeNet-300-100 is a fully connected networkwas used to direct the high-capacity sub-networks to focus on with two hidden layers, with 300 and 100 neurons each.the task-relevant regions. By dong this, the size of the CNNs LeNet-5 is a convolutional network that has two convolutionalmodel has been significantly reduced. 
layers and two fully connected layers. Recently, more andFollowing this direction, the work in [60] introduced the more state-of-the-art architectures are used as baseline modelsconditional computation idea, which only computes the gra- in many works, including network in networks (NIN) [73],dient for some important neurons. It proposed a sparsely- VGG nets [74] and residual networks (ResNet) [75]. Table IVgated mixture-of-experts Layer (MoE). The MoE module summarizes the baseline models commonly used in severalconsisted of a number of experts, each a simple feed-forward typical compression methods.neural network, and a trainable gating network that selected + a sparse combination of the experts to process each input. In The standard criteria to measure the quality of model + [61], dynamic deep neural networks (D2NN) were introduced, compression and acceleration are the compression and the + which were a type of feed-forward deep neural network that speedup rates. Assume thatais the number of the parameters + selected and executed a subset of D2NN neurons based on the in the original model Manda is that of the compressed + input. model M , then the compression rate (M;M ) of M over + There have been other attempts to reduce the number of Mis aparameters of neural networks by replacing the fully connected (M;M ) = : (8)a IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 7 + + + TABLE IV or low rank factorization based methods. If you need + SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT end-to-end solutions for your problem, the low rank REPRESENTATIVE WORKS OF NETWORK COMPRESSION . and transferred convolutional filters approaches could be + considered. + For applications in some specific domains, methods with low-rank factorization [40] human prior (like the transferred convolutional filters, Network in network [73] low-rank factorization [40] + <
Other approaches to reduce the convolutional overheads include using FFT-based convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the pooling operations with a more general stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks, termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. These works only aim to speed up the computation and do not reduce the memory storage.

VII. BENCHMARKS, EVALUATION AND DATABASES

In the past five years, the deep learning community has made great efforts in benchmarking models. One of the most well-known models used in compression and acceleration for CNNs is AlexNet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures are used as baseline models in many works, including network in networks (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly used in several typical compression methods.

TABLE IV
SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.

    Baseline models              Representative works
    <>                           low-rank factorization [40]
    Network in network [73]      low-rank factorization [40]
    Residual networks [75]       compact filters [49], stochastic depth [63], parameter sharing [24]
    <>                           parameter pruning [20], [22]

The standard criteria to measure the quality of model compression and acceleration are the compression rate and the speedup rate. Assume that a is the number of parameters in the original model M and a* is that of the compressed model M*; then the compression rate α(M, M*) of M* over M is

    α(M, M*) = a / a*.    (8)

Another widely used measurement is the index space saving, defined in several papers [30], [35] as

    β(M, M*) = (a − a*) / a*,    (9)

where a and a* are the number of dimensions of the index space in the original model and that of the compressed model, respectively.

Similarly, given the running time s of M and s* of M*, the speedup rate δ(M, M*) is defined as

    δ(M, M*) = s / s*.    (10)

Most works use the average training time per epoch to measure the running time, while in [30], [35] the average testing time was used. Generally, the compression rate and the speedup rate are highly correlated, as smaller models often result in faster computation for both the training and the testing stages.
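The three measurements in equations (8)-(10) are straightforward to compute; the sketch below, with hypothetical parameter counts and timings, is only meant to make the definitions concrete.

    def compression_rate(a_orig, a_comp):
        """alpha(M, M*) = a / a*  -- how many times fewer parameters the compressed model has."""
        return a_orig / a_comp

    def index_space_saving(a_orig, a_comp):
        """beta(M, M*) = (a - a*) / a*  -- relative saving of the index space."""
        return (a_orig - a_comp) / a_comp

    def speedup_rate(s_orig, s_comp):
        """delta(M, M*) = s / s*  -- ratio of original to compressed running time."""
        return s_orig / s_comp

    # hypothetical numbers, for illustration only
    a, a_star = 25.5e6, 6.0e6        # parameter counts of M and M*
    s, s_star = 120.0, 45.0          # average time per epoch in seconds
    print(compression_rate(a, a_star), index_space_saving(a, a_star), speedup_rate(s, s_star))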
Good compression methods are expected to achieve almost the same performance as the original model with many fewer parameters and less computational time. However, for different applications with different CNN designs, the relation between parameter size and computational time may be different. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers, while for image classification tasks the floating point operations are mainly in the first few convolutional layers, since each filter is convolved with the whole image, which is usually very large at the beginning. Thus compression and acceleration of the network should focus on different types of layers for different applications.

VIII. DISCUSSION AND CHALLENGES

In this paper, we summarized recent efforts on compressing and accelerating deep neural networks (DNNs). Here we discuss in more detail how to choose among the different compression approaches, and possible challenges and solutions in this area.

A. General Suggestions

There is no golden rule to decide which approach is the best. How to choose the proper method really depends on the application and its requirements. Here is some general guidance we can provide:

- If the application needs compact models derived from pre-trained models, one can choose either pruning & sharing or low-rank factorization based methods. If an end-to-end solution is needed, the low-rank and transferred convolutional filter approaches could be considered.
- For applications in some specific domains, methods with human prior (like the transferred convolutional filters and the structural matrix) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters could work well, as medical images (e.g., of organs) do have the rotation transformation property.
- Usually, the approaches of pruning & sharing give a reasonable compression rate while not hurting the accuracy. Thus, for applications that require stable model accuracy, it is better to utilize pruning & sharing.
- If the problem involves small or medium-size datasets, one can try the knowledge distillation approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust for datasets which are not large.
- As we mentioned before, the techniques of the four groups are orthogonal. It is reasonable to combine two or three of them to maximize the performance. For some specific applications, like object detection, which requires both convolutional and fully connected layers, one can compress the convolutional layers with a low-rank based method and the fully connected layers with a pruning technique.

B. Technique Challenges

Techniques for deep model compression and acceleration are still in an early stage, and the following challenges still need to be addressed.

- Most of the current state-of-the-art approaches are built on well-designed CNN models, which leave limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, more plausible ways to configure the compressed models are needed.
- Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging, because removing channels might dramatically change the input of the following layer (see the sketch after this list).
- As we mentioned before, methods based on the structural matrix and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of such prior knowledge.
- The methods of knowledge distillation provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worth developing KD-based approaches and exploring how to improve their performance.
- Hardware constraints on various small platforms (e.g., mobile, robotic, self-driving car) remain a major problem hindering the extension of deep CNNs. How to make full use of the limited computational resources and how to design special compression methods for such platforms are still challenges that need to be addressed.
- Despite the great achievements of these compression approaches, the black-box mechanism is still a key barrier to their adoption. Exploring knowledge interpretability is still an important problem.
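To make the channel pruning challenge above concrete, the following is a minimal sketch of magnitude-based channel selection for one convolutional layer: output channels whose filters have small L1 norm are removed, and the next layer's weights must be sliced accordingly, which is exactly the "input of the following layer" coupling mentioned in the list. The tensor shapes and the L1 criterion follow common filter-norm practice (in the spirit of [28]) rather than any single method surveyed here.

    import numpy as np

    def prune_channels(w_conv, w_next, keep_ratio=0.5):
        """Magnitude-based channel pruning sketch.

        w_conv : (c_out, c_in, k, k) weights of the layer being pruned
        w_next : (c_out_next, c_out, k, k) weights of the following layer
        Returns both weight tensors with the low-importance output channels removed.
        """
        # importance of each output channel = L1 norm of its filter
        importance = np.abs(w_conv).reshape(w_conv.shape[0], -1).sum(axis=1)
        n_keep = max(1, int(keep_ratio * w_conv.shape[0]))
        keep = np.sort(np.argsort(importance)[-n_keep:])
        # pruning output channels here removes the corresponding *input* channels
        # of the next layer -- this coupling is what makes channel pruning tricky
        return w_conv[keep], w_next[:, keep]

    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((16, 8, 3, 3))
    w2 = rng.standard_normal((32, 16, 3, 3))
    w1_p, w2_p = prune_channels(w1, w2, keep_ratio=0.5)
    print(w1_p.shape, w2_p.shape)   # (8, 8, 3, 3) (32, 8, 3, 3)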
C. Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on recent learning-to-learn strategies [76], [77]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement learning to efficiently sample the design space and improve model compression has also been tried [78].

Channel pruning provides efficiency benefits on both CPU and GPU because no special implementation is required. But it is also challenging to handle the resulting input configuration. One possible solution is to use training-based channel pruning methods [79], which focus on imposing sparse constraints on weights during training. However, training from scratch with such methods is costly for very deep CNNs. In [80], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.

Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for knowledge distillation (KD) approaches. Instead of directly reducing and transferring parameters, passing selectivity knowledge of neurons could be helpful. One can derive a way to select the essential neurons related to the task [81], [82]. The intuition is that if a neuron is activated in certain regions or samples, this implies that those regions or samples share some common properties that may relate to the task.

For the methods based on convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that operate only on the spatial dimensions. Hence, to address the imposed-prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to a set of predefined transformations, let it be the whole family of spatial transformations applied on 2D filters or matrices, and 2) learn the transformation jointly with all the model parameters.

Regarding the use of CNNs on small platforms, proposing general, unified approaches is one direction. Wang et al. [83] presented a feature map dimensionality reduction method that excavates and removes redundancy in feature maps generated from different filters, which can also preserve the intrinsic information of the original network. The idea can be applied to make CNNs more applicable to different platforms. The work in [84] proposed a one-shot whole-network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning, to make deep CNNs work on mobile devices.

Besides the classification task, people are also adapting the compacted models to other tasks [85]-[87]. We would like to see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).

IX. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help in modifying the paper. This research is supported by the National Science Foundation of China with Grant number 61401169.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, "Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification," CoRR, vol. abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in NIPS, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," CoRR, vol. abs/1412.6115, 2014.
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in CVPR, 2016.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in ICML, 2015, pp. 1737-1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," in ICLR, 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," CoRR, vol. abs/1612.01543, 2016.
[12] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in NIPS, 2015, pp. 3123-3131.
[13] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in ECCV, 2016.
[15] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," CoRR, vol. abs/1606.01981, 2016.
[16] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," CoRR, vol. abs/1611.01600, 2016.
[17] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," CoRR, vol. abs/1510.03009, 2015.
[18] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in NIPS 1, 1989, pp. 177-185.
[19] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in NIPS 2, 1990, pp. 598-605.
[20] B. Hassibi, D. G. Stork, and S. C. R. Com, "Second order derivatives for network pruning: Optimal brain surgeon," in NIPS 5, 1993, pp. 164-171.
[21] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," in BMVC, 2015, pp. 31.1-31.12.
[22] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
[23] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," JMLR Workshop and Conference Proceedings, 2015.
[24] K. Ullrich, E. Meeds, and M. Welling, "Soft weight-sharing for neural network compression," CoRR, vol. abs/1702.04008, 2017.
[25] V. Lebedev and V. S. Lempitsky, "Fast convnets using group-wise brain damage," in CVPR, 2016, pp. 2554-2564.
[26] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in ECCV, 2016, pp. 662-677.
[27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NIPS, 2016, pp. 2074-2082.
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," CoRR, vol. abs/1608.08710, 2016.
[29] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in NIPS, 2015, pp. 3088-3096.
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in ICCV, 2015.
[31] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, "Fast neural networks with circulant projections," CoRR, vol. abs/1502.03436, 2015.
[32] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, "Deep fried convnets," in ICCV, 2015.
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidelberg: Springer, 1991, pp. 215-236.
[34] M. V. Rakhuba and I. V. Oseledets, "Fast multidimensional convolution in low-rank tensor formats via cross approximation," SIAM J. Scientific Computing, vol. 37, no. 2, 2015.
[35] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "ACDC: A structured efficient linear layer," in ICLR, 2016.
[36] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in CVPR, 2013, pp. 2754-2761.
[37] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014, pp. 1269-1277.
[38] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in BMVC, 2014.
[39] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," CoRR, vol. abs/1412.6553, 2014.
[40] C. Tai, T. Xiao, X. Wang, and W. E, "Convolutional neural networks with low-rank regularization," CoRR, vol. abs/1511.06067, 2015.
[41] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, "Predicting parameters in deep learning," in NIPS, 2013, pp. 2148-2156.
[42] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE ICASSP, 2013.
[43] T. S. Cohen and M. Welling, "Group equivariant convolutional networks," arXiv preprint arXiv:1602.07576, 2016.
[44] S. Zhai, Y. Cheng, and Z. M. Zhang, "Doubly convolutional neural networks," in NIPS, 2016, pp. 1082-1090.
[45] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv preprint arXiv:1603.05201, 2016.
[46] H. Li, W. Ouyang, and X. Wang, "Multi-bias non-linear activation in deep neural networks," arXiv preprint arXiv:1604.00676, 2016.
[47] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, "Exploiting cyclic symmetry in convolutional neural networks," in ICML, 2016.
[48] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, inception-resnet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016.
[49] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," CoRR, vol. abs/1612.01051, 2016.
[50] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, "Model compression," in KDD, 2006, pp. 535-541.
[51] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in NIPS, 2014, pp. 2654-2662.
[52] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," CoRR, vol. abs/1503.02531, 2015.
[53] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," CoRR, vol. abs/1412.6550, 2014.
[54] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," in NIPS, 2015, pp. 3420-3428.
[55] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, "Face model compression by distilling knowledge from neurons," in AAAI, 2016, pp. 3560-3566.
[56] T. Chen, I. J. Goodfellow, and J. Shlens, "Net2net: Accelerating learning via knowledge transfer," CoRR, vol. abs/1511.05641, 2015.
[57] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," CoRR, vol. abs/1612.03928, 2016.
[58] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[59] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, "Dynamic capacity networks," in ICML, 2016, pp. 2549-2558.
[60] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," 2017.
[61] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583-1597, 2016.
[62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[63] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," 2016.
[64] Y. Yamada, M. Iwamura, and K. Kise, "Deep pyramidal residual networks with separated stochastic depth," CoRR, vol. abs/1612.01230, 2016.
[65] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, "Blockdrop: Dynamic inference paths in residual networks," in CVPR, 2018.
[66] A. Veit and S. Belongie, "Convolutional networks with adaptive inference graphs," 2018.
[67] M. Mathieu, M. Henaff, and Y. Lecun, "Fast training of convolutional networks through FFTs," 2014.
[68] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in CVPR, 2016, pp. 4013-4021.
[69] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Feris, "S3pool: Pooling with stochastic spatial sampling," CoRR, vol. abs/1611.05138, 2016.
[70] F. Saeedan, N. Weber, M. Goesele, and S. Roth, "Detail-preserving pooling in deep networks," in CVPR, 2018.
[71] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, 1998, pp. 2278-2324.
[72] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, "Striving for simplicity: The all convolutional net," CoRR, vol. abs/1412.6806, 2014.
[73] M. Lin, Q. Chen, and S. Yan, "Network in network," in ICLR, 2014.
[74] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[75] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[76] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in NIPS, 2016.
[77] D. Ha, A. Dai, and Q. Le, "Hypernetworks," in ICLR, 2016.
[78] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in ECCV, 2018.
[79] J. M. Alvarez and M. Salzmann, "Learning the number of neurons in deep networks," pp. 2270-2278, 2016.
[80] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in ICCV, 2017.
[81] Z. Huang and N. Wang, "Data-driven sparse structure selection for deep neural networks," in ECCV, 2018.
[82] Y. Chen, N. Wang, and Z. Zhang, "Darkrank: Accelerating deep metric learning via cross sample similarities transfer," in AAAI, 2018, pp. 2852-2859.
[83] Y. Wang, C. Xu, C. Xu, and D. Tao, "Beyond filters: Compact feature map for portable deep model," in ICML, 2017, pp. 3703-3711.
[84] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," CoRR, vol. abs/1511.06530, 2015.
[85] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, "Learning efficient object detection models with knowledge distillation," in NIPS, 2017, pp. 742-751.
[86] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
[87] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in CVPR, 2017, pp. 3296-3297.
[88] Y. Cheng, Q. Fan, S. Pankanti, and A. Choudhary, "Temporal sequence modeling for video event detection," in CVPR, 2014.
[89] L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong, M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X. Yu, "IBM Research and Columbia University TRECVID-2012 multimedia event detection (MED), multimedia event recounting (MER), and semantic indexing (SIN) systems," 2012.

Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he was a Research Staff Member at IBM T.J. Watson Research Center. Yu got his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in deep generative models, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. His research interests are deep learning, particularly few-shot learning and deep generative models. He also works on applications in computer vision and robotic vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interests include big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University. He serves as the Associate Dean of the School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.

<> <> <>

Analysis and Design of Echo State Networks
Mustafa C. Ozturk
can@cnel.ufl.edu

Dongming Xu
dmxu@cnel.ufl.edu

Jose C. Principe
principe@cnel.ufl.edu

Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.

The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN), which couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings (Principe, 2001). Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but they are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995).

Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschlager, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a "reservoir of rich dynamics" (Jaeger, 2001) and contains information about the history of input and output patterns. The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESNs is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but it obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter.

The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ||.||) of the reservoir's weight matrix (||W|| < 1). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters relies on the selection of the spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure of the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschlager (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics.

In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(.) of time onto output functions y(.) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, the impact of fixed reservoir parameters for function approximation means that the information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).

This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of bases and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state, from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate-free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold.

The design principle specifies that one should consider independently the correlation among the bases and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PEs in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the basis, this principle distributes the system's degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal bases. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (a long correlation time requires a high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.

2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u1(n), u2(n), ..., uM(n)]^T, of the internal units x(n) = [x1(n), x2(n), ..., xN(n)]^T, and of the output units y(n) = [y1(n), y2(n), ..., yL(n)]^T. The connection weights are given in an N x M weight matrix W_in for connections between the input and the internal PEs, in an N x N matrix W for connections between the internal PEs, in an L x N matrix W_out for connections from the PEs to the output units, and in an N x L matrix W_back for the connections that project back from the output to the internal PEs (Jaeger, 2001).
<
>

Figure 1: An echo state network (ESN). An ESN is composed of two parts: a fixed-weight (||W|| < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, the states of which are called echo states. The memoryless linear readout is trained to produce the output.

The activation of the internal PEs (echo states) is updated according to

    x(n + 1) = f(W_in u(n + 1) + W x(n) + W_back y(n)),    (2.1)

where f = (f1, f2, ..., fN) are the internal PEs' activation functions. Here, all fi's are hyperbolic tangent functions, (e^x - e^-x)/(e^x + e^-x). The output of the readout network is computed according to

    y(n + 1) = f_out(W_out x(n + 1)),    (2.2)

where f_out = (f_out_1, f_out_2, ..., f_out_L) are the output units' nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so f_out is the identity.
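A minimal NumPy sketch of the update and readout in equations 2.1 and 2.2 (with a linear readout and no output feedback, i.e., W_back = 0, which is also the setting of the experiment below) may help make the notation concrete; the array shapes are placeholders.

    import numpy as np

    def esn_states(u, W_in, W, x0=None):
        """Run equation 2.1 with tanh PEs and W_back = 0; returns the echo states over time."""
        N = W.shape[0]
        x = np.zeros(N) if x0 is None else x0
        states = []
        for n in range(len(u)):
            x = np.tanh(W_in @ np.atleast_1d(u[n]) + W @ x)   # x(n+1) = f(W_in u(n+1) + W x(n))
            states.append(x.copy())
        return np.array(states)                               # shape (time, N)

    def esn_readout(states, W_out):
        """Equation 2.2 with a linear readout (f_out = identity)."""
        return states @ W_out.T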
ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued function of a real-valued vector

    u(t) = [u1(t), u2(t), ..., uM(t)]^T.

In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions φi(t), called the basis functionals, such that its approximant, ĥ(u(t)), is given by

    ĥ(u(t)) = Σ_{i=1..N} a_i φ_i(t).

Here, the ai's are the projections of h(u(t)) onto each basis function. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes for a complete set of orthogonal bases, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single-hidden-layer TDNN with N PEs and a linear output. The hidden-layer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input,

    φ_i(u(t)) = g( Σ_j b_ij u_j(t) ),

where the bij's are the input layer weights and g is the PE nonlinearity. The approximation produced by the TDNN is then

    ĥ(u(t)) = Σ_{i=1..N} a_i φ_i(u(t)),    (2.3)

where the ai's are the weights of the output layer. Notice that the bij's adapt the bases and the ai's adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden-layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually, since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among the φi(u(t))'s must be enforced. Ito, Shah and Poon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article.

The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through W_in, W, and W_back. Notice, however, that none of these weight matrices is adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs trade the adaptive connections in the RNN hidden layer for a brute-force approach of creating fixed, diversified dynamics in the hidden layer. For an ESN with a linear readout network, the output equation (y(n + 1) = W_out x(n + 1)) has the same form as equation 2.3, where the φi's and ai's are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

A similar perspective of bases and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the responses of neurons in parietal cortex serve as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands".

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that selecting the ESN parameters by constraining the spectral radius is not the most suitable approach for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and W_back is set to zero. The input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout.
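The experiment just described is easy to reproduce in outline. The sketch below builds a sparse W with entries 0.4, −0.4, and 0 (probabilities 0.025, 0.025, 0.95), runs the reservoir on the sinusoidal input with the esn_states helper sketched above, and fits the linear readout by least squares, which is the finite-sample counterpart of the Wiener solution given in the next paragraph; the length of the discarded washout transient is an assumption.

    import numpy as np

    rng = np.random.default_rng(7)
    N, T = 100, 300
    # sparse reservoir: 0.4 / -0.4 / 0 with probabilities 0.025 / 0.025 / 0.95
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W_in = rng.choice([1.0, -1.0], size=(N, 1))

    n = np.arange(T)
    u = np.sin(2 * np.pi * n / (10 * np.pi))     # input signal
    d = u ** 7                                   # desired output: seventh power of the input

    X = esn_states(u, W_in, W)                   # echo states from equation 2.1
    washout = 50                                 # discard the initial transient (assumed length)
    Xw, dw = X[washout:], d[washout:]

    # least-squares readout, i.e., the sample version of E[x x^T]^{-1} E[x d]
    w_out = np.linalg.lstsq(Xw, dw, rcond=None)[0]
    mse = np.mean((dw - Xw @ w_out) ** 2)
    print("spectral radius:", np.max(np.abs(np.linalg.eigvals(W))), "MSE:", mse)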
One method to determine the optimal output weight matrix, W_out, in the mean square error (MSE) sense (where the MSE is defined by E[(d − y)^2]/2) is to use the Wiener solution given by Haykin (2001):

    W_out = E[x x^T]^{-1} E[x d].

Here, E[.] denotes the expected value operator, and d denotes the desired signal. Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is 5.9 x 10^-9, whereas the maximum MSE is 8.9 x 10^-5. This experiment demonstrates that a design strategy based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. It shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases changes, and different performances are encountered in practice.

<
>

Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, the MSEs vary from 5.9 x 10^-9 to 8.9 x 10^-5. The results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by

    x(n + 1) = f(W_in u(n + 1) + W x(n)).    (2.4)

Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n + 1), defined by

    J(n + 1) = F(n + 1) W,  where F(n + 1) = diag( f'(net_1(n)), f'(net_2(n)), ..., f'(net_N(n)) ).    (2.5)

Here, net_i(n) is the ith entry of the vector (W_in u(n + 1) + W x(n)), and w_ij denotes the (i, j)th entry of W. The poles of the linearized system at time n + 1 are given by the eigenvalues of the Jacobian matrix J(n + 1).¹ As the amplitude of each PE changes, the local slope changes, and so the poles of the linearized system are time varying, although the parameters of the ESN are fixed.

¹ The transfer function of a linear system x(n + 1) = A x(n) + B u(n), y(n) = C x(n) is C (zI − A)^{-1} B = C Adjoint(zI − A) B / det(zI − A). The poles of the transfer function can be obtained by solving det(zI − A) = 0. The solution corresponds to the eigenvalues of A.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5.

Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of the input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (which decreases the spectral radius), resulting in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input.

<
>

Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input goes through a cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of a sinusoidal signal with a period of 100. (B-E) The positions of the poles of the linearized systems when the input values are at B, C, D, and E in (A). (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. An ESN with more states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems when compared to their linear counterpart.

A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix W. The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius due to the shape of the nonlinearity.² The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschlager (2004), but here we approach it from a system theory perspective and explain its effect on the reservoir dynamics.

² Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of W, W = P D P^{-1}, where P is the eigenvector matrix and D is the diagonal matrix of eigenvalues d_ii of W. Since F(n) and D are diagonal, J(n) = F(n) W = F(n) (P D P^{-1}) = P (F(n) D) P^{-1} is the eigendecomposition of J(n). Here, each entry of F(n) D, f'(net_i(n)) d_ii, is an eigenvalue of J. Therefore, |f'(net_i(n)) d_ii| <= |d_ii| since |f'(net_i(n))| <= 1.
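Under the linearization in equation 2.5, the instantaneous poles are simply the eigenvalues of F(n + 1) W, where F collects the tanh derivatives at the current operating point. A small sketch of the pole-tracking computation, with placeholder sizes and scalings in the spirit of the experiment above, could look like this:

    import numpy as np

    def pole_tracks(u, W_in, W):
        """Eigenvalues of the linearized ESN, J(n+1) = F(n+1) W, along an input trajectory."""
        x = np.zeros(W.shape[0])
        poles = []
        for n in range(len(u)):
            net = W_in @ np.atleast_1d(u[n]) + W @ x     # pre-activations net_i(n)
            x = np.tanh(net)
            F = np.diag(1.0 - x ** 2)                    # f'(net) for tanh PEs
            poles.append(np.linalg.eigvals(F @ W))       # instantaneous pole locations
        return np.array(poles)

    rng = np.random.default_rng(1)
    N = 100
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.05, 0.05, 0.9])
    W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))     # rescale to spectral radius 0.95
    W_in = rng.choice([1.0, -1.0], size=(N, 1))
    u = np.sin(2 * np.pi * np.arange(200) / 100)         # sinusoid with a period of 100
    p = pole_tracks(u, W_in, W)                          # shape (time, N), complex poles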
3 Average State Entropy as a Measure of the Richness of the ESN Reservoir

Previous research was aware of the influence of the diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs, and several metrics to quantify this diversity have been proposed (Jaeger, 2001; Maass et al., 2005). Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states. The entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous values of the ESN states. If the echo states' instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by the trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi's entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable X with pdf f_X(x) is given by Renyi (1970):

    H_γ(X) = (1 / (1 − γ)) log ∫ f_X^γ(x) dx.

Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is obtained). Given N samples {x1, x2, ..., xN} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

    f̂_X(x) = (1/N) Σ_{i=1..N} K_σ(x − x_i),

where K_σ is the kernel function with kernel size σ. Then Renyi's quadratic entropy can be estimated by (Principe et al., 2000)

    H_2(X) = −log( (1/N^2) Σ_j Σ_i K_σ(x_j − x_i) ).    (3.1)

The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector x(n) = [x1(n), x2(n), ..., xN(n)]^T of an ESN with N internal PEs.
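A direct NumPy transcription of the Parzen-based estimator in equation 3.1 with a gaussian kernel is given below; treating the N state values at time n as the samples gives the instantaneous state entropy, and averaging over n gives the ASE. The kernel-size rule (0.3 of the standard deviation of the state entries) is the one stated just below; the small floor on the kernel size is an added numerical safeguard.

    import numpy as np

    def quadratic_renyi_entropy(samples, sigma):
        """Equation 3.1: H2 = -log( (1/N^2) sum_ij K_sigma(x_j - x_i) ), gaussian kernel."""
        x = np.asarray(samples, dtype=float)
        diff = x[:, None] - x[None, :]
        K = np.exp(-diff ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
        return -np.log(K.mean())

    def average_state_entropy(states, kernel_factor=0.3):
        """ASE: instantaneous state entropy (over the N PE amplitudes) averaged over time."""
        entropies = []
        for x_n in states:                                   # states has shape (time, N)
            sigma = kernel_factor * np.std(x_n) + 1e-12      # kernel size = 0.3 * std of the entries
            entropies.append(quadratic_renyi_entropy(x_n, sigma))
        return float(np.mean(entropies))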
Results will be shown with a gaussian kernel, with the kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter for quantifying the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii, and even with the same spectral radius, display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since the state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between the different states. In practice, to quantify the overall representation ability over time, we will use ASE, which takes the values −0.735, −0.007, and 0.335 for the spectral radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral radius, several ASEs are possible. Figure 4C shows ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs as long as the input dynamic range is properly selected.

<
>

Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of echo states (100 PEs) produced by ESNs with spectral radii of 0.2, 0.5, and 0.8, from top to bottom, respectively. The diversity of the echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with a smaller spectral radius can generate only uneven representations, while for ||W|| = 0.8 the outputs of the echo states distribute almost uniformly within their dynamic range. (B) Instantaneous state entropy calculated using equation 3.1. The information contained in the echo states changes over time according to the input amplitude. Therefore, the richness of the representation is controlled by the input amplitude. Moreover, the value of ASE increases with the spectral radius. (C) ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius.

Maximizing ASE means that the diversity of the states over time is the largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform the best, because the basis set in ESNs is created independently of the desired response, and the application may require a small spectral radius. However, we maintain that when the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector. The following section addresses the design of ESNs with high ASE values and a simple mechanism to adjust the reservoir dynamics without changing the recurrent connection weights.
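The spectral-radius comparison above can be reproduced in outline by rescaling one random reservoir to each target radius and averaging the instantaneous entropy over time, using the esn_states and average_state_entropy helpers sketched earlier; the reservoir construction below is an assumption in the spirit of the previous experiments, so the ASE values will differ in detail from those reported.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 100
    W0 = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W0 /= np.max(np.abs(np.linalg.eigvals(W0)))          # normalize to spectral radius 1
    W_in = rng.choice([1.0, -1.0], size=(N, 1))
    u = np.sin(2 * np.pi * np.arange(200) / 20)

    for rho in (0.2, 0.5, 0.8):
        X = esn_states(u, W_in, rho * W0)                # same topology, different spectral radius
        print(rho, average_state_entropy(X))             # ASE grows with the spectral radius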
This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system with finite order is achieved when poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, we should uniformly spread the poles to anticipate good approximation to arbitrary mappings.

We again use a maximum entropy principle to distribute the poles inside the unit circle uniformly. The constraints of a circle as boundary conditions for discrete linear systems and complex conjugate locations are easy to include for the pole distribution (Thogula, 2003). The poles are first initialized at random locations; the quadratic Renyi's entropy is calculated by equation 3.1, and poles are moved such that the entropy of the new distribution is increased over iterations (Erdogmus & Principe, 2002). This method is efficient to find uniform coverage of the unit circle with an arbitrary number of poles. The system with the uniform pole locations can be interpreted using linear system theory. The poles that are close to the unit circle correspond to many sharp bandpass filters specializing in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters of different center frequencies.

Now the problem is to construct an internal weight matrix from the pole locations (eigenvalues of W). In principle, we would like to create a sparse

<
>

Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and 0.8, from top to bottom, respectively. The diversity of echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with smaller spectral radius can generate only uneven representations, while for W = 0.8, outputs of echo states almost uniformly distribute within their dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1. Information contained in the echo states is changing over time according to the input amplitude. Therefore, the richness of representation is controlled by the input amplitude. Moreover, the value of ASE increases with spectral radius. (C) ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius.

matrix, so we started with the sparsest matrix (with an inverse), which is the direct canonical structure given by (Kailath, 1980)

<>

The characteristic polynomial of W is

<>, (4.2)

where the pi's are the eigenvalues and the ai's are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2, we can obtain the characteristic polynomial and construct the W matrix in the canonical form using equation 4.1. We will call the ESN constructed based on the uniform pole principle ASE-ESN. All other possible solutions with the same eigenvalues can be obtained by Q^-1 W Q, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN designed with the recurrent weight matrix having the eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with random internal connection weight matrices. We will consider an ESN with 30 states and use our procedure to create the W matrix for ASE-ESN for different spectral radii between [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices with different sparseness constraints. This corresponds to a weight distribution having the values 0, c, and −c with probabilities p1, (1 − p1)/2, and (1 − p1)/2, where p1 defines the sparseness of W and c is a constant that takes a specific value depending on the spectral radius. We also created W matrices with values uniformly distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then, for different Win matrices, we run the ASE-ESNs with the sinusoidal input given in section 3 and calculate ASE. Figure 5 compares the ASE values averaged over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE on average for all spectral radii compared to ESNs with sparse and uniform random connections. This approach is indeed conceptually similar to Jeffreys' maximum entropy prior (Jeffreys, 1946): it will provide a consistently good response for the largest class of problems. Concentrating the poles of the linearized

<
>

Figure 5: Comparison of ASE values obtained for ASE-ESN having W with uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN with uniformly distributed weights between −1 and 1. Randomly generated weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with random connections.

system in certain regions of the space provides good performance only if the desired response has energy in this part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).

4.2 Design of the Adaptive Bias.

In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are only input dependent. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. From the linearization analysis that shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that according to the linearization analysis, the bias can only reduce the spectral radius. The information for adaptation of the bias is the MSE in training, which modulates the spectral radius of the system with the information derived from the approximation error. With this simple mechanism, some information from the input-output joint space is incorporated in the definition of the projection space of the ESN. The beauty of this method is that the spectral radius can be adjusted by a single parameter that is external to the system without changing reservoir weights.

The training of the bias can be easily accomplished. Indeed, since the parameter space is only one-dimensional, a simple line search method can be efficiently employed to optimize the bias. Among different line search algorithms, we will use a search that uses Fibonacci numbers in the selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within the prescribed length. In our problem, a bias value is picked according to Fibonacci search. For each value of bias, training data are applied to the ESN, and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value.

Alternatively, gradient-based methods can be utilized to optimize the bias, due to simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

<>.

The update equation for b is given by

<>

Here, O is the MSE defined previously. This algorithm may suffer from similar problems observed in gradient-based methods in recurrent network training. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of the two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.
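The two design steps of this section can be sketched as follows. The snippet builds W in the direct canonical (companion) form from a set of poles spread over the unit disc and then tunes the external bias with a plain one-dimensional search; it replaces the entropy-maximization pole placement and the Fibonacci search of the text with simpler stand-ins, and run_esn and train_readout are hypothetical user-supplied helpers, so this is only a sketch of the procedure, not the authors' implementation.

import numpy as np

def uniform_disc_poles(n, radius=0.95, rng=None):
    # Stand-in for the maximum-entropy placement: sample poles roughly uniformly
    # over the disc of the given radius, in conjugate pairs so that W is real.
    rng = np.random.default_rng(rng)
    half = n // 2
    r = radius * np.sqrt(rng.uniform(size=half))      # sqrt gives uniform density over the disc
    theta = rng.uniform(0.0, np.pi, size=half)
    poles = r * np.exp(1j * theta)
    poles = np.concatenate([poles, np.conj(poles)])
    if n % 2:
        poles = np.append(poles, radius * rng.uniform(-1.0, 1.0))  # one real pole for odd n
    return poles

def companion_from_poles(poles):
    # Direct canonical form whose eigenvalues are the chosen poles: the first row carries
    # the (negated) characteristic-polynomial coefficients, the sub-diagonal is identity.
    coeffs = np.real(np.poly(poles))                  # [1, a_{N-1}, ..., a_0]
    n = len(poles)
    W = np.zeros((n, n))
    W[0, :] = -coeffs[1:]
    W[1:, :-1] = np.eye(n - 1)
    return W

def tune_bias(run_esn, train_readout, u, d, bias_grid=np.linspace(0.0, 4.0, 41)):
    # Grid-search stand-in for the Fibonacci line search over the single bias parameter:
    # for each candidate bias, compute echo states, fit the readout, keep the lowest MSE.
    best = None
    for b in bias_grid:
        states = run_esn(u, bias=b)
        w_out, mse = train_readout(states, d)
        if best is None or mse < best[1]:
            best = (b, mse, w_out)
    return best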
5 Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity.

This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, <>, optimally trained with the desired signal <>, for a given delay k. Denoting the optimal output signal yk(n), the k-delay STM capacity of a network, MCk, is defined as the squared correlation coefficient between u(n − k) and <> (Jaeger, 2002a). The STM capacity, MC, of the network is defined as <>. STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input by an N-unit RNN with linear output units is bounded by N.

We use ESNs with 20 PEs and a single input unit. ESNs are driven by an i.i.d. random input signal, <>, that is uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, <>. We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47, −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or −0.1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0 (Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then the output weight matrix is calculated using equation 2.4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding output and MCk are calculated. Figure 6 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for delays 1, ..., 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have MCs that are much longer than the randomly generated ESN given in Jaeger (2002a) in spite of all having the same spectral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximum value of N = 20. A closer look at the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9. In fact, for small k, large ASE degrades the performance because the tasks do not need long memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN, which reduces the ASE to the appropriate level required for the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90.
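A minimal sketch of the STM measurement just described is given below: a linear readout is fit for each delay k and MCk is taken as the squared correlation between the readout output and u(n − k). We assume the readout of equation 2.4 is an ordinary least-squares fit and evaluate on the same run for brevity (the text trains on one signal and reports MCk on a test signal); names are ours.

import numpy as np

def stm_capacity(states, u, max_delay=40, washout=100):
    # states: (T, N) echo states generated by the input u of length T.
    X = states[washout:]
    mc_k = []
    for k in range(1, max_delay + 1):
        d = u[washout - k: len(u) - k]              # desired signal u(n - k)
        w, *_ = np.linalg.lstsq(X, d, rcond=None)   # least-squares readout (our stand-in for eq. 2.4)
        y = X @ w
        rho = np.corrcoef(y, d)[0, 1]
        mc_k.append(rho ** 2)
    return np.array(mc_k), float(np.sum(mc_k))      # per-delay capacities and their sum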
On the other hand, U-ESN, whose weights take many distinct values, achieves slightly better STM than R-ESN, which uses only three different weight values. It is also significant to note that the MC will be very poor for an ESN with a smaller spectral radius even with an adaptive bias, since the problem requires large ASE and the bias can only reduce ASE. This experiment demonstrates the

<
>

Figure 6: The k-delay STM capacity of each ESN for delays 1, ..., 40 computed using the test signal. The results are averaged over 100 different realizations of each ESN type with the specifications given in the text for different W and Win matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively.

suitability of maximizing ASE in tasks that require a substantial memory length.

5.2 Binary Parity Check.

The effect of the adaptive bias was marginal in the previous experiment since the nature of the problem required large ASE values. However, there are tasks in which the optimal solutions require smaller ASE values and smaller spectral radius. Those are the tasks where the adaptive bias becomes a crucial design parameter in our design methodology.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train an ESN to generate the m-bit parity corresponding to the last m bits received, where m is 3, ..., 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.06, −0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0. The echo states are calculated using equation 2.1 for 1000 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then the output weight

<
>

Figure 7: The number of wrong decisions made by each ESN for m = 3, ..., 8 in the binary parity check problem. The results are averaged over 100 different realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and Win matrices with the specifications given in the text. The total numbers of wrong decisions for m = 3, ..., 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699.

matrix is calculated using equation 2.4. For the ESN with adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 7 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m = 3, ..., 8.

The total numbers of wrong decisions for m = 3, ..., 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. BASE-ESN performs a lot better than ASE-ESN and slightly better than the R-ESN since the adaptive bias reduces the spectral radius effectively. Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN, since the task requires access to a longer input history, which compromises the need for fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and 2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide range of bias values that result in similar MSE values (between 0 and 3). In summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

5.3 System Identification.

This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

<>,

where

<>.

The input to the system is chosen to be <>.

We used three different ESNs (R-ESN, ASE-ESN, and BASE-ESN) with 30 internal units and a single input unit. The W matrix of each ESN is scaled such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.35, −0.35 with probabilities 0.8, 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0. The optimal output weights are calculated using equation 2.4. The MSE values (averaged over 100 realizations) for R-ESN and ASE-ESN are 1.23 x 10^-5 and 1.83 x 10^-6, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83 x 10^-6 to 3.27 x 10^-9.

6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structures without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed with training limited to the linear output layer. However, the literature did not elucidate how to properly choose the recurrent parameters for system identification applications.
Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be the largest to achieve the smallest correlation among the bases and be able to cope with arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method when applied to ESNs built from sigmoidal nonlinearities is that it allows the fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the basis functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that the design of the ESN using the maximization of ASE with the adaptation of the spectral radius by the bias has provided consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferred to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.

Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems "at the edge of chaos" (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with "critical" parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one).
However, allowing the system to modu- + late the spectral radius by either the output or internal biasing may allow + a system close to instability to solve various problems requiring different + spectral radii. + + Our emphasis here is mostly on ESNs without output feedback connec- + tions. However, the proposed design methodology can also be applied to + ESNs with output feedback. Both feedforward and feedback connections + contribute to specify the bases to create the projection space. At the same + time, there are applications where the output feedback contributes to the + system dynamics in a different fashion. For example, it has been shown that + a fixed weight (fully trained) RNN with output feedback can implement a + family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). + In meta-learning, the role of output feedback in the network is to bias the + system to different regions of dynamics, providing multiple input-output + mappings required (Santiago & Lendaris, 2004). However, results could not + be replicated with ESNs (Prokhorov, 2005). We believe that more work has + to be done on output feedback in the context of ESNs but also suspect that + the echo state condition may be a restriction on the system dynamics for + this type of problem. + + There are many interesting issues to be researched in this exciting new + area. Besides an evaluation tool, ASE may also be utilized to train the ESN’s + representation layer in an unsupervised fashion. In fact, we can easily adapt + withtheSIG(stochasticinformationgradient)describedinErdogmus,Hild, + and Principe (2003): extra weights linking the outputs of recurrent states to + maximize output entropy. Output entropy maximization is a well-known + metric to create independent components (Bell & Sejnowski, 1995), and + here it means that the echo states will become as independent as possible. + This would circumvent the linearization of the dynamical system to set the + recurrent weights and would fine-tune continuously in an unsupervised + manner the parameters of the ESN among different inputs. However, it + goes against the idea of a fixed ESN reservoir. + + The reservoir of recurrent PEs can be thought of as a new form of a time- + to-space mapping. Unlike the delay line that forms an embedding (Takens, + 1981), this mapping may have the advantage of filtering noise and produce + representations with better SNRs to the peaks of the input, which is very + appealing for signal processing and seems to be used in biology. However, + further theoretical work is necessary in order to understand the embedding + capabilities of ESNs. One of the disadvantages of the ESN correlated basis + is in the design of the readout. Gradient-based algorithms will be very + slow to converge (due to the large eigenvalue spread of modes), and even + if recursive methods are used, their stability may be compromised by the + condition number of the matrix. However, our recent results incorporating + anL1 norm penalty in the LMS (Rao et al., 2005) show great promise of + solving this problem. + + Finally we would like to briefly comment on the implications of these + models to neurobiology and computational neuroscience. The work by + Pouget and Sejnowski (1997) has shown that the available physiological + data are consistent with the hypothesis that the response of a single neuron + in the parietal cortex serves as a basis function generated by the sensory + input in a nonlinear fashion. 
In other words, the neurons transform the + sensory input into a format (representation space) such that the subsequent + computation is simplified. Then, whenever a motor command (output of + the biological system) needs to be generated, this simple computation to + read out the neuronal activity is done. There is an intriguing similarity + betweentheinterpretationoftheneuronalactivitybyPougetandSejnowski + and our interpretation of echo states in ESN. We believe that similar ideas + can be applied to improve the design of microcircuit implementations of + LSMs. First, the framework of functional space interpretation (bases and + projections) is also applicable to microcircuits. Second, the ASE measure + may be directly utilized for LSM states because the states are normally low- + pass-filtered before the readout. However, the control of ASE by changing + the liquid dynamics is unclear. Perhaps global control of thresholds or bias + current will be able to accomplish bias control as in ESN with sigmoid + PEs. + + + Acknowledgments + + This work was partially supported by NSFECS-0422718, NSFCNS-0540304, + and ONR N00014-1-1-0405. + + + References + + Amari, S.-I. (1990).Differential-geometrical methods in statistics.NewYork:Springer. + Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categor- + ical perception, and probability learning: Some applications of a neural model. + Psychological Review, 84, 413–451. + Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach + to blind separation and blind deconvolution.Neural Computation, 7(6), 1129– + 1159. + Bertschinger,N.,&Natschlager,T.(2004).Real-timecomputationattheedgeofchaos¨ + in recurrent neural networks.Neural Computation, 16(7), 1413–1436. + Cox,R.T.(1946).Probability,frequency,andreasonableexpectation.AmericanJournal + of Physics, 14(1), 1–13. + de Vries, B. (1991).Temporal processing with neural networks—the development of the + gamma model. Unpublished doctoral dissertation, University of Florida. + Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural + network for system identification and control.IEEE Proceedings of Control Theory + and Applications, 142(4), 307–314. + Elman, J. L. (1990). Finding structure in time.Cognitive Science, 14(2), 179–211. + Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: + Stochastic information gradient.Signal Processing Letters, 10(8), 242–245. + Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for + adaptive system training.IEEE Transactions on Neural Networks, 13(5), 1035–1044. + Feldkamp,L.A.,Prokhorov,D.V.,Eagen,C.,&Yuan,F.(1998).Enhancedmultistream + Kalman filter training for recurrent networks. In J. Suykens, & J. Vandewalle + (Eds.),Nonlinear modeling: Advanced black-box techniques(pp. 29–53). Dordrecht, + Netherlands: Kluwer. 136 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + Haykin,S.(1998).Neuralnetworks:Acomprehensivefoundation(2nded.).UpperSaddle + River, NJ. Prentice Hall. + Haykin, S. (2001).Adaptive filter theory(4th ed.). Upper Saddle River, NJ: Prentice + Hall. + Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.Neural Computa- + tion, 9(8), 1735–1780. + Hopfield, J. (1984). Neurons with graded response have collective computational + properties like those of two-state neurons.Proceedings of the National Academy of + Sciences, 81, 3088–3092. + Ito, Y. (1996). 
Nonlinearity creates linear independence.Advances in Computer Math- + ematics, 5(1), 189–203. + Jaeger, H. (2001).The echo state approach to analyzing and training recurrent neural + networks(Tech. Rep. No. 148). Bremen: German National Research Center for + Information Technology. + Jaeger, H. (2002a).Short term memory in echo state networks(Tech. Rep. No. 152). + Bremen: German National Research Center for Information Technology. + Jaeger, H. (2002b).Tutorial on training recurrent neural networks, covering BPPT, RTRL, + EKF and the “echo state network” approach(Tech. Rep. No. 159). Bremen: German + National Research Center for Information Technology. + Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems + and saving energy in wireless communication.Science, 304(5667), 78–80. + Jeffreys,H.(1946).Aninvariantformforthepriorprobabilityinestimationproblems. + Proceedings of the Royal Society of London, A 196, 453–461. + Kailath, T. (1980).Linear systems. Upper Saddle River, NJ: Prentice Hall. + Kautz, W. (1954). Transient synthesis in time domain.IRE Transactions on Circuit + Theory, 1(3), 29–39. + Kechriotis,G.,Zervas,E.,&Manolakos,E.S.(1994). Usingrecurrentneuralnetworks + for adaptive communication channel equalization.IEEE Transactions on Neural + Networks, 5(2), 267–278. + Kremer,S.C.(1995).OnthecomputationalpowerofElman-stylerecurrentnetworks. + IEEE Transactions on Neural Networks, 6(5), 1000–1004. + Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998).Elements of applied bifurcation + theory(2nd ed.). New York: Springer-Verlag. + Langton, C. G. (1990). Computation at the edge of chaos.Physica D, 42, 12–37. + Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the + computational power and generalization capability of neural microcircuits. In + L. K. Saul, Y. Weiss, L. Bottou (Eds.),Advances in neural information processing + systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press. + Maass, W., Natschlager, T., & Markram, H. (2002). Real-time computing without¨ + stable states: A new framework for neural computation based on perturbations. + Neural Computation, 14(11), 2531–2560. + Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: + Evolving cellular automata to perform computations.Complex Systems, 7, 89– + 130. + Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. + Mandell, & M. F. Shlesinger (Eds.),Dynamic patterns in complex systems(pp. 293– + 301). Singapore: World Scientific. Analysis and Design of Echo State Networks 137 + + + Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex + using basis functions.Journal of Cognitive Neuroscience, 9(2), 222–237. + Principe, J. (2001). Dynamic neural networks and optimal signal processing. In + Y. Hu & J. Hwang (Eds.),Neural networks for signal processing(Vol. 6-1, pp. 6– + 28). Boca Raton, FL: CRC Press. + Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new + class of adaptive IIR filters with restricted feedback.IEEE Transactions on Signal + Processing, 41(2), 649–656. + Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin + (Ed.),Unsupervised adaptive filtering(pp. 265–319). Hoboken, NJ: Wiley. + Prokhorov, D. (2005). Echo state networks: Appeal and challenges. InProc. of Inter- + national Joint Conference on Neural Networks(pp. 1463–1466). Montreal, Canada. + Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). 
Adaptive behavior with fixed + weights in recurrent neural networks: An overview. InProc. of International Joint + Conference on Neural Networks(pp. 2018–2022). Honolulu, Hawaii. + Puskorius,G.V.,&Feldkamp,L.A.(1994).Neurocontrolofnonlineardynamicalsys- + tems with Kalman filter trained recurrent networks.IEEE Transactions on Neural + Networks, 5(2), 279–297. + Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods ap- + plied to on-vehicle idle speed control.Proceedings of IEEE, 84(10), 1407–1420. + Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, + M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with + echo state networks. In2005 IEEE International Conference on Acoustics, Speech, and + Signal Processing. Philadelphia. + Renyi, A. (1970).Probability theory. New York: Elsevier. + Sanchez, J. C. (2004).From cortical neural spike trains to behavior: Modeling and analysis. + Unpublished doctoral dissertation, University of Florida. + Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction net- + works: Reformulating fixed weight neural networks. InProc. of International Joint + Conference on Neural Networks(pp. 189–194). Budapest, Hungary. + Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in + multilayer perceptrons.IEEE Transactions on Neural Networks, 10(1), 10–18. + Shannon,C.E.(1948).Amathematicaltheoryofcommunication.BellSystemTechnical + Journal, 27, 623–656. + Siegelmann, H. T. (1993).Foundations of recurrent neural networks. Unpublished doc- + toral dissertation, Rutgers University. + Siegelmann,H.T.,&Sontag,E.(1991).Turingcomputabilitywithneuralnets.Applied + Mathematics Letters, 4(6), 77–80. + Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended + Kalman algorithm. In D. S. Touretzky (Ed.),Advances in neural information process- + ing systems, 1(pp. 133–140). San Mateo, CA: Morgan Kaufmann. + Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. + Young (Eds.),Dynamical systems and turbulence(pp. 366–381). Berlin: Springer. + Thogula, R. (2003).Information theoretic self-organization of multiple agents.Unpub- + lished master’s thesis, University of Florida. + Werbos, P. (1990). Backpropagation through time: What it does and how to do it. + Proceedings of IEEE, 78(10), 1550–1560. 138 M. Ozturk, D. Xu, and J. Pr´ıncipe + + + Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evalua- + tion. In D. White & D. Sofge (Eds.),Handbook of intelligent control(pp. 65–89). New + York: Van Nostrand Reinhold. + Wilde, D. J. (1964).Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall. + Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running + fully recurrent neural networks.Neural Computation, 1, 270–280. +<> <> <> + + +<> <> <> + Bayesian Compression for Deep Learning + + Christos Louizos Karen Ullrich Max Welling + University of Amsterdam University of Amsterdam University of Amsterdam + TNO Intelligent Imaging k.ullrich@uva.nl CIFAR + c.louizos@uva.nl m.welling@uva.nl + + + Abstract + + Compression and computational efficiency in deep learning have become a problem + of great significance. In this work, we argue that the most principled and effective + way to attack this problem is by adopting a Bayesian point of view, where through + sparsity inducing priors we prune large parts of the network. 
We introduce two + novelties in this paper: 1) we use hierarchical priors to prune nodes instead of + individual weights, and 2) we use the posterior uncertainties to determine the + optimal fixed point precision to encode the weights. Both factors significantly + contribute to achieving the state of the art in terms of compression rates, while + still staying competitive with methods designed to optimize for speed or energy + efficiency. + + + 1 Introduction + + While deep neural networks have become extremely successful in in a wide range of applications, + often exceeding human performance, they remain difficult to apply in many real world scenarios. For + instance, making billions of predictions per day comes with substantial energy costs given the energy + consumption of common Graphical Processing Units (GPUs). Also, real-time predictions are often + about a factor100away in terms of speed from what deep NNs can deliver, and sending NNs with + millions of parameters through band limited channels is still impractical. As a result, running them on + hardware limited devices such as smart phones, robots or cars requires substantial improvements on + all of these issues. For all those reasons, compression and efficiency have become a topic of interest + in the deep learning community. + While all of these issues are certainly related, compression and performance optimizing procedures + might not always be aligned. As an illustration, consider the convolutional layers of Alexnet, which + account for only 4% of the parameters but 91% of the computation [68]. Compressing these layers + will not contribute much to the overall memory footprint. + There is a variety of approaches to address these problem settings. However, most methods have + the common strategy of reducing both the neural network structure and the effective fixed point + precision for each weight. A justification for the former is the finding that NNs suffer from significant + parameter redundancy [14]. Methods in this line of thought are network pruning, where unnecessary + connections are being removed [40,24,21], or student-teacher learning where a large network is + used to train a significantly smaller network [5, 27]. + From a Bayesian perspective network pruning and reducing bit precision for the weights is aligned + with achieving high accuracy, because Bayesian methods search for the optimal model structure + (which leads to pruning with sparsity inducing priors), and reward uncertain posteriors over parameters + through the bits back argument [28] (which leads to removing insignificant bits). This relation is + made explicit in the MDL principle [20] which is known to be related to Bayesian inference. + + In this paper we will use the variational Bayesian approximation for Bayesian inference which has + also been explicitly interpreted in terms of model compression [28]. By employing sparsity inducing + priors for hidden units (and not individual weights) we can prune neurons including all their ingoing + and outgoing weights. This avoids more complicated and inefficient coding schemes needed for + pruning or vector quantizing individual weights. As an additional Bayesian bonus we can use the + variational posterior uncertainty to assess which bits are significant and remove the ones which + fluctuate too much under approximate posterior sampling. From this we derive the optimal fixed + point precision per layer, which is still practical on chip. 
2 Variational Bayes and Minimum Description Length

A fundamental theorem in information theory is the minimum description length (MDL) principle [20]. It relates to compression directly in that it defines the best hypothesis to be the one that communicates the sum of the model (complexity cost LC) and the data misfit (error cost LE) with the minimum number of bits [59,60]. It is well understood that variational inference can be reinterpreted from an MDL point of view [56,72,28,30,19]. More specifically, assume that we are presented with a dataset D that consists of N input-output pairs <>. Let <> be a parametric model, e.g. a deep neural network, that maps inputs x to their corresponding outputs y using parameters w governed by a prior distribution <>. In this scenario, we wish to approximate the intractable posterior distribution <> with a fixed-form approximate posterior <> by optimizing the variational parameters according to:

<>

where <> denotes the entropy and <> is known as the evidence lower bound (ELBO) or negative variational free energy. As indicated in eq. 1, <> naturally decomposes into a minimum cost for communicating the targets <> under the assumption that the sender and receiver agreed on a prior <> and that the receiver knows the inputs <> and the form of the parametric model.

By using sparsity inducing priors for groups of weights that feed into a neuron, the Bayesian mechanism will start pruning hidden units that are not strictly necessary for prediction, thus achieving compression. But there is also a second mechanism by which Bayes can help us compress. By explicitly entertaining noisy weight encodings through <> we can benefit from the bits-back argument [28,30] due to the entropy term; this is in contrast to infinitely precise weights that lead to <>.

2 In practice this term is a large constant determined by the weight precision.

Nevertheless, in practice the data misfit term LE is intractable for neural network models under a noisy weight encoding, so as a solution Monte Carlo integration is usually employed. Continuous distributions q(w) allow for the reparametrization trick [36,58]. Here, we replace sampling from q(w) by a deterministic function of the variational parameters and random samples from some noise variables:

<>; (2)

where <>. By applying this trick, we obtain unbiased stochastic gradients of the ELBO with respect to the variational parameters, thus resulting in a standard optimization problem that is fit for stochastic gradient ascent. The efficiency of the gradient estimator resulting from eq. 2 can be further improved for neural networks by utilizing local reparametrizations [37] (which we will use in our experiments); they provide variance reduction in an efficient way by locally marginalizing the weights at each layer and instead sampling the distribution of the pre-activations.

3 Related Work

One of the earliest ideas and most direct approaches to tackle efficiency is pruning. Originally introduced by [40], pruning has recently been demonstrated to be applicable to modern architectures [25,21]. It has been demonstrated that an overwhelming amount of up to 99.5% of parameters can be pruned in common architectures. There have been quite a few encouraging results obtained by (empirical) Bayesian approaches that employ weight pruning [19,7,52,70,51]. Nevertheless,
weight pruning is in general inefficient for compression since the matrix format of the weights is not taken into consideration, therefore the Compressed Sparse Column (CSC) format has to be employed. Moreover, note that in conventional CNNs most flops are used by the convolution operation. Inspired by this observation, several authors proposed pruning schemes that take these considerations into account [73, 74] or even go as far as efficiency aware architectures to begin with [32, 15, 31]. From the Bayesian viewpoint, similar pruning schemes have been explored at [47, 53, 39, 34].

Given an optimal architecture, NNs can further be compressed by quantization. More precisely, there are two common techniques. First, the set of accessible weights can be reduced drastically. As an extreme example, [13,48,57,76] and [11] trained NNs to use only binary or ternary weights with floating point gradients. This approach however is in need of significantly more parameters than their ordinary counterparts. Work by [18] explores various techniques beyond binary quantization: k-means quantization, product quantization and residual quantization. Later studies extend this set to optimal fixed point [44] and hashing quantization [10]. [25] apply k-means clustering and consequent center training. From a practical point of view, however, all these are fairly impractical during test time. For the computation of each feature map in a net, the original weight matrix must be reconstructed from the indexes in the matrix and a codebook that contains all the original weights. This is an expensive operation and this is why some studies propose a different approach than set quantization. Precision quantization simply reduces the bit size per weight. This has a great advantage over set quantization at inference time since feature maps can simply be computed with lower precision weights. Several studies show that this has little to no effect on network accuracy when using 16-bit weights [49,22,12,71,9]. Somewhat orthogonal to the above discussion but certainly relevant are approaches that customize the implementation of CNNs for hardware limited devices [31, 4, 62].

4 Bayesian compression with scale mixtures of normals

Consider the following prior over a parameter w where its scale z is governed by a distribution <>:

<>; (3)

with z^2 serving as the variance of the zero-mean normal distribution over w. By treating the scales of w as random variables we can recover marginal prior distributions over the parameters that have heavier tails and more mass at zero; this subsequently biases the posterior distribution over w to be sparse. This family of distributions is known as scale-mixtures of normals [6,2] and it is quite general, as a lot of well known sparsity inducing distributions are special cases.

One example of the aforementioned framework is the spike-and-slab distribution [50], the golden standard for sparse Bayesian inference. Under the spike-and-slab, the mixing density of the scales is a Bernoulli distribution, thus the marginal <> has a delta "spike" at zero and a continuous "slab" over the real line. Unfortunately, this prior leads to a computationally expensive inference since we have to explore a space of 2^M models, where M is the number of the model parameters.
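To make the shrinkage behaviour of such priors tangible, the short sketch below draws weights from one member of this family, a normal whose scale follows a half-Cauchy (the horseshoe case discussed in section 4.2), and compares the mass near zero with a plain Gaussian of unit scale; the sampler and the comparison are our own illustration, not part of the paper.

import numpy as np

def sample_scale_mixture(n, tau=1.0, rng=None):
    # w ~ N(0, z^2) with z ~ C+(0, tau): conditionally Gaussian, marginally heavy-tailed
    # with extra mass near zero.
    rng = np.random.default_rng(rng)
    z = np.abs(tau * rng.standard_cauchy(n))
    return rng.normal(0.0, z)

w_mix = sample_scale_mixture(100_000, tau=1.0, rng=0)
w_gauss = np.random.default_rng(0).normal(0.0, 1.0, 100_000)
print(np.mean(np.abs(w_mix) < 0.05), np.mean(np.abs(w_gauss) < 0.05))  # mixture puts more mass near 0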
Dropout [29,67], one of the most popular regularization techniques for neural networks, can be interpreted as positing a spike and slab distribution over the weights where the variance of the "slab" is zero [17,45]. Another example is the Laplace distribution which arises by considering <>. The mode of the posterior distribution under a Laplace prior is known as the Lasso [69] estimator and has been previously used for sparsifying neural networks at [73,61]. While computationally simple, the Lasso estimator is prone to "shrinking" large signals [8] and only provides point estimates about the parameters. As a result it does not provide uncertainty estimates, it can potentially overfit and, according to the bits-back argument, is inefficient for compression.

For these reasons, in this paper we will tackle the problem of compression and efficiency in neural networks by adopting a Bayesian treatment and inferring an approximate posterior distribution over the parameters under a scale mixture prior. We will consider two choices for the prior over the scales p(z): the hyperparameter free log-uniform prior [16,37] and the half-Cauchy prior, which results into a horseshoe [8] distribution. Both of these distributions correspond to a continuous relaxation of the spike-and-slab prior and we provide a brief discussion on their shrinkage properties at Appendix C.

4.1 Reparametrizing variational dropout for group sparsity

One potential choice for p(z) is the improper log-uniform prior [37]: <>. It turns out that we can recover the log-uniform prior over the weights w if we marginalize over the scales z:

<> (4)

This alternative parametrization of the log-uniform prior is known in the statistics literature as the normal-Jeffreys prior and has been introduced by [16]. This formulation allows us to "couple" the scales of weights that belong to the same group (e.g. neuron or feature map), by simply sharing the corresponding scale variable z in the joint prior 3:

<>; (5)

where W is the weight matrix of a fully connected neural network layer with A being the dimensionality of the input and B the dimensionality of the output. Now consider performing variational inference with a joint approximate posterior parametrized as follows:

<>; (6)

where α_i is the dropout rate [67,37,51] of the given group. As explained at [37,51], the multiplicative parametrization of the approximate posterior over z suffers from high variance gradients; therefore we will follow [51] and re-parametrize it in terms of <>, hence optimize w.r.t. σ_z^2. The <> lower bound under this prior and approximate posterior becomes:

<> (7)

Under this particular variational posterior parametrization the negative KL-divergence from the conditional prior <> to the approximate posterior <> is independent of z:

<> (8)

This independence can be better understood if we consider a non-centered parametrization of the prior [55]. More specifically, consider reparametrizing the weights as w~ij = wij / zi; this will then result into <>, where <>. Now if <> and <> we perform variational inference under the p(W~)p(z) prior with an approximate posterior that has the form of <>, with <>, then we see that we arrive at the same expressions for the negative KL-divergence from the prior to the approximate posterior.
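For reference, the local reparametrization mentioned in section 2 looks as follows for a fully connected layer with a fully factorized Gaussian posterior over the weights: the pre-activations are sampled directly from their implied Gaussian instead of sampling a weight matrix. The group and non-centered bookkeeping of equations 5 to 8 is omitted here, so this is only a generic sketch with our own names, not the paper's exact forward pass (that is given in their Appendix F).

import numpy as np

def fc_local_reparam(x, mu, log_sigma2, rng=None):
    # x: (batch, A) inputs; mu, log_sigma2: (A, B) posterior means and log-variances of W.
    # b_j = sum_i x_i w_ij is Gaussian with mean x @ mu and variance (x**2) @ sigma^2.
    rng = np.random.default_rng(rng)
    mean = x @ mu
    var = (x ** 2) @ np.exp(log_sigma2)
    eps = rng.standard_normal(mean.shape)
    return mean + np.sqrt(var) * eps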
Finally, the negative KL-divergence from the normal-Jeffreys scale prior p(z) to the Gaussian variational posterior q(z) depends only on the "implied" dropout rate, <>, and takes the following form [51]:

<>; (9)

where <> are the sigmoid and softplus functions respectively 4 and k1 = 0.63576, k2 = 1.87320, k3 = 1.48695. We can now prune entire groups of parameters by simply specifying a threshold for the variational dropout rate of the corresponding group, e.g. <>. It should be mentioned that this prior parametrization readily allows for a more flexible marginal posterior over the weights as we now have a compound distribution, <>; this is in contrast to the original parametrization and the Gaussian approximations employed by [37,51].

3 Strictly speaking the result of eq. 4 only holds when each weight has its own scale and not when that scale is shared across multiple weights. Nevertheless, in practice we obtain a prior that behaves in a similar way, i.e. it biases the variational posterior to be sparse.

<>

Furthermore, this approach generalizes the low variance additive parametrization of variational dropout proposed for weight sparsity at [51] to group sparsity (which was left as an open question at [51]) in a principled way.

At test time, in order to have a single feedforward pass we replace the distribution over W at each layer with a single weight matrix, the masked variational posterior mean:

<>; (10)

where m is a binary mask determined according to the group variational dropout rate and MW are the means of q(W~). We further use the variational posterior marginal variances 5 for this particular posterior approximation:

<>; (11)

to assess the bit precision of each weight in the weight matrix. More specifically, we employed the mean variance across the weight matrix W^ to compute the unit round off necessary to represent the weights. This method will give us the number of significant bits, and by adding 3 exponent and 1 sign bits we arrive at the final bit precision for the entire weight matrix W^ 6. We provide more details at Appendix B.

4.2 Group horseshoe with half-Cauchy scale priors

Another choice for p(z) is a proper half-Cauchy distribution: <>; it induces a horseshoe prior [8] distribution over the weights, which is a well known sparsity inducing prior in the statistics literature. More formally, the prior hierarchy over the weights is expressed as (in a non-centered parametrization):

<>; (12)

where τ0 is the free parameter that can be tuned for specific desiderata. The idea behind the horseshoe is that of "global-local" shrinkage; the global scale variable s pulls all of the variables towards zero, whereas the heavy tailed local variables zi can compensate and allow for some weights to escape. Instead of directly working with the half-Cauchy priors we will employ a decomposition of the half-Cauchy that relies upon (inverse) gamma distributions [54], as this will allow us to compute the negative KL-divergence from the scale prior p(z) to an approximate log-normal scale posterior q(z) in closed form (the derivation is given in Appendix D). More specifically, we have that the half-Cauchy prior can be expressed in a non-centered parametrization as:

<>; (13)

where <> correspond to the inverse Gamma and Gamma distributions in the scale parametrization, and z follows a half-Cauchy distribution with scale k.
Therefore we will re-express the whole hierarchy as:

<>; (14)

It should be mentioned that the improper log-uniform prior is the limiting case of the horseshoe prior when the shapes of the (inverse) Gamma hyperpriors on <> go to zero [8]. In fact, several well known shrinkage priors can be expressed in this form by altering the shapes of the (inverse) Gamma hyperpriors [3]. For the variational posterior we will employ the following mean field approximation:

<>,

where <> is a log-normal distribution.

5 Notice that the fact that we are using mean-field variational approximations (which we chose for simplicity) can potentially underestimate the variance, thus lead to higher bit precisions for the weights. We leave the exploration of more involved posteriors for future work.

It should be mentioned that a similar form of non-centered variational inference for the horseshoe has been also successfully employed for undirected models at [33]. Notice that we can also apply local reparametrizations [37] when we are sampling <> and s_a s_b by exploiting properties of the log-normal distribution 7 and thus forming the implied:

<> (17)

As a threshold rule for group pruning we will use the negative log-mode 8 of the local log-normal r.v. <>, i.e. prune when <>, with <>. This ignores the dependencies among the zi elements induced by the common scale s, but nonetheless we found that it works well in practice. Similarly with the group normal-Jeffreys prior, we will replace the distribution over W at each layer with the masked variational posterior mean during test time:

<>; (19)

where m is a binary mask determined according to the aforementioned threshold, MW are the means of q(W~), and μ, σ^2 are the means and variances of the local log-normals over <>. Furthermore, similarly to the group normal-Jeffreys approach, we will use the variational posterior marginal variances:

<>; (20)

to compute the final bit precision for the entire weight matrix W.

5 Experiments

We validated the compression and speed-up capabilities of our models on the well-known architectures of LeNet-300-100 [41], LeNet-5-Caffe 9 on MNIST [42] and, similarly with [51], VGG [63] 10 on CIFAR 10 [38]. The groups of parameters were constructed by coupling the scale variables for each filter for the convolutional layers and for each input neuron for the fully connected layers. We provide the algorithms that describe the forward pass using local reparametrizations for fully connected and convolutional layers with each of the employed approximate posteriors at appendix F. For the horseshoe prior we set the scale τ0 of the global half-Cauchy prior to a reasonably small value, e.g. τ0 = 1e-5. This further increases the prior mass at zero, which is essential for sparse estimation and compression. We also found that constraining the standard deviations as described at [46] and "warm-up" [65] help in avoiding bad local optima of the variational objective. Further details about the experimental setup can be found at Appendix A. Determining the threshold for pruning can be easily done with manual inspection as usually there are two well separated clusters (signal and noise). We provide a sample visualization at Appendix E.

5.1 Architecture learning & bit precisions

We will first demonstrate the group sparsity capabilities of our methods by illustrating the learned architectures at Table 1, along with the inferred bit precision per layer.
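As a rough illustration of how the learned architectures in Table 1 come about, the sketch below applies the normal-Jeffreys pruning rule of section 4.1 to one fully connected layer: groups whose implied log dropout rate exceeds a threshold are removed, and the remaining weights are collapsed to a deterministic matrix. The exact expression of equation 10 is not reproduced in this extraction, so the masked mean used here (posterior mean of the scale times the mean of W~ for kept groups) is our reading of it, and all names and the threshold value are assumptions.

import numpy as np

def prune_groups(mu_z, sigma2_z, M_w, log_alpha_thresh=3.0):
    # mu_z, sigma2_z: (A,) posterior means/variances of the group scales z (one per input neuron).
    # M_w: (A, B) posterior means of the non-centered weights W~.
    log_alpha = np.log(sigma2_z) - np.log(mu_z ** 2)   # implied dropout rate per group
    keep = log_alpha < log_alpha_thresh                # prune groups with a large dropout rate
    W_hat = (keep * mu_z)[:, None] * M_w               # deterministic test-time weights (assumed form)
    return W_hat, keep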
As we can observe, our + methods infer significantly smaller architectures for the LeNet-300-100 and LeNet-5-Caffe, compared + to Sparse Variational Dropout, Generalized Dropout and Group Lasso. Interestingly, we observe + that for the VGG network almost all of big 512 feature map layers are drastically reduced to around + 10 feature maps whereas the initial layers are mostly kept intact. Furthermore, all of the Bayesian + methods considered require far fewer than the standard 32 bits per-layer to represent the weights, + sometimes even allowing for 5 bit precisions. + + The product of log-normal r.v.s is another log-normal and a power of a log-normal r.v. is another log-normal. + Empirically, it slightly better separates the scales compared to the negative log-mean <>. + https://github.com/BVLC/caffe/tree/master/examples/mnist + The adapted CIFAR 10 version described athttp://torch.ch/blog/2015/07/30/cifar.html. + + Table 1: Learned architectures with Sparse VD [51], Generalized Dropout (GD) [66] and Group + Lasso (GL) [73]. Bayesian Compression (BC) with group normal-Jeffreys (BC-GNJ) and group + horseshoe (BC-GHS) priors correspond to the proposed models. We show the amount of neurons left + after pruning along with the average bit precisions for the weights at each layer. + + <
5.2 Compression Rates

For the actual compression task we compare our method to current work in three different scenarios: (i) compression achieved only by pruning, where for non-group methods we use the CSC format to store parameters; (ii) compression based on the former but with reduced bit precision per layer (only for the weights); and (iii) the maximum compression rate as proposed by [25].

Table 2: Compression results for our methods. "DC" corresponds to the Deep Compression method introduced in [25], "DNS" to the method of [21] and "SWS" to the Soft-Weight Sharing of [70]. Numbers marked with * are best case guesses.

<
>
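For scenario (i), the surviving weights of non-group methods are simply stored in a sparse format; the snippet below illustrates CSC storage and its byte count with SciPy. The choice of library and the 95% sparsity level are ours, purely for illustration.

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    W = rng.standard_normal((300, 100)).astype(np.float32)
    W[rng.random(W.shape) < 0.95] = 0.0          # pretend 95% of the weights were pruned

    W_csc = sparse.csc_matrix(W)                 # values + row indices + column pointers
    csc_bytes = W_csc.data.nbytes + W_csc.indices.nbytes + W_csc.indptr.nbytes
    print(f"dense: {W.nbytes} bytes, CSC: {csc_bytes} bytes")

Scenario (iii), detailed in the next paragraph, would additionally replace the stored float values with short indices into a k-means codebook.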
We believe these to be relevant scenarios because (i) can be applied with already existing frameworks such as Tensorflow [1], and (ii) is a practical scheme given that upcoming GPUs and frameworks will be designed to work with low and mixed precision arithmetic [43, 23]. For (iii), we perform k-means clustering on the weights with k=32 and consequently store a weight index that points to a codebook of available weights. Note that the latter achieves the highest compression rate, but it is fairly impractical at test time since the original matrix needs to be restored for each layer. As we can observe in Table 2, our methods are competitive with the state-of-the-art for LeNet-300-100 while offering significantly better compression rates on the LeNet-5-Caffe architecture, without any loss in accuracy. Do note that group sparsity and weight sparsity can be combined so as to further prune some weights when a particular group is not removed; thus we can potentially further boost compression performance, e.g. on LeNet-300-100. For the VGG network we observed that training from a random initialization consistently yielded lower accuracy (around 1%-2% less) compared to initializing the means of the approximate posterior from a pretrained network, similarly to [51], thus we only report the latter results. After initialization we trained the VGG network regularly for 200 epochs using Adam with the default hyperparameters. We observe a small drop in accuracy for the final models when using the deterministic version of the network for prediction, but nevertheless averaging across multiple samples restores the original accuracy. Note that in general we can maintain the original accuracy on VGG without sampling by simply finetuning with a small learning rate, as done in [51]. This will still induce (less) sparsity but unfortunately it does not lead to good compression, as the bit precision remains very high due to not appropriately increasing the marginal variances of the weights.

5.3 Speed and energy consumption

We demonstrate that our method is competitive with [73], denoted as GL, a method that explicitly prunes convolutional kernels to reduce compute time. We measure the time and energy consumption of one forward pass of a mini-batch with batch size 8192 through LeNet-5-Caffe. We average over 10^4 forward passes, and all experiments were run with Tensorflow 1.0.1, CUDA 8.0 and the respective cuDNN. We use either 16 CPUs running in parallel (CPU) or a Titan X (GPU). Note that we only use the pruned architecture, as lower bit precision would further increase the speed-up but is not implementable in any common framework. Further, all methods we compare to in the latter experiments would barely show an improvement at all, since they do not learn to prune groups but only parameters. In Figure 1 we present our results. As to be expected, the largest effect on the speed-up is caused by GPU usage. However, both our models and the best competing models reach a speed-up factor of around 8x. We can further save about 3x in energy costs by applying our architecture instead of the original one on a GPU. For larger networks the speed-up is even higher: for the VGG experiments with batch size 256 we have a speed-up factor of 51x.

<
> + + Figure 1:Left:Avg. Time a batch of 8192 samples takes to pass through LeNet-5-Caffe. Numbers on + top of the bars represent speed-up factor relative to the CPU implementation of the original network. + Right:Energy consumption of the GPU of the same process (when run on GPU). + + 6 Conclusion + + We introduced Bayesian compression, a way to tackle efficiency and compression in deep neural + networks in a unified and principled way. Our proposed methods allow for theoretically principled + compression of neural networks, improved energy efficiency with reduced computation while naturally + learning the bit precisions for each weight. This serves as a strong argument in favor of Bayesian + methods for neural networks, when we are concerned with compression and speed up. + + 11 We also tried to finetune the same network with Sparse VD, but unfortunately it increased the error + considerably (around 3% extra error), therefore we do not report those results. + + 8 Acknowledgments + We would like to thank Dmitry Molchanov, Dmitry Vetrov, Klamer Schutte and Dennis Koelma for + valuable discussions and feedback. This research was supported by TNO, NWO and Google. + + + References + [1]M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, + M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems.arXiv + preprint arXiv:1603.04467, 2016. + [2]D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions.Journal of the Royal Statistical + Society. Series B (Methodological), pages 99–102, 1974. + [3]A. Armagan, M. Clyde, and D. B. Dunson. Generalized beta mixtures of gaussians. InAdvances in neural + information processing systems, pages 523–531, 2011. + [4]E. Azarkhish, D. Rossi, I. Loi, and L. Benini. Neurostream: Scalable and energy efficient deep learning + with smart memory cubes.arXiv preprint arXiv:1701.06420, 2017. + [5]J. Ba and R. Caruana. Do deep nets really need to be deep? InAdvances in neural information processing + systems, pages 2654–2662, 2014. + [6] E. Beale, C. Mallows, et al. Scale mixing of symmetric distributions with zero means.The Annals of + Mathematical Statistics, 30(4):1145–1151, 1959. + [7]C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. + Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 + July 2015, 2015. + [8]C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals.Biometrika, 97 + (2):465–480, 2010. + [9]S. Chai, A. Raghavan, D. Zhang, M. Amer, and T. Shields. Low precision neural networks using subband + decomposition.arXiv preprint arXiv:1703.08595, 2017. + [10]W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural + networks.arXiv preprint arXiv:1506.04449, 2015. + [11]M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations + constrained to+1or1.arXiv preprint arXiv:1602.02830, 2016. + [12]M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision multiplica- + tions.arXiv preprint arXiv:1412.7024, 2014. + [13]M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary + weights during propagations. InAdvances in Neural Information Processing Systems, pages 3105–3113, + 2015. + [14]M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. 
InAdvances in + Neural Information Processing Systems, pages 2148–2156, 2013. + [15]X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference + complexity.arXiv preprint arXiv:1703.08651, 2017. + [16]M. A. Figueiredo. Adaptive sparseness using jeffreys’ prior.Advances in neural information processing + systems, 1:697–704, 2002. + [17]Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep + learning.ICML, 2016. + [18]Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector + quantization.ICLR, 2015. + [19]A. Graves. Practical variational inference for neural networks. InAdvances in Neural Information + Processing Systems, pages 2348–2356, 2011. + [20]P. D. Grünwald.The minimum description length principle. MIT press, 2007. + [21]Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. InAdvances In Neural + Information Processing Systems, pages 1379–1387, 2016. + [22]S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical + precision.CoRR, abs/1502.02551, 392, 2015. + [23]P. Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks.Master’s thesis, + University of California, 2016. + [24]S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. + InAdvances in Neural Information Processing Systems, pages 1135–1143, 2015. + [25]S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, + trained quantization and huffman coding.ICLR, 2016. + [26]K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on + imagenet classification. InProceedings of the IEEE International Conference on Computer Vision, pages + 1026–1034, 2015. + [27]G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint + arXiv:1503.02531, 2015. + [28]G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length + of the weights. InProceedings of the sixth annual conference on Computational learning theory, pages + 5–13. ACM, 1993. + [29]G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural + networks by preventing co-adaptation of feature detectors.arXiv preprint arXiv:1207.0580, 2012. + [30]A. Honkela and H. Valpola. Variational learning and bits-back coding: an information-theoretic view to + bayesian learning.IEEE Transactions on Neural Networks, 15(4):800–810, 2004. + [31]A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. + Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint + arXiv:1704.04861, 2017. + [32]F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level + accuracy with 50x fewer parameters and< 0.5 mb model size.ICLR, 2017. + [33]J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint + arXiv:1602.03807, 2016. + [34]T. Karaletsos and G. Rätsch. Automatic relevance determination for deep generative models.arXiv preprint + arXiv:1505.07765, 2015. + [35]D. Kingma and J. Ba. Adam: A method for stochastic optimization.International Conference on Learning + Representations (ICLR), San Diego, 2015. + [36]D. P. Kingma and M. Welling. 
Auto-encoding variational bayes.International Conference on Learning + Representations (ICLR), 2014. + [37]D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick. + Advances in Neural Information Processing Systems, 2015. + [38]A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009. + [39]N. D. Lawrence. Note relevance determination. InNeural Nets WIRN Vietri-01, pages 128–133. Springer, + 2002. + [40]Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. InNIPs, + volume 2, pages 598–605, 1989. + [41]Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. + Proceedings of the IEEE, 86(11):2278–2324, 1998. + [42]Y. LeCun, C. Cortes, and C. J. Burges. The mnist database of handwritten digits, 1998. + [43]D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks. + Workshop ICML, 2016. + [44]D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. + arXiv preprint arXiv:1511.06393, 2015. + [45]C. Louizos. Smart regularization of deep architectures.Master’s thesis, University of Amsterdam, 2015. + [46]C. Louizos and M. Welling. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks. + ArXiv e-prints, Mar. 2017. + [47]D. J. MacKay. Probable networks and plausible predictions—a review of practical bayesian methods for + supervised neural networks.Network: Computation in Neural Systems, 6(3):469–505, 1995. + [48]N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with + fine-grained quantization.arXiv preprint arXiv:1705.01462, 2017. + [49]P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to + weight binarization and other non-linear distortions.arXiv preprint arXiv:1606.01981, 2016. + [50]T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the + American Statistical Association, 83(404):1023–1032, 1988. + [51]D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks.arXiv + preprint arXiv:1701.05369, 2017. + [52]E. Nalisnick, A. Anandkumar, and P. Smyth. A scale mixture perspective of multiplicative noise in neural + networks.arXiv preprint arXiv:1506.03208, 2015. + [53]R. M. Neal.Bayesian learning for neural networks. PhD thesis, Citeseer, 1995. + [54]S. E. Neville, J. T. Ormerod, M. Wand, et al. Mean field variational bayes for continuous sparse signal + shrinkage: pitfalls and remedies.Electronic Journal of Statistics, 8(1):1113–1151, 2014. + [55]O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of + hierarchical models.Statistical Science, pages 59–73, 2007. + [56]C. Peterson. A mean field theory learning algorithm for neural networks.Complex systems, 1:995–1019, + 1987. + [57]M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary + convolutional neural networks. InEuropean Conference on Computer Vision, pages 525–542. Springer, + 2016. + [58]D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in + deep generative models. InProceedings of the 31th International Conference on Machine Learning, ICML + 2014, Beijing, China, 21-26 June 2014, pages 1278–1286, 2014. + [59]J. Rissanen. 
Modeling by shortest data description.Automatica, 14(5):465–471, 1978. + [60]J. Rissanen. Stochastic complexity and modeling.The annals of statistics, pages 1080–1100, 1986. + [61]S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural + networks.arXiv preprint arXiv:1607.00485, 2016. + [62]S. Shi and X. Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. + arXiv preprint arXiv:1704.07724, 2017. + [63]K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. + ICLR, 2015. + [64]M. Sites. Ieee standard for floating-point arithmetic. 2008. + [65]C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. + arXiv preprint arXiv:1602.02282, 2016. + [66]S. Srinivas and R. V. Babu. Generalized dropout.arXiv preprint arXiv:1611.06791, 2016. + [67]N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to + prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1):1929–1958, + 2014. + [68]V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and + survey.arXiv preprint arXiv:1703.09039, 2017. + [69]R. Tibshirani. Regression shrinkage and selection via the lasso.Journal of the Royal Statistical Society. + Series B (Methodological), pages 267–288, 1996. + [70]K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression.ICLR, 2017. + [71]G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision + and sparsity.arXiv preprint arXiv:1610.00324, 2016. + [72]C. S. Wallace. Classification by minimum-message-length inference. InInternational Conference on + Computing and Information, pages 72–81. Springer, 1990. + [73]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In + Advances In Neural Information Processing Systems, pages 2074–2082, 2016. + [74]T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using + energy-aware pruning.CVPR, 2017. + [75]S. Zagoruyko and N. Komodakis. Wide residual networks.arXiv preprint arXiv:1605.07146, 2016. + [76]C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization.ICLR, 2017. + + + Appendix + + A. Detailed experimental setup + + We implemented our methods in Tensorflow [1] and optimized the variational parameters using + Adam [35] with the default hyperparameters. The means of the conditional Gaussian <> + + + Table 3: Floating point formats Bits per Exponent + + <
> + + were initialized with the scheme proposed at [26], whereas the log of the standard deviations were + initialized by sampling from N(9;1e4). The parameters of q(z) were initialized such that the + overall mean of zise 1 and the overall variance is very low (1e^8); this ensures that all of the + groups are active during the initial training iterations. + As for the standard deviation constraints; for the LeNet-300-100 architecture we constrained the + standard deviation of the first layer to be 0:2 whereas for the LeNet-5-Caffe we constrained + the standard deviation of the first layer to be 0:5. The remaining standard deviations were left + unconstrained. For the VGG network we constrained the standard deviations of the 64 and 128 + feature map layers to be 0:1, the standard deviations of the 256 feature map layers to be0:2 + and left the rest of the standard deviations unconstrained. We also found beneficial the incorporation + of “warm-up” [65], i.e we annealed the negative KL-divergence from the prior to the approximate + posterior with a linear schedule for the first 100 epochs. We initialized the means of the approximate + posterior by the weights and biases obtained from a VGG network trained with batch normalization + and dropout on CIFAR 10. For our method we disabled batch-normalization during training. + As for preprocessing the data; for MNIST the only preprocessing we did was to rescale the digits to + lie at the [-1,1] range and for CIFAR 10 we used the preprocessed dataset provided by [75]. + Furthermore, do note that by pruning a given filter at a particular convolutional layer we can also + prune the parameters corresponding to that feature map for the next layer. This similarly holds for + fully connected layers; if we drop a given input neuron then the weights corresponding to that node + from the previous layer can also be pruned. + + B. Standards for Floating-Point Arithmetic + + Floating points values eventually need to be represented in a binary basis in a computer. The most + common standard today is the IEEE 754-2008 convention [64]. It definesx-bit base-2 formats, + officially referred to as binaryx, withx2 f16;32;64;128g. The formats are also widely known as + half, single, double and quadruple precision floats, respectively and used in almost all programming + languages as a standard. The format considers 3 kinds of bits: one sign bit,wexponent bits andp + precision bits. + + <
> + + Figure 2: A symbolic representation of the binaryxformat [64]. + + + The Sign bit determines the sign of the number to be represented. The exponentEis anw-bit signed + integer, e.g. for single precisionw= 8and thusE2[127;128]. In practice, exponents range from + is smaller since the first and the last number are reserved for special numbers. The true significand or + mantissa includes t bits on the right of the binary point. There is an implicit leading bit with value + one. A values is consequently decomposed as follows + + <> (21) + + <> (22) + + In table 3, we summarize common and less common floating point formats. + + There is however the possibility to design a self defined format. There are 3 important quantities + when choosing the right specification: overflow, underflow and unit round off also known as machine + precision. Each one can be computed knowing the number of exponent and significant bits. in + our work for example we consider a format that uses significantly less exponent bits since network + parameters usually vary between [-10,10]. We set the unit round off equal to the precision and thus + can compute the significant bits necessary to represent a specific weight. + Beyond designing a tailored floating point format for deep learning, recent work also explored the + possibility of deep learning with mixed formats [43,23]. For example, imagine the activations having + high precision while weights can be low precision. + + C. Shrinkage properties of the normal-Jeffreys and horseshoe priors + + <
> + + Figure 3: Comparison of the behavior of the log-uniform / normal-Jeffreys (NJ) prior and the + horseshoe (HS) prior (wheres= 1). Both priors behave similarly at zero but the normal-Jeffreys has + an extremely heavy tail (thus making it non-normalizable). + + In this section we will provide some insights about the behavior of each of the priors we employ by + following the excellent analysis of [8]; we can perform a change of variables and express the scale + mixture distribution of eq.3 in the main paper in terms of a shrinkage coefficient, + + <> (23) + + It is easy to observe that eq. 23 corresponds to a continuous relaxation of the spike-and-slab prior: + when <<= 0>> we have that <>, i.e. no shrinkage/regularization forw, when + <<= 1>> we have that <>, i.e.wis exactly zero, and when <<=1>> we have that <>. Now by examining the implied prior on the shrinkage coefficient for both + the log-uniform and the horseshoe priors we can better study their behavior. As it is explained at + the half-Cauchy prior onzcorresponds to a beta prior on the shrinkage coefficient, <>, + whereas the normal-Jeffreys / log-uniform prior onzcorresponds <> with <>. + The densities of both of these distributions can be seen at Figure 3b. As we can observe, the log- + uniform prior posits a distribution that concentrates almost all of its mass at either0or1, + essentially either pruning the parameter or keeping it close to the maximum likelihood estimate due + <>. In contrast the horseshoe prior maintains enough probability mass for + the in-between values of and thus can, potentially, offer better regularization and generalization. + + D. Negative KL-divergences for log-normal approximating posteriors + + Le <> be a log-normal approximating posterior. Here we will derive the negative + KL-divergences toq(z)from inverse gamma, gamma and half-normal distributions. + Letp(z)be an inverse gamma distribution, i.e. <>. The negative KL-divergence can + be expressed as follows: + + <> (24) + + + The second term is the entropy of the log-normal distribution which has the following form: + + <> (25) + + The first term is the negative cross-entropy of the log-normal approximate posterior from the inverse- + Gamma prior: + <> (26) + + <> (27) + + Since the natural logarithm of a log-normal distribution <> follows a normal distribution + <> we have that <>. Furthermore we have that <> then <>, therefore + <>. Putting everything together we have that: + + <> (28) + + Therefore the negative KL-divergence is: + + <> (29) + + Now let p(z) be a Gamma prior, i.e. <>. We have that the negative cross-entropy + changes to: + <> (30) + + <> (31) + + <> (32)2 + + Therefore the negative KL-divergence is: + + <> (33) + + Now, by employing the aforementioned we can express the negative KL-divergence from + <> to <> as follows: + + <> + + with the KL-divergence for the weight distribution <> given by eq.8 in the main paper. + + E. Visualizations + + <
> + + Figure 4: Distribution of the thresholds for the Sparse Variational Dropout 4a, Bayesian Compression + with group normal-Jeffreys (BC-GNJ) 4b and group Horseshoe (BC-GHS) 4c priors for the three + layer LeNet-300-100 architecture. It is easily observed that there are usually two well separable + groups with BC-GNJ and BC-GHS, thus making the choice for the threshold easy. Smaller values + indicate signal whereas larger values indicate noise (i.e. useless groups). + + <
> + + Figure 5: Distribution of the bit precisions for the Sparse Variational Dropout 5a, Bayesian Com- + pression with group normal-Jeffreys (BC-GNJ) 5b and group Horseshoe (BC-GHS) 5c priors for the + three layer LeNet-300-100 architecture. All of the methods usually require far fewer than 32bits for + the weights. + + F. Algorithms for the feedforward pass + + Algorithms 1, 2, 3, 4 describe the forward pass using local reparametrizations for fully connected and + convolutional layers with the approximate posteriors for the Bayesian Compression (BC) with group + normal-Jeffreys (BC-GNJ) and group Horseshoe (BC-GHS) priors employed at the experiments. For + the fully connected layers we coupled the scales for each input neuron whereas for the convolutional + we couple the scales for each output feature map.Mw ;w are the means and variances of each layer, + His a minibatch of activations of sizeK. For the first layer we have thatH=XwhereXis the + minibatch of inputs. For the convolutional layersNf are the number of convolutional filters,is the + convolution operator and we assume the [batch, height, width, feature maps] convention. + + Algorithm 1 Fully connected BC-GNJ layer h. + + <> + + Algorithm 2Convolutional BC-GNJ layerh. + + <> + + Algorithm 3 Fully connected BC-GHS layerh. + + <> + + Algorithm 4Convolutional BC-GHS layerh. + + <> + +<> <> <> + + +<> <> <> +Channel Pruning for Accelerating Very Deep Neural Networks +Yihui He* Xiangyu Zhang Jian Sun +Xifian Jiaotong University Megvii Inc. Megvii Inc. +Xifian, 710049, China Beijing, 100190, China Beijing, 100190, China +heyihui@stu.xjtu.edu.cn zhangxiangyu@megvii.com sunjian@megvii.com + +Abstract +In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural net.works. Given a trained CNN model, we propose an it.erative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method re.duces the accumulated error and enhance the compatibility with various architectures. Our pruned VGG-16 achieves the state-of-the-art results by 5. speed-up along with only 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet, exception and suffers only 1.4%, 1.0% accuracy loss under 2. speed.up respectively, which is significant. +1. Introduction +Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that convert a CNN into compact one [22]. This work focuses on the last one. +Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, feature map width (number of channels) could not be reduced, which makes it difficult to decompose 1 . 1 convolutional layer favored by modern networks (e.g., GoogleNet [45], ResNet [18], Xception [7]). This type of method also intro.duces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieves high theoretical speed-up ratio, the sparse convolutional layers have an fiirregularfi shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks + +<
>

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).

a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.

Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which can adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights, and the reported speed-up ratios are very limited.

In this paper, we propose a new inference-time approach for channel pruning, utilizing inter-channel redundancy. Inspired by the tensor factorization improvement based on feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit the redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing the reconstruction error on its output feature maps, as shown in Fig. 2. We solve this

<
> + +Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation that two channels are pruned for feature map B. Thus corresponding channels of filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C, kh . kw : kernel size. +minimization problem by two alternative steps: channels selection and feature map reconstruction. In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with remaining channels with linear least squares. We alternatively take two steps. Further, we approximate the network layer-by-layer, with accumulated error accounted. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], exception [7]). +For VGG-16, we achieve 4. acceleration, with only 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5. acceleration but merely suffer 0.3% increase of error, which outperforms previous state-of-the.arts. We further speed up ResNet-50 and Xception-50 by 2. with only 1.4%, 1.0% accuracy loss respectively. + +2. Related Work + +There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22]. +Optimized implementation based methods [35, 47, 27, 4] accelerate convolution, with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity. +Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weights magnitude. [16] could accelerate fully connected layers up to 50.. However, in practice, the actual speed-up maybe very related to implementation. +Tensor factorization [22, 28, 13, 24] decompose weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorize a layer into 3 . 3 and 1 . 1 combination, driven by feature map redundancy. +Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches high compression ratio for first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and the effectiveness for very deep networks on large datasets is rarely exploited. +Inference-time channel pruning is challenging, as re.ported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, which mainly operate the fully connected layers. Data-free approaches [31, 3] results for speed-up ratio (e.g., 5.) have not been reported, and requires long retraining procedure. [3] select channels via over 100 random trials, however it need long time to eval.ate each trial on a deep network, which makes it infeasible to work on very deep models and large datasets. [31] is even worse than naive solution from our observation sometimes (Sec. 4.1.1). + +3. 
Approach

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.

3.1. Formulation

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining the outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, the filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction: we need to reconstruct the following feature maps using the selected channels.

Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out the representative channels and prune the redundant ones. In the other step, we reconstruct the outputs from the remaining channels with linear least squares. We take these two steps alternately.

Formally, to prune a feature map with c channels, we consider applying the n × c × k_h × k_w convolutional filters W on the N × c × k_h × k_w input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and k_h, k_w are the kernel size. For simplicity, the bias term is not included in our formulation. To prune the input channels from c to a desired c' (0 <= c' <= c), while minimizing the reconstruction error, we formulate our problem as follows:

<> (1)

|| . ||_F is the Frobenius norm. X_i is the N × k_h k_w matrix sliced from the i-th channel of the input volumes X, i = 1, ..., c. W_i is the n × k_h k_w filter weights sliced from the i-th channel of W. β is a coefficient vector of length c for channel selection, and β_i is the i-th entry of β. Notice that if β_i = 0, X_i is no longer useful and can be safely pruned from the feature map; W_i can also be removed.

Optimization. Solving this minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the l0 regularization to l1:

<> (2)

λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add a constraint ||W_i||_F = 1 for all i to this formulation, which avoids the trivial solution.

Now we solve this problem in two folds. First, we fix W and solve β for channel selection. Second, we fix β and solve for W to minimize the reconstruction error.

(i) The subproblem of β. In this case, W is fixed and we solve β for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection.

<> (3)

Here Z_i = X_i W_i^T (size N × n). We will ignore the i-th channel if β_i = 0.

(ii) The subproblem of W. In this case, β is fixed and we utilize the selected channels to minimize the reconstruction error. We can find the optimal solution by least squares:

<> (4)

Here <> (size N × c k_h k_w) and W' is W reshaped to n × c k_h k_w, <>. After obtaining the result W', it is reshaped back to W. Then we assign <>, so that the constraint <> is satisfied.
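Both subproblems map onto off-the-shelf solvers. The sketch below is a simplified single pass of steps (i) and (ii) for one layer, written with scikit-learn's Lasso and NumPy least squares; the tensor layout, the value of alpha (playing the role of λ), and the folding of β into the refitted weights are our simplifications, not the released implementation.

    import numpy as np
    from sklearn.linear_model import Lasso

    def prune_layer_once(X, W, Y, alpha=1e-3):
        """One pass of (i) LASSO channel selection and (ii) least-squares reconstruction.

        X : (N, c, kh*kw) sampled input volumes, split per input channel
        W : (n, c, kh*kw) filters of the layer being pruned
        Y : (N, n)        sampled outputs of the corresponding unpruned layer
        """
        N, c, _ = X.shape
        n = W.shape[0]

        # (i) Channel selection: regress Y on Z_i = X_i W_i^T with one coefficient per channel.
        Z = np.stack([X[:, i, :] @ W[:, i, :].T for i in range(c)], axis=-1)   # (N, n, c)
        lasso = Lasso(alpha=alpha, fit_intercept=False)
        lasso.fit(Z.reshape(N * n, c), Y.reshape(N * n))
        keep = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)    # surviving channel indices

        # (ii) Reconstruction: refit the remaining weights by least squares (the selected
        # betas are absorbed into the refitted weights here, for brevity).
        X_sel = X[:, keep, :].reshape(N, -1)                 # (N, len(keep)*kh*kw)
        W_flat, *_ = np.linalg.lstsq(X_sel, Y, rcond=None)   # (len(keep)*kh*kw, n)
        return keep, W_flat.T.reshape(n, len(keep), -1)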
We alternately optimize (i) and (ii). In the beginning, W is initialized from the trained model, <> (namely no penalty), and <>. We gradually increase <>. For each change of <>, we iterate these two steps until ||β||_0 is stable. After <> is satisfied, we obtain the final solution W from <>. In practice, we found that the two-step iteration is time consuming, so we apply (i) multiple times, until <> is satisfied, and then apply (ii) just once to obtain the final result. From our observation, this result is comparable with that of the two-step iteration. Therefore, in the following experiments, we adopt this approach for efficiency.

Discussion: Some recent works [48, 1, 17] (though training based) also introduce the l1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduce sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.

3.2. Whole Model Pruning

Inspired by [52], we apply our approach layer by layer, sequentially. For each layer, we obtain the input volumes from the current input feature map, and the output volumes from the output feature map of the un-pruned model. This can be formalized as:

<> (5)

Different from Eqn. 1, Y is replaced by Y', which is taken from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.
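Sequential pruning then reduces to a simple loop in which inputs are sampled from the current, already-pruned network while targets come from the unpruned one. The sketch below is illustrative pseudocode: the callbacks, the layer interface and the penalty schedule are hypothetical names, and prune_layer_once is the single-layer sketch given earlier.

    def prune_whole_model(layers, penalties, sample_inputs, sample_original_outputs):
        """Layer-by-layer pruning in the spirit of Eqn. 5 (illustrative sketch only).

        sample_inputs(layer)           -> volumes from the current, already-pruned network
        sample_original_outputs(layer) -> Y': outputs of the same layer in the unpruned model
        """
        for layer, alpha in zip(layers, penalties):
            X = sample_inputs(layer)
            Y_prime = sample_original_outputs(layer)
            keep, W_new = prune_layer_once(X, layer.weights, Y_prime, alpha=alpha)
            # Shrink this layer; the filters that produced the dropped channels in the
            # previous layer can be removed as well.
            layer.set_channels(keep, W_new)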
3.3. Pruning Multi-Branch Networks

The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1×1, 3×3, 1×1; Fig. 3, left). The layers other than the first and the last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times that of its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, the accumulated error from the shortcut is hard to recover, since there are no parameters on the shortcut. To address these challenges, we propose several variants of our approach as follows.

<
>

Figure 3. Illustration of multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement; c_x denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width can be reduced. As for the last layer, rather than approximating Y2, we try to approximate Y1 - Y1' + Y2 directly (Sec. 3.3, Last layer of residual branch).

Last layer of residual branch: As shown in Fig. 3, the output of a residual block sums two inputs: feature maps Y1 and Y2 from the shortcut and the residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1 and Y2 are the original feature maps before pruning. Y2 can be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 cannot be recovered directly. To compensate for this error, the optimization goal of the last layer is changed from Y2 to Y1 - Y1' + Y2, which does not change our optimization. Here, Y1' is the current feature map after the previous layers have been pruned. When pruning, volumes should be sampled correspondingly from these two branches.

First layer of residual branch: Illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. Differently, we sample the selected channels on the shared feature maps to construct a new input for the later convolution, as shown in Fig. 3 (right). The computational cost of this operation can be ignored. More importantly, after introducing feature map sampling, the convolution is still regular.

Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply our Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original approach. From our experiments, it improves top-5 accuracy by 0.5% for a 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it produces irregular convolutional layers, which need special library implementation support. We do not adopt it in the following experiments.

4. Experiment

We evaluate our approach on the popular VGG Nets [43], ResNet [18] and Xception [7], using ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].

For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solver implementation. For channel pruning, we found it is enough to extract 5000 images, with 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. Testing is on a center crop of 224×224 pixels. We can gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e-4. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224×224 and mirror.
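The Batch Normalization merge mentioned in the setup above is a standard inference-time transformation; a minimal per-channel version in plain NumPy (not tied to the Caffe pipeline used in the experiments) looks as follows.

    import numpy as np

    def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
        """Fold a BatchNorm layer (gamma, beta, running mean/var) into the preceding
        convolution, so that conv + BN becomes a single conv with identical outputs.

        W : (n, c, kh, kw) conv weights, b : (n,) conv bias (zeros if absent)
        """
        scale = gamma / np.sqrt(var + eps)          # one factor per output channel
        W_folded = W * scale[:, None, None, None]
        b_folded = (b - mean) * scale + beta
        return W_folded, b_folded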
4.1. Experiments with VGG-16

VGG-16 [43] is a 16-layer single-path convolutional neural network with 13 convolutional layers. It is widely used in recognition, detection, segmentation, etc. Its single-view top-5 accuracy is 89.9% (pretrained model from http://www.vlfeat.org/matconvnet/pretrained/).

4.1.1 Single Layer Pruning

In this subsection, we evaluate single layer acceleration performance using our algorithm in Sec. 3.1. For better understanding, we compare our algorithm with two naive channel selection strategies: first k selects the first k channels; max response selects channels based on the corresponding filters with the highest absolute weight sums [31]. For a fair comparison, we obtain the feature map indexes selected by each of them, then perform reconstruction (Sec. 3.1 (ii)). We hope this demonstrates the importance of channel selection. Performance is measured by the increase of error after a certain layer is pruned without fine-tuning, shown in Fig. 4.

As expected, the error increases as the speed-up ratio increases. Our approach is consistently better than the other approaches for different convolutional layers under different speed-up ratios. Unexpectedly, sometimes max response is even worse than first k. We argue that max response ignores correlations between different filters: filters with large absolute weights may be strongly correlated, so selection based on filter weights is less meaningful, and correlation on feature maps is worth exploiting. We find that channel selection affects the reconstruction error a lot; therefore, it is important for channel pruning.

Also notice that channel pruning gradually becomes harder, from shallower to deeper layers. This indicates that shallower layers have much more redundancy, which is consistent with [52]. We can prune more aggressively on shallower layers in whole model acceleration.
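For reference, the "max response" baseline used in this comparison can be written in a couple of lines; this is one reading of the heuristic of [31] (ranking by absolute weight sum), with the exact weight slice over which the sum is taken left as an illustrative choice.

    import numpy as np

    def max_response_channels(W, k):
        """Keep the k input channels whose filter slices have the largest absolute
        weight sum; W has shape (n, c, kh, kw). Returns sorted channel indices."""
        scores = np.abs(W).sum(axis=(0, 2, 3))      # one score per input channel
        return np.sort(np.argsort(scores)[-k:])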
<
>

Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by the increase of error. To verify the importance of the channel selection referred to in Sec. 3.1, we consider two naive baselines: first k selects the first k feature maps; max response selects channels based on the absolute weight sum of the corresponding filter [31]. Our approach is consistently better (smaller is better).
<
>

Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better).

4.1.2 Whole Model Pruning

Whole model acceleration results under 2×, 4× and 5× are shown in Table 1. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively for shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1:1.5. conv5_x is not pruned, since these layers only contribute 9% of the total computation and are not redundant.

After fine-tuning, we can reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit the channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22] without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and the absolute speed-up ratio on GPU is much higher (Table 3).

Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52].
<
>

Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better).

As demonstrated in Table 2, our three-cardinality acceleration (spatial factorization, channel factorization, and channel pruning, denoted by 3C) outperforms previous state-of-the-art results. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: <>. We apply spatial factorization, channel factorization, and our channel pruning together sequentially, layer by layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality.

4.1.3 Comparisons of Absolute Performance

We further evaluate the absolute performance of acceleration on GPU. The results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN 5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged over 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead; they cannot gain much absolute speed-up. Though our approach also encounters some performance degradation, it generalizes better on GPU than the other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because the current library and hardware prefer a single large convolution over several small ones.

4.1.4 Comparisons with Training from Scratch

Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with from-scratch counterparts. To be fair, we evaluate both the from-scratch counterpart and a normal-setting network that has the same computational complexity and the same architecture.

As shown in Table 4, we observe that it is difficult for the from-scratch counterparts to reach competitive accuracy; our model outperforms the from-scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] showing that a model is easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers.

For from scratch (uniformed), the number of filters in each layer is reduced by half (e.g., conv1_1 is reduced from 64 to 32). We observe that normal-setting networks of the same complexity cannot reach the same accuracy either. This consolidates our idea that there is much redundancy in networks during training, but that this redundancy can be opted out of at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches.

Notice that there is a 0.6% gap between the from-scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even for a thinner one. Further research could extend our approach to thin model exploration.

4.1.5 Acceleration for Detection

VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16 with Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainable images and 5k test images.
The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN.

The actual running time of Faster R-CNN is 220ms/image, of which the convolutional layers contribute about 64%. We get an actual time of 94ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop with our 2× model, which is not harmful for practical use.

4.2. Experiments with Residual Architecture Nets

For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, 1×1 convolutions are favored, which can hardly be factorized.

4.2.1 ResNet Pruning

ResNet complexity uniformly drops for each residual block. Guided by the single layer experiments (Sec. 4.1.1), we still prefer pruning shallower layers more heavily than deeper ones. Following a similar setting as Filter pruning [31], we keep 70% of the channels for sensitive residual blocks (res5 and the blocks close to the positions where the spatial size changes, e.g., res3a, res3d).

<
> + +Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better). + +<
>

Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms the scratch-trained counterparts (smaller is better).

<
>

Table 5. Acceleration for Faster R-CNN detection.

<
>

Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with multi-branch enhancement (Sec. 3.3, smaller is better).

As for the other blocks, we keep 30% of the channels. With multi-branch enhancement, we prune branch2a more aggressively within each residual block. The remaining channel ratio for branch2a, branch2b, branch2c is 2:4:3 (e.g., given 30%, we keep 40%, 80%, 60% respectively).

We evaluate the performance of the multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve accuracy by 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which can broadcast to every layer after it, and because the large input feature map width at the entry of each residual block is well reduced by our feature map sampling.
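The accumulated-error accounting described here amounts to changing the reconstruction target of the last convolution in each residual branch; a minimal sketch with our own variable names (the arrays are whatever sampled feature-map volumes are in use):

    def residual_last_layer_target(Y1, Y1_pruned, Y2):
        """Compensated target for the last conv of a residual branch (Sec. 3.3):
        approximate Y1 - Y1' + Y2 instead of Y2 alone, where Y1/Y2 are the original
        shortcut/residual outputs and Y1' (Y1_pruned) is the shortcut input after
        the preceding layers have already been pruned."""
        return Y1 - Y1_pruned + Y2

The target is then fed to the same single-layer routine as before, e.g. prune_layer_once(X_last, W_last, residual_last_layer_target(Y1, Y1_pruned, Y2)).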
<
>

Table 7. Comparisons for Xception-50, under a 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches; most structured simplification methods are not effective on the Xception architecture (smaller is better).

4.2.2 Xception Pruning

Since computational complexity has become important in model design, separable convolution has received much attention [49, 7]. Xception [7] is already spatially optimized, and tensor factorization on its 1×1 convolutional layers is destructive. Thanks to our approach, it can still be accelerated with graceful degradation. For ease of comparison, we adopt the Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and a complexity of 4450 MFLOPs.

We apply the multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Maybe because the Xception block is unstable, Batch Normalization layers must be maintained during pruning; otherwise it becomes nontrivial to fine-tune the pruned model.

As shown in Table 7, after fine-tuning we only suffer a 1.0% increase of error under 2×. Filter pruning [31] can also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%; after training for 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as the results for VGG-16, since modern networks tend to have less redundancy by design.

<
> + +Table 8. 2. speed-up comparisons for ResNet-56 on CIFAR-10, the baseline accuracy is 92.8% (one view). We outperforms previous approaches and scratch trained counterpart (smaller is better). + + +4.2.3 Experiments on CIFAR-10 +Even though our approach is designed for large datasets, it could generalize well on small datasets. We perform experiments on CIFAR-10 dataset [25], which is favored by many acceleration researches. It consists of 50k images for training and 10k for testing in 10 classes. +We reproduce ResNet-56, which has accuracy of 92.8% (Serve as a reference, the official ResNet-56 [18] has ac.curacy of 93.0%). For 2. acceleration, we follow similar setting as Sec. 4.2.1 (keep the final stage unchanged, where the spatial size is 8 . 8). Shown in Table 8, our approach is competitive with scratch trained one, without fine-tuning, under 2. speed-up. After fine-tuning, our result is significantly better than Filter pruning [31] and scratch trained one. + +5. Conclusion +To conclude, current deep CNNs are accurate with high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep net.works. The reduced CNNs are inference efficient networks while maintaining accuracy, and only require off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on Im.ageNet, CIFAR-10 and PASCAL VOC. +In the future, we plan to involve our approaches into training time, instead of inference time only, which may also accelerate training procedure. + +References +[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262fi2270, 2016. 1, 2, 3, 6 +[2] S. Anwar, K. Hwang, and W. Sung. Structured prun.ing of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015. 2 +[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016. 1, 2 +[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016. 2 +[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373fi384, 1995. 3 +[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, +B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. 6 +[7] F. Chollet. Xception: Deep learning with depthwise separa.ble convolutions. arXiv preprint arXiv:1610.02357, 2016. 1, 2, 3, 4, 6, 7 +[8] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016. 1, 2 +[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248fi255. IEEE, 2009. 4 +[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer.gus. Exploiting linear structure within convolutional net.works for efficient evaluation. In Advances in Neural In.formation Processing Systems, pages 1269fi1277, 2014. 2 +[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal.network.org/challenges/VOC/voc2007/workshop/index.html. 4, 6 +[12] R. Girshick. Fast r-cnn. 
In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015. 2
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014. 2
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pages 1379-1387, 2016. 2
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243-254. IEEE Press, 2016. 2
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2, 2015. 2
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015. 1, 2, 3
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 1, 2, 3, 4, 6, 8
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016. 2
[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 6
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 4
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014. 1, 2, 5, 6, 7
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 4, 6
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015. 2
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. 4, 8
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012. 2, 3
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015. 2
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014. 2
[29] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015. 2
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998. 2, 3
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016. 1, 2, 4, 5, 6, 7, 8
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806-814, 2015. 2
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015. 6
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015. 2
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013. 2
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010. 4
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40-53, 2008. 6
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011. 4
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163-2175, 2015. 2
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016. 2
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015. 6
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. 6
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3, 4, 5, 6
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015. 2
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015. 1, 3, 6
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996. 3
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014. 1, 2
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances In Neural Information Processing Systems, pages 2074-2082, 2016. 1, 2, 3
[49] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016. 7
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365-2369, 2013. 2
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016. 2
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943-1955, 2016. 1, 2, 3, 5, 6, 7
<> <> <>


<> <> <>
Convex Neural Networks

Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Dept. IRO, Université de Montréal
P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, Qc, Canada
{bengioy,lerouxni,vincentp,delallea,marcotte}@iro.umontreal.ca

Abstract
Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.

1 Introduction
The objective of this paper is not to present yet another learning algorithm, but rather to point to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund and Schapire, 1997) and convex optimization. Its main contributions concern the mathematical analysis of an algorithm that is similar to previously proposed incremental NNs, with L1 regularization on the output weights. This analysis helps to understand the underlying convex optimization problem that one is trying to solve.
This paper was motivated by the unproven conjecture (based on anecdotal experience) that when the number of hidden units is "large", the resulting average error is rather insensitive to the random initialization of the NN parameters. One way to justify this assertion is that to really stay stuck in a local minimum, one must have second derivatives positive simultaneously in all directions. When the number of hidden units is large, it seems implausible for none of them to offer a descent direction. Although this paper does not prove or disprove the above conjecture, in trying to do so we found an interesting characterization of the optimization problem for NNs as a convex program if the output loss function is convex in the NN output and if the output layer weights are regularized by a convex penalty. More specifically, if the regularization is the L1 norm of the output layer weights, then we show that a "reasonable" solution exists, involving a finite number of hidden units (no more than the number of examples, and in practice typically much less). We present a theoretical algorithm that is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted one at a time. Each insertion requires solving a weighted classification problem, very much like in Boosting (Freund and Schapire, 1997) and in particular Gradient Boosting (Mason et al., 2000; Friedman, 2001).
Neural Networks, Gradient Boosting, and Column Generation
Denote x̃ ∈ R^{d+1} the extension of a vector x ∈ R^d with one element with value 1. What we call "Neural Network" (NN) here is a predictor for supervised learning of the form <> where x is an input vector, <> is obtained from a linear discriminant function h_i <> with e.g. <>, or <> or <>. A learning algorithm must specify how to select m, the w_i's and the v_i's.
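For concreteness, the following is a small NumPy sketch (ours, not from the paper) of a predictor of this additive form, ŷ(x) = Σ_i w_i h_i(x) with h_i(x) = s(v_i · x̃); the choice s = tanh and the random parameters are illustrative assumptions.

import numpy as np

def nn_predict(x, V, w, s=np.tanh):
    """Additive NN predictor: y_hat(x) = sum_i w_i * s(v_i . x_tilde).

    x : (d,) input vector
    V : (m, d+1) hidden-unit weight vectors v_i (acting on the extended input)
    w : (m,) output weights
    """
    x_tilde = np.append(x, 1.0)          # extend x with a constant 1 (bias term)
    h = s(V @ x_tilde)                   # hidden-unit activations h_i(x)
    return w @ h                         # weighted sum of hidden units

# Toy usage with random parameters (illustrative only).
rng = np.random.default_rng(0)
d, m = 2, 5
V = rng.normal(size=(m, d + 1))
w = rng.normal(size=m)
print(nn_predict(np.array([0.5, -1.0]), V, w))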
The classical solution (Rumelhart, Hinton and Williams, 1986) involves (a) selecting a loss function Q(ŷ, y) that specifies how to penalize for mismatches between ŷ(x) and the observed y's (target output or target class), (b) optionally selecting a regularization penalty that favors "small" parameters, and (c) choosing a method to approximately minimize the sum of the losses on the training data D = {(x_1, y_1), ..., (x_n, y_n)} plus the regularization penalty. Note that in this formulation, an output non-linearity can still be used, by inserting it in the loss function Q. Examples of such loss functions are the quadratic loss ||ŷ - y||², the hinge loss <> (used in SVMs), the cross-entropy loss <> (used in logistic regression), and the exponential loss <> (used in Boosting).
Gradient Boosting has been introduced in (Friedman, 2001) and (Mason et al., 2000) as a non-parametric greedy-stagewise supervised learning algorithm in which one adds a function at a time to the current solution <>, in a steepest-descent fashion, to form an additive model as above but with the functions h_i typically taken in other kinds of sets of functions, such as those obtained with decision trees. In a stagewise approach, when the (m+1)-th basis <> is added, only <> is optimized (by a line search), like in matching pursuit algorithms. Such a greedy-stagewise approach is also at the basis of Boosting algorithms (Freund and Schapire, 1997), which are usually applied using decision trees as bases and Q the exponential loss.
It may be difficult to minimize exactly for w_{m+1} and h_{m+1} when the previous bases and weights are fixed, so (Friedman, 2001) proposes to "follow the gradient" in function space, i.e., look for a base learner h_{m+1} that is best correlated with the gradient of the average loss on the <> (that would be the residue <> in the case of the square loss). The algorithm analyzed here also involves maximizing the correlation between Q' (the derivative of Q with respect to its first argument, evaluated on the training predictions) and the next basis h_{m+1}. However, we follow a "stepwise", less greedy, approach, in which all the output weights are optimized at each step, in order to obtain convergence guarantees.
Our approach adapts the Column Generation principle (Chvátal, 1983), a decomposition technique initially proposed for solving linear programs with many variables and few constraints. In this framework, active variables, or "columns", are only generated as they are required to decrease the objective. In several implementations, the column-generation subproblem is frequently a combinatorial problem for which efficient algorithms are available. In our case, the subproblem corresponds to determining an "optimal" linear classifier.

2 Core Ideas
Informally, consider the set H of all possible hidden unit functions (i.e., of all possible hidden unit weight vectors v_i). Imagine a NN that has all the elements in this set as hidden units. We might want to impose precision limitations on those weights to obtain either a countable or even a finite set. For such a NN, we only need to learn the output weights. If we end up with a finite number of non-zero output weights, we will have at the end an ordinary feedforward NN. This can be achieved by using a regularization penalty on the output weights that yields sparse solutions, such as the L1 penalty.
If in addition the loss function is convex in the output layer weights (which is the case for squared error, hinge loss, ε-tube regression loss, and logistic or softmax cross-entropy), then it is easy to show that the overall training criterion is convex in the parameters (which are now only the output weights). The only problem is that there are as many variables in this convex program as there are elements in the set H, which may be very large (possibly infinite). However, we find that with L1 regularization, a finite solution is obtained, and that such a solution can be obtained by greedily inserting one hidden unit at a time. Furthermore, it is theoretically possible to check that the global optimum has been reached.

Definition 2.1. Let H be a set of functions from an input space X to R. Elements of H can be understood as "hidden units" in a NN. Let W be the Hilbert space of functions from H to R, with an inner product denoted by <>. An element of W can be understood as the output weights vector in a neural network. Let <> be the function that maps any element <> of <>; <> can be understood as the vector of activations of hidden units when input x is observed. Let w ∈ W represent a parameter (the output weights). The NN prediction is denoted <>. Let Q: R × R → R be a cost function convex in its first argument that takes a scalar prediction ŷ(x) and a scalar target value y and returns a scalar cost. This is the cost to be minimized on example pair (x, y). Let D = {(x_1, y_1), ..., (x_n, y_n)} be the training set. Let Ω: W → R be a convex regularization functional that penalizes for the choice of more "complex" parameters (e.g., <> according to a 1-norm in W, if H is countable). We define the convex NN criterion C(H, Q, Ω, D, w) with parameter w as follows:

C(H, Q, Ω, D, w) = Ω(w) + Σ_{t=1}^{n} Q(ŷ(x_t), y_t)    (1)

The following is a trivial lemma, but it is conceptually very important as it is the basis for the rest of the analysis in this paper.

Lemma 2.2. The convex NN cost C(H, Q, Ω, D, w) is a convex function of w.
Proof. Q(ŷ(x_t), y_t) is convex in w and Ω is convex in w, by the above construction. C is additive in Q and additive in Ω. Hence C is convex in w.
Note that there are no constraints in this convex optimization program, so that at the global minimum all the partial derivatives of C with respect to elements of w cancel.
Let |H| be the cardinality of the set H. If it is not finite, it is not obvious that an optimal solution can be achieved in finitely many iterations.

Lemma 2.2 says that training NNs from a very large class (with one or more hidden layers) can be seen as convex optimization problems, usually in a very high dimensional space, as long as we allow the number of hidden units to be selected by the learning algorithm. By choosing a regularizer that promotes sparse solutions, we obtain a solution that has a finite number of "active" hidden units (non-zero entries in the output weights vector w). This assertion is proven below, in Theorem 3.1, for the case of the hinge loss.
However, even if the solution involves a finite number of active hidden units, the convex optimization problem could still be computationally intractable because of the large number of variables involved. One approach to this problem is to apply the principles already successfully embedded in Gradient Boosting, but more specifically in Column Generation (an optimization technique for very large scale linear programs), i.e., add one hidden unit at a time in an incremental fashion (a minimal sketch of the criterion being minimized is given below).
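The sketch below (ours, in NumPy) evaluates the criterion of Eq. (1) under two simplifying assumptions: a finite dictionary of pre-computed hidden-unit outputs and the hinge loss for Q with Ω the (scaled) L1 norm; all names are our own.

import numpy as np

def hinge(yhat, y):
    # Hinge loss Q(yhat, y) = max(0, 1 - y * yhat)
    return np.maximum(0.0, 1.0 - y * yhat)

def convex_nn_criterion(w, H, y, lam):
    """C(w) = sum_t Q(yhat(x_t), y_t) + lam * ||w||_1.

    H : (n, J) matrix of hidden-unit outputs h_j(x_t) for a finite dictionary
    w : (J,) output weights (the only trainable parameters here)
    y : (n,) targets in {-1, +1}
    """
    yhat = H @ w
    return hinge(yhat, y).sum() + lam * np.abs(w).sum()

# Toy usage: a random finite dictionary of sign hidden units on random data.
rng = np.random.default_rng(0)
n, J = 20, 50
H = np.sign(rng.normal(size=(n, J)))
y = np.sign(rng.normal(size=n))
w = np.zeros(J)
print(convex_nn_criterion(w, H, y, lam=1.0))   # equals n: every hinge loss is 1 at w = 0

Because the criterion is convex in w, any convex solver over the active columns can be used at each incremental step; the difficulty addressed in the remainder of the section is knowing when no further hidden unit can decrease it.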
The important ingredient here is a way to know that we have reached the global optimum, thus not requiring to actually visit all the possible hidden units. We show that this can be achieved as long as we can solve the sub-problem of finding a linear classifier that minimizes the weighted sum of classification errors. This can be done exactly only on low-dimensional data sets, but can be well approximated using weighted linear SVMs, weighted logistic regression, or Perceptron-type algorithms.
Another idea (not followed up here) would be to consider first a smaller set H_1, for which the convex problem can be solved in polynomial time, and whose solution can theoretically be selected as initialization for minimizing the criterion <>, with <>, and where H_2 may have infinite cardinality (countable or not). In this way we could show that we can find a solution whose cost satisfies <>, i.e., is at least as good as the solution of a more restricted convex optimization problem. The second minimization can be performed with a local descent algorithm, without the necessity to guarantee that the global optimum will be found.

3 Finite Number of Hidden Neurons
In this section we consider the special case with <> the hinge loss and <> regularization, and we show that the global optimum of the convex cost involves at most n+1 hidden neurons, using an approach already exploited in (Rätsch, Demiriz and Bennett, 2002) for L1-loss regression Boosting with L1 regularization of output weights. The training criterion is <>. Let us rewrite this cost function as the constrained optimization problem:

<> (C1)
<> (C2)

Using a standard technique, the above program can be recast as a linear program. Defining <> the vector of Lagrangian multipliers for the constraints C1, its dual problem (P) takes the form (in the case of a finite number J of base learners):

<>

In the case of a finite number J of base learners, <>. If the number of hidden units is uncountable, then I is a closed bounded interval of R. Such an optimization problem satisfies all the conditions needed for using Theorem 4.2 from (Hettich and Kortanek, 1993). Indeed:
- <> is compact (as a closed bounded interval of R);
- <> is a concave function (it is even a linear function);
- <> is convex in λ (it is actually linear in λ);
- <> (therefore finite) ((P) is the largest value of F satisfying the constraints);
- for every set of n+1 points <>, there exists λ̃ such that <> for <> (one can take <> since K > 0).

Then, from Theorem 4.2 of (Hettich and Kortanek, 1993), the following theorem holds:
Theorem 3.1. The solution of (P) can be attained with only n+1 of the constraints (i.e., there exists a subset of n+1 constraints giving rise to the same maximum as when using the whole set of constraints). Therefore, the associated primal problem is the minimization of the cost function of a NN with n+1 hidden neurons.

4 Incremental Convex NN Algorithm
In this section we present a stepwise algorithm to optimize a NN, and show that there is a criterion that allows one to verify whether the global optimum has been reached. This is a specialization of minimizing <>, with <> and <> the set of soft or hard linear classifiers (depending on the choice of s(·)).

Algorithm ConvexNN(D, Q, λ, s)

<>

Theorem 4.1. Algorithm ConvexNN stops when it reaches the global optimum of <>.
Proof. Let w be the output weights vector when the algorithm stops. Because the set of hidden units H we consider is such that when h is in H, -h is also in H, we can assume all weights to be non-negative. By contradiction, suppose w' ≠ w is the global optimum, with C(w') < C(w). Then, since C is convex in the output weights, for any ε ∈ (0, 1) we have C(w_ε) ≤ εC(w') + (1-ε)C(w) < C(w), where w_ε = εw' + (1-ε)w. For ε small enough, we can assume all weights in w that are strictly positive to be also strictly positive in w_ε. Let us denote by I_p the set of strictly positive weights in w (and w_ε), by I_z the set of weights set to zero in w but to a non-zero value in w_ε, and by δ_k the difference w_{ε,k} - w_k in the weight of hidden unit h_k between w and w_ε. We can assume δ_j < 0 for j ∈ I_z, because instead of setting a small positive weight to h_j, one can decrease the weight of -h_j by the same amount, which will give either the same cost, or possibly a lower one when the weight of -h_j is positive. With o(ε) denoting a quantity such that o(ε)/ε → 0 when ε → 0, the difference C(w_ε) - C(w) can now be written:

<>

since for i ∈ I_p, thanks to step (7) of the algorithm, we have ∂C/∂w_i (w) = 0. Thus the inequality <> rewrites into <>, which, when ε → 0, yields (note that <> does not depend on ε since δ_j is linear in ε):

<> (2)

h_i being the optimal classifier chosen in step (5a) or (5c), all hidden units h_j verify <>, contradicting eq. 2.

(Mason et al., 2000) prove a related global convergence result for the AnyBoost algorithm, a non-parametric Boosting algorithm that is also similar to Gradient Boosting (Friedman, 2001). Again, this requires solving as a sub-problem an exact minimization to find a function h_i ∈ H that is maximally correlated with the gradient Q' on the output. We now show a simple procedure to select a hyperplane with the best weighted classification error.
Exact Minimization
In step (5a) we are required to find a linear classifier that minimizes the weighted sum of classification errors. Unfortunately, this is an NP-hard problem (w.r.t. d, see Theorem 4 in (Marcotte and Savard, 1992)). However, an exact solution can be easily found in O(n³) computations for d = 2 inputs.

Proposition 4.2. Finding a linear classifier that minimizes the weighted sum of classification errors can be achieved in O(n³) steps when the input dimension is d = 2.
Proof. We want to <> with respect to u and b, the c_i's being in <>. Consider u fixed, sort the x_i's according to their dot product with u, and denote r the function which maps i to r(i) such that x_{r(i)} is in i-th position in the sort. Depending on the value of b, we will have n+1 possible sums, respectively <>. It is obvious that those sums only depend on the order of the dot products with u. When u varies smoothly on the unit circle, as the dot product is a continuous function of its arguments, the changes in the order of the dot products will occur only when there is a pair (i, j) such that <>. Therefore, there are at most as many order changes as there are pairs of different points, i.e., <>. In the case of d = 2, we can enumerate all the different angles for which there is a change, namely a_1, ..., a_z with <>. We then need to test at least one <> for each interval <>, and also one u for <>, which makes a total of <> possibilities.
It is possible to generalize this result to higher dimensions, and as shown in (Marcotte and Savard, 1992), one can achieve <> time.
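The enumeration idea behind Proposition 4.2 can be sketched as follows (our unoptimized NumPy illustration; a careful O(n³) implementation would update the sorted order incrementally rather than re-scoring every threshold from scratch, and the function name and the nudge parameter eps are our own assumptions).

import numpy as np

def best_weighted_linear_classifier_2d(X, y, c, eps=1e-4):
    """Exhaustive 2-D search for (u, b) minimizing sum_i c_i * 1[sign(u.x_i + b) != y_i].

    Candidate directions are the normals to all point-pair differences, nudged by
    +/- eps so that one direction falls inside every angular interval over which
    the projection order is constant.
    """
    n = len(y)
    angles = [0.0]
    for i in range(n):
        for j in range(i + 1, n):
            d = X[j] - X[i]
            if np.linalg.norm(d) > 0:
                a = np.arctan2(d[1], d[0]) + np.pi / 2.0   # direction normal to the pair
                angles += [a - eps, a + eps]
    best = (np.inf, None, None)
    for a in angles:
        u = np.array([np.cos(a), np.sin(a)])
        proj = X @ u
        p = np.sort(proj)
        # Thresholds: below all points, between consecutive points, above all points.
        cuts = np.concatenate(([p[0] - 1.0], (p[:-1] + p[1:]) / 2.0, [p[-1] + 1.0]))
        for b in cuts:
            for sign in (1.0, -1.0):
                pred = np.where(sign * (proj - b) >= 0, 1.0, -1.0)
                err = c[pred != y].sum()
                if err < best[0]:
                    best = (err, sign * u, -sign * b)
    return best   # (weighted error, u, b), decision rule sign(u.x + b)

# Toy usage: unit weights and a linearly separable labeling give zero error.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.sign(X[:, 0] + 0.1)
print(best_weighted_linear_classifier_2d(X, y, np.ones(10))[0])   # 0.0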
Algorithm 1 Optimal linear classifier search

<>

Approximate Minimization
For data in higher dimensions, the exact minimization scheme to find the optimal linear classifier is not practical. Therefore it is interesting to consider approximate schemes for obtaining a linear classifier with weighted costs. Popular schemes for doing so are the linear SVM (i.e., linear classifier with hinge loss), the logistic regression classifier, and variants of the Perceptron algorithm. In that case, step (5c) of the algorithm is not an exact minimization, and one cannot guarantee that the global optimum will be reached. However, it might be reasonable to believe that finding a linear classifier by minimizing a weighted hinge loss should yield solutions close to the exact minimization. Unfortunately, this is not generally true, as we have found out on a simple toy data set described below. On the other hand, if in step (7) one performs an optimization not only of the output weights w_j (j ≤ i) but also of the corresponding weight vectors v_j, then the algorithm finds a solution close to the global optimum (we could only verify this on 2-D data sets, where the exact solution can be computed easily). It means that at the end of each stage, one first performs a few training iterations of the whole NN (for the hidden units j ≤ i) with an ordinary gradient descent mechanism (we used conjugate gradients but stochastic gradient descent would work too), optimizing the w_j's and the v_j's, and then one fixes the v_j's and obtains the optimal w_j's for these v_j's (using a convex optimization procedure). In our experiments we used a quadratic Q, for which the optimization of the output weights can be done with a neural network, using the outputs of the hidden layer as inputs.

Let us now consider a bit more carefully what it means to tune the v_j's in step (7). Indeed, changing the weight vector v_j of a selected hidden neuron to decrease the cost is equivalent to a change in the output weights w's. More precisely, consider the step in which the value of v_j becomes v'_j. This is equivalent to the following operation on the w's, when w_j is the corresponding output weight value: the output weight associated with the value v_j of a hidden neuron is set to 0, and the output weight associated with the value v'_j of a hidden neuron is set to w_j. This corresponds to an exchange between two variables in the convex program. We are justified to take any such step as long as it allows us to decrease the cost C(w). The fact that we are simultaneously making such exchanges on all the hidden units when we tune the v_j's allows us to move faster towards the global optimum.
Extension to multiple outputs
The multiple-outputs case is more involved than the single-output case because it is not enough to check the condition <>. Consider a new hidden neuron whose output is h_i when the input is x_i. Let us also denote <> the vector of output weights between the new hidden neuron and the <> output neurons. The gradient with respect to its j-th component is <>, with <> the value of the j-th output neuron with input x_i. This means that if, for a given j, we have <>, moving that weight away from 0 can only increase the cost. Therefore, the right quantity to consider is <>. We must therefore find <>. As before, this sub-problem is not convex, but it is not as obvious how to approximate it by a convex problem.
The stopping criterion becomes: if there is no j such that <>, then all weights must remain equal to 0 and a global minimum is reached.

Experimental Results
We performed experiments on the 2-D double moon toy dataset (as used in (Delalleau, Bengio and Le Roux, 2005)), to be able to compare with the exact version of the algorithm. In these experiments, <>. The set-up is the following:
- Select a new linear classifier, either (a) the optimal one or (b) an approximate one using logistic regression.
- Optimize the output weights using a convex optimizer.
- In case (b), tune both input and output weights by conjugate gradient descent on C and finally re-optimize the output weights using LASSO regression.
- Optionally, remove neurons whose output weight has been set to 0.
Using the approximate algorithm yielded, for 100 training examples, an average penalized (λ = 1) squared error of 17.11 (over 10 runs), an average test classification error of 3.68%, and an average number of neurons of 5.5. The exact algorithm yielded a penalized squared error of 8.09, an average test classification error of 5.3%, and required 3 hidden neurons. A penalty of λ = 1 was nearly optimal for the exact algorithm, whereas a smaller penalty further improved the test classification error of the approximate algorithm. Besides, when running the approximate algorithm for a long time, it converges to a solution whose quadratic error is extremely close to that of the exact algorithm.

5 Conclusion
We have shown that training a NN can be seen as a convex optimization problem, and have analyzed an algorithm that can exactly or approximately solve this problem. We have shown that the solution with the hinge loss involves a number of non-zero weights bounded by the number of examples, and much smaller in practice. We have shown that there exists a stopping criterion to verify if the global optimum has been reached, but it involves solving a sub-learning problem involving a linear classifier with weighted errors, which can be computationally hard if the exact solution is sought, but can be easily implemented for toy data sets (in low dimension), for comparing exact and approximate solutions.
The above experimental results are in agreement with our initial conjecture: when there are many hidden units we are much less likely to stall in the optimization procedure, because there are many more ways to descend on the convex cost C(w). They also suggest, based on experiments in which we can compare with the exact sub-problem minimization, that applying Algorithm ConvexNN with an approximate minimization for adding each hidden unit, while continuing to tune the previous hidden units, tends to lead to fast convergence to the global minimum. What can get us stuck in a "local minimum" (in the traditional sense, i.e., of optimizing w's and v's together) is simply the inability to find a new hidden unit weight vector that can improve the total cost (fit and regularization term) even if there exists one.

Note that as a side-effect of the results presented here, we have a simple way to train neural networks with hard-threshold hidden units, since increasing <> can be either achieved exactly (at great price) or approximately (e.g. by using a cross-entropy or hinge loss on the corresponding linear classifier).

Acknowledgments

The authors thank the following for support: NSERC, MITACS, and the Canada Research Chairs.
They are also grateful for the feedback and stimulating exchanges with Sam Roweis, Nathan Srebro, and Aaron Courville.

References

Chvátal, V. (1983). Linear Programming. W.H. Freeman.
Delalleau, O., Bengio, Y., and Le Roux, N. (2005). Efficient non-parametric function induction in semi-supervised learning. In Cowell, R. and Ghahramani, Z., editors, Proceedings of AISTATS'2005, pages 96-103.
Freund, Y. and Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Science, 55(1):119-139.
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1180.
Hettich, R. and Kortanek, K. (1993). Semi-infinite programming: theory, methods, and applications. SIAM Review, 35(3):380-429.
Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für Operations Research (Theory), 36:517-545.
Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, pages 512-518.
Rätsch, G., Demiriz, A., and Bennett, K. P. (2002). Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323:533-536.
<> <> <>


<> <> <>
DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING


Song Han
Stanford University, Stanford, CA 94305, USA
songhan@stanford.edu

Huizi Mao
Tsinghua University, Beijing, 100084, China
mhz12@mails.tsinghua.edu.cn

William J. Dally
Stanford University, Stanford, CA 94305, USA
NVIDIA, Santa Clara, CA 95050, USA
dally@stanford.edu


ABSTRACT

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9× to 13×; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49×, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup and 3× to 7× better energy efficiency.
1 INTRODUCTION

Deep neural networks have evolved into the state-of-the-art technique for computer vision tasks (Krizhevsky et al., 2012)(Simonyan & Zisserman, 2014). Though these neural networks are very powerful, the large number of weights consumes considerable storage and memory bandwidth. For example, the AlexNet Caffemodel is over 200MB, and the VGG-16 Caffemodel is over 500MB (BVLC). This makes it difficult to deploy deep neural networks on mobile systems.
First, for many mobile-first companies such as Baidu and Facebook, various apps are updated via different app stores, and they are very sensitive to the size of the binary files. For example, the App Store has the restriction that "apps above 100 MB will not download until you connect to Wi-Fi". As a result, a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB.
Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10×, while quantization further improves the compression rate: between 27× and 31×. Huffman coding gives more compression: between 35× and 49×. The compression rate already includes the meta-data for the sparse representation. The compression scheme doesn't incur any accuracy loss.

Although having deep neural networks running on mobile has many great features such as better privacy, less network bandwidth and real time processing, the large storage overhead prevents deep neural networks from being incorporated into mobile apps.
The second issue is energy consumption. Running large neural networks requires a lot of memory bandwidth to fetch the weights and a lot of computation to do dot products, which in turn consumes considerable energy. Mobile devices are battery constrained, making power hungry applications such as deep neural networks hard to deploy.
Energy consumption is dominated by memory access. Under 45nm CMOS technology, a 32 bit floating point add consumes 0.9pJ, a 32 bit SRAM cache access takes 5pJ, while a 32 bit DRAM memory access takes 640pJ, which is three orders of magnitude more than an add operation. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1 billion connection neural network, for example, at 20fps would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access, well beyond the power envelope of a typical mobile device.
Our goal is to reduce the storage and energy required to run inference on such large networks so they can be deployed on mobile devices. To achieve this goal, we present "deep compression": a three-stage pipeline (Figure 1) to reduce the storage required by neural networks in a manner that preserves the original accuracy. First, we prune the network by removing the redundant connections, keeping only the most informative connections. Next, the weights are quantized so that multiple connections share the same weight; thus only the codebook (effective weights) and the indices need to be stored. Finally, we apply Huffman coding to take advantage of the biased distribution of effective weights.
Our main insight is that pruning and trained quantization are able to compress the network without interfering with each other, thus leading to a surprisingly high compression rate. It makes the required storage so small (a few megabytes) that all weights can be cached on chip instead of going to off-chip DRAM, which is energy consuming. Based on "deep compression", the EIE hardware accelerator (Han et al., 2016) was later proposed, which works on the compressed model, achieving significant speedup and energy efficiency improvement.

2 NETWORK PRUNING

Network pruning has been widely studied to compress CNN models. In early work, network pruning proved to be a valid way to reduce the network complexity and over-fitting (LeCun et al., 1989; Hanson & Pratt, 1989; Hassibi et al., 1993; Ström, 1997). Recently Han et al. (2015) pruned state-of-the-art CNN models with no loss of accuracy. We build on top of that approach. As shown on the left side of Figure 1, we start by learning the connectivity via normal network training. Next, we prune the small-weight connections: all connections with weights below a threshold are removed from the network. Finally, we retrain the network to learn the final weights for the remaining sparse connections.
Pruning reduced the number of parameters by 9× and 13× for the AlexNet and VGG-16 models.
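A minimal NumPy sketch of this magnitude-based pruning step (our illustration; the paper's implementation applies per-blob masks inside Caffe, and the helper name below is hypothetical):

import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero out the smallest-magnitude entries of W so that a fraction
    `sparsity` of the weights is removed; return pruned weights and mask."""
    k = int(np.round(sparsity * W.size))
    threshold = np.sort(np.abs(W), axis=None)[k - 1] if k > 0 else -np.inf
    mask = np.abs(W) > threshold
    return W * mask, mask

# During retraining, the mask is re-applied after every update so that pruned
# connections stay at zero, e.g.:  W = (W - lr * grad) * mask
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned, mask = prune_by_magnitude(W, sparsity=0.75)
print(mask.sum(), "weights kept out of", W.size)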
Figure 2: Representing the matrix sparsity with relative index. Padding filler zeros to prevent overflow.
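To make the relative-index scheme of Figure 2 concrete (the scheme is described in the text that follows Figure 3 below), here is a small sketch of ours; the 3-bit example bound of 8 follows the text, and the function name is hypothetical.

import numpy as np

def relative_index_encode(weights, max_diff=8):
    """Encode a sparse 1-D weight vector as (index_diff, value) pairs.
    If the gap between consecutive non-zeros exceeds `max_diff` (the 3-bit
    example bound used in the text), emit filler zero entries so that every
    stored difference stays within range."""
    diffs, values = [], []
    last = -1
    for idx in np.flatnonzero(weights):
        gap = idx - last
        while gap > max_diff:          # pad with filler zeros
            diffs.append(max_diff)
            values.append(0.0)
            last += max_diff
            gap = idx - last
        diffs.append(gap)
        values.append(float(weights[idx]))
        last = idx
    return diffs, values

w = np.zeros(20); w[3] = 0.5; w[15] = -1.2
print(relative_index_encode(w))   # gap of 12 between non-zeros -> one filler zero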
Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).

We store the sparse structure that results from pruning using compressed sparse row (CSR) or compressed sparse column (CSC) format, which requires 2a + n + 1 numbers, where a is the number of non-zero elements and n is the number of rows or columns.
To compress further, we store the index difference instead of the absolute position, and encode this difference in 8 bits for conv layers and 5 bits for fc layers. When we need an index difference larger than the bound, we use the zero padding solution shown in Figure 2: when the difference exceeds 8, the largest 3-bit (as an example) unsigned number, we add a filler zero.

3 TRAINED QUANTIZATION AND WEIGHT SHARING

Network quantization and weight sharing further compress the pruned network by reducing the number of bits required to represent each weight. We limit the number of effective weights we need to store by having multiple connections share the same weight, and then fine-tune those shared weights.
Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4 output neurons; the weight is a 4×4 matrix. On the top left is the 4×4 weight matrix, and on the bottom left is the 4×4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors); all the weights in the same bin share the same value, thus for each weight we then need to store only a small index into a table of shared weights. During update, all the gradients are grouped by color and summed together, multiplied by the learning rate and subtracted from the shared centroids from the last iteration. For pruned AlexNet, we are able to quantize to 8 bits (256 shared weights) for each CONV layer, and 5 bits (32 shared weights) for each FC layer without any loss of accuracy.
To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. In general, for a network with n connections and each connection represented with b bits, constraining the connections to have only k shared weights will result in a compression rate of:

r = n b / (n log2(k) + k b)    (1)

For example, Figure 3 shows the weights of a single layer neural network with four input units and four output units. There are 4 × 4 = 16 weights originally but there are only 4 shared weights: similar weights are grouped together to share the same value.
Figure 4: Left: Three different methods for centroids initialization. Right: Distribution of weights (blue) and distribution of codebook before (green cross) and after fine-tuning (red dot).

Originally we needed to store 16 weights, each with 32 bits; now we need to store only 4 effective weights (blue, green, red and orange), each with 32 bits, together with 16 2-bit indices, giving a compression rate of 16 × 32 / (4 × 32 + 2 × 16) = 3.2×.

3.1 WEIGHT SHARING

We use k-means clustering to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster will share the same weight. Weights are not shared across layers. We partition n original weights W = {w_1, ..., w_n} into k clusters C = {c_1, ..., c_k}, n ≫ k, so as to minimize the within-cluster sum of squares (WCSS):

arg min_C Σ_{i=1}^{k} Σ_{w ∈ c_i} |w - c_i|²    (2)

Different from HashNet (Chen et al., 2015), where weight sharing is determined by a hash function before the network sees any training data, our method determines weight sharing after a network is fully trained, so that the shared weights approximate the original network.

3.2 INITIALIZATION OF SHARED WEIGHTS

Centroid initialization impacts the quality of clustering and thus affects the network's prediction accuracy. We examine three initialization methods: Forgy (random), density-based, and linear initialization. In Figure 4 we plot the original weights' distribution of the conv3 layer in AlexNet (CDF in blue, PDF in red). The weights form a bimodal distribution after network pruning. On the bottom it plots the effective weights (centroids) with the 3 different initialization methods (shown in blue, red and yellow). In this example, there are 13 clusters.
Forgy (random) initialization randomly chooses k observations from the data set and uses these as the initial centroids. The initialized centroids are shown in yellow. Since there are two peaks in the bimodal distribution, the Forgy method tends to concentrate around those two peaks.
Density-based initialization linearly spaces the CDF of the weights on the y-axis, then finds the horizontal intersection with the CDF, and finally finds the vertical intersection on the x-axis, which becomes a centroid, as shown in blue dots. This method makes the centroids denser around the two peaks, but more scattered than the Forgy method.
Linear initialization linearly spaces the centroids between the [min, max] of the original weights. This initialization method is invariant to the distribution of the weights and is the most scattered compared with the former two methods.
Larger weights play a more important role than smaller weights (Han et al., 2015), but there are fewer of these large weights. Thus for both Forgy initialization and density-based initialization, very few centroids have large absolute values, which results in poor representation of these few large weights. Linear initialization does not suffer from this problem.
Figure 5: Distribution for weight (left) and index (right). The distribution is biased.

The experiment section compares the accuracy of the different initialization methods after clustering and fine-tuning, showing that linear initialization works best.

3.3 FEED-FORWARD AND BACK-PROPAGATION

The centroids of the one-dimensional k-means clustering are the shared weights. There is one level of indirection during the feed-forward phase and the back-propagation phase looking up the weight table. An index into the shared weight table is stored for each connection. During back-propagation, the gradient for each shared weight is calculated and used to update the shared weight. This procedure is shown in Figure 3.
We denote the loss by L, the weight in the ith column and jth row by W_ij, the centroid index of element W_ij by I_ij, and the kth centroid of the layer by C_k. By using the indicator function 1(.), the gradient of the centroids is calculated as:

∂L/∂C_k = Σ_{i,j} (∂L/∂W_ij) (∂W_ij/∂C_k) = Σ_{i,j} (∂L/∂W_ij) 1(I_ij = k)    (3)

4 HUFFMAN CODING

A Huffman code is an optimal prefix code commonly used for lossless data compression (Van Leeuwen, 1976). It uses variable-length codewords to encode source symbols. The table is derived from the occurrence probability of each symbol. More common symbols are represented with fewer bits.
Figure 5 shows the probability distribution of the quantized weights and the sparse matrix index of the last fully connected layer in AlexNet. Both distributions are biased: most of the quantized weights are distributed around the two peaks; the sparse matrix index differences are rarely above 20. Experiments show that Huffman coding these non-uniformly distributed values saves 20% to 30% of network storage.

5 EXPERIMENTS

We pruned, quantized, and Huffman encoded four networks: two on the MNIST and two on the ImageNet datasets. The network parameters and accuracy before and after pruning are shown in Table 1 (reference models are from the Caffe model zoo; accuracy is measured without data augmentation). The compression pipeline saves network storage by 35× to 49× across different networks without loss of accuracy. The total size of AlexNet decreased from 240MB to 6.9MB, which is small enough to be put into on-chip SRAM, eliminating the need to store the model in energy-consuming DRAM memory.
Training is performed with the Caffe framework (Jia et al., 2014). Pruning is implemented by adding a mask to the blobs to mask out the update of the pruned connections. Quantization and weight sharing are implemented by maintaining a codebook structure that stores the shared weights, and grouping by index after calculating the gradient of each layer. Each shared weight is updated with all the gradients that fall into that bucket. Huffman coding doesn't require training and is implemented offline after all the fine-tuning is finished.

5.1 LENET-300-100 AND LENET-5 ON MNIST

We first experimented on the MNIST dataset with the LeNet-300-100 and LeNet-5 networks (LeCun et al., 1998).

Table 1: The compression pipeline can save 35× to 49× parameter storage with no loss of accuracy.
Table 2: Compression statistics for LeNet-300-100. P: pruning, Q: quantization, H: Huffman coding.
Table 3: Compression statistics for LeNet-5. P: pruning, Q: quantization, H: Huffman coding.
LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each, which achieves a 1.6% error rate on MNIST. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers, which achieves a 0.8% error rate on MNIST. Table 2 and Table 3 show the statistics of the compression pipeline. The compression rate includes the overhead of the codebook and sparse indexes. Most of the saving comes from pruning and quantization (compressed 32×), while Huffman coding gives a marginal gain (compressed 40×).

5.2 ALEXNET ON IMAGENET

We further examine the performance of Deep Compression on the ImageNet ILSVRC-2012 dataset, which has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as the reference model, which has 61 million parameters and achieved a top-1 accuracy of 57.2% and a top-5 accuracy of 80.3%. Table 4 shows that AlexNet can be compressed to 2.88% of its original size without impacting accuracy. There are 256 shared weights in each CONV layer, which are encoded with 8 bits, and 32 shared weights in each FC layer, which are encoded with only 5 bits. The relative sparse index is encoded with 4 bits. Huffman coding compresses an additional 22%, resulting in 35× compression in total.

5.3 VGG-16 ON IMAGENET

With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16 (Simonyan & Zisserman, 2014), on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional layers but still only three fully-connected layers. Following a similar methodology, we aggressively compressed both convolutional and fully-connected layers to realize a significant reduction in the number of effective weights, shown in Table 5.
The VGG-16 network as a whole has been compressed by 49×. Weights in the CONV layers are represented with 8 bits, and FC layers use 5 bits, which does not impact the accuracy. The two largest fully-connected layers can each be pruned to less than 1.6% of their original size.

Table 4: Compression statistics for AlexNet. P: pruning, Q: quantization, H: Huffman coding.
Table 5: Compression statistics for VGG-16. P: pruning, Q: quantization, H: Huffman coding.
This reduction is critical for real-time image processing, where there is little reuse of these layers across images (unlike batch processing). This is also critical for fast object detection algorithms where one CONV pass is used by many FC passes. The reduced layers will fit in an on-chip SRAM and have modest bandwidth requirements. Without the reduction, the bandwidth requirements are prohibitive.

6 DISCUSSIONS

6.1 PRUNING AND QUANTIZATION WORKING TOGETHER

Figure 6 shows the accuracy at different compression rates for pruning and quantization together or individually. When working individually, as shown in the purple and yellow lines, the accuracy of the pruned network begins to drop significantly when compressed below 8% of its original size; the accuracy of the quantized network also begins to drop significantly when compressed below 8% of its original size. But when combined, as shown in the red line, the network can be compressed to 3% of its original size with no loss of accuracy. The far right side compares the result of SVD, which is inexpensive but has a poor compression rate.
The three plots in Figure 7 show how accuracy drops with fewer bits per connection for CONV layers (left), FC layers (middle) and all layers (right). Each plot reports both top-1 and top-5 accuracy. Dashed lines applied only quantization, without pruning; solid lines did both quantization and pruning. There is very little difference between the two. This shows that pruning works well with quantization.
Quantization works well on the pruned network because unpruned AlexNet has 60 million weights to quantize, while pruned AlexNet has only 6.7 million weights to quantize. Given the same number of centroids, the latter has less error.
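A minimal sketch of the trained-quantization step discussed above (our illustration of the per-layer k-means codebook of Sec. 3.1, using scikit-learn's KMeans rather than the authors' code; function name and bit width are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def quantize_layer(W, bits=2, seed=0):
    """Cluster the weights of one layer into 2**bits shared values (Eq. 2)
    and return (codebook, per-weight indices, quantized weights)."""
    k = 2 ** bits
    flat = W.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(flat)
    codebook = km.cluster_centers_.ravel()         # k shared weights
    indices = km.labels_.reshape(W.shape)          # log2(k)-bit index per weight
    return codebook, indices, codebook[indices]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
codebook, idx, W_q = quantize_layer(W, bits=2)
print(codebook.shape, idx.shape, np.abs(W - W_q).mean())

During fine-tuning, gradients would be summed per index (per color in Figure 3) and applied to the codebook entries, following Eq. (3).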
Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and quantization work best when combined.
Figure 7: Pruning doesn't hurt quantization. Dashed: quantization on the unpruned network. Solid: quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits whether or not the network has been pruned. Although pruning reduces the number of parameters, quantization still works as well as on the unpruned network, or even better (the 3-bit case in the left figure).
Figure 8: Accuracy of different initialization methods. Left: top-1 accuracy. Right: top-5 accuracy. Linear initialization gives the best result.

The first two plots in Figure 7 show that CONV layers require more bits of precision than FC layers. For CONV layers, accuracy drops significantly below 4 bits, while the FC layers are more robust: not until 2 bits did the accuracy drop significantly.

6.2 CENTROID INITIALIZATION

Figure 8 compares the accuracy of the three different initialization methods with respect to top-1 accuracy (left) and top-5 accuracy (right). The network is quantized to 2 to 8 bits as shown on the x-axis. Linear initialization outperforms the density initialization and random initialization in all cases except at 3 bits.
The initial centroids of linear initialization spread equally across the x-axis, from the min value to the max value. That helps to maintain the large weights, as the large weights play a more important role than smaller ones, which is also shown in network pruning (Han et al., 2015). Neither random nor density-based initialization retains large centroids. With these initialization methods, large weights are clustered to the small centroids because there are few large weights. In contrast, linear initialization allows large weights a better chance to form a large centroid.
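The three initialization schemes compared here can be sketched as follows (our NumPy illustration of Sec. 3.2; the density-based variant is a simple quantile-based CDF inversion and may differ in detail from the authors' implementation):

import numpy as np

def init_centroids(weights, k, method="linear", seed=0):
    """Return k initial centroids for the flattened weights of one layer."""
    w = np.asarray(weights).ravel()
    rng = np.random.default_rng(seed)
    if method == "forgy":            # k randomly chosen observations
        return rng.choice(w, size=k, replace=False)
    if method == "density":          # equally spaced points on the CDF (y-axis),
        qs = (np.arange(k) + 0.5) / k    # mapped back to weight values (x-axis)
        return np.quantile(w, qs)
    if method == "linear":           # equally spaced over [min, max]
        return np.linspace(w.min(), w.max(), k)
    raise ValueError(method)

w = np.random.default_rng(0).normal(size=1000)
for m in ("forgy", "density", "linear"):
    print(m, np.round(init_centroids(w, 5, m), 2))

On a bimodal weight distribution, the forgy and density variants concentrate centroids near the peaks, while the linear variant keeps centroids near the extremes, matching the behavior described above.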
Figure 9: Compared with the original network, the pruned network layers achieved a 3× speedup on CPU, 3.5× on GPU and 4.2× on mobile GPU on average. Batch size = 1, targeting real-time processing. Performance numbers normalized to CPU.
Figure 10: Compared with the original network, the pruned network layers take 7× less energy on CPU, 3.3× less on GPU and 4.2× less on mobile GPU on average. Batch size = 1, targeting real-time processing. Energy numbers normalized to CPU.

6.3 SPEEDUP AND ENERGY EFFICIENCY

Deep Compression is targeting extremely latency-focused applications running on mobile, which require real-time inference, such as pedestrian detection on an embedded processor inside an autonomous vehicle. Waiting for a batch to assemble significantly adds latency. So when benchmarking the performance and energy efficiency, we consider the case when batch size = 1. The cases of batching are given in Appendix A.
Fully connected layers dominate the model size (more than 90%) and are compressed the most by Deep Compression (96% of weights pruned in VGG-16). In state-of-the-art object detection algorithms such as Fast R-CNN (Girshick, 2015), up to 38% of computation time is consumed on FC layers on the uncompressed model. So it is interesting to benchmark on FC layers, to see the effect of Deep Compression on performance and energy. Thus we set up our benchmark on the FC6, FC7, FC8 layers of AlexNet and VGG-16. In the non-batched case, the activation matrix is a vector with just one column, so the computation boils down to dense / sparse matrix-vector multiplication for the original / pruned model, respectively. Since current BLAS libraries on CPU and GPU don't support indirect look-up and relative indexing, we didn't benchmark the quantized model.
We compare three different off-the-shelf hardware platforms: the NVIDIA GeForce GTX Titan X and the Intel Core i7 5930K as desktop processors (same package as the NVIDIA Digits Dev Box) and the NVIDIA Tegra K1 as a mobile processor. To run the benchmark on GPU, we used cuBLAS GEMV for the original dense layer. For the pruned sparse layer, we stored the sparse matrix in CSR format, and used the cuSPARSE CSRMV kernel, which is optimized for sparse matrix-vector multiplication on GPU. To run the benchmark on CPU, we used MKL CBLAS GEMV for the original dense model and MKL SPBLAS CSRMV for the pruned sparse model.
To compare power consumption between different systems, it is important to measure power in a consistent manner (NVIDIA, b). For our analysis, we are comparing the pre-regulation power of the entire application processor (AP) / SoC and DRAM combined. On CPU, the benchmark is running on a single socket with a single Haswell-E class Core i7-5930K processor. CPU socket and DRAM power are as reported by the pcm-power utility provided by Intel. For GPU, we used the nvidia-smi utility to report the power of Titan X. For mobile GPU, we use a Jetson TK1 development board and measured the total power consumption with a power meter. We assume 15% AC to DC conversion loss, 85% regulator efficiency and 15% power consumed by peripheral components (NVIDIA, a) to report the AP+DRAM power for Tegra K1.

Table 6: Accuracy of AlexNet with different aggressiveness of weight sharing and quantization. 8/5 bit quantization has no loss of accuracy; 8/4 bit quantization, which is more hardware friendly, has a negligible loss of accuracy of 0.01%; to be really aggressive, 4/2 bit quantization resulted in 1.99% and 2.60% loss of accuracy.
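To illustrate the non-batched FC-layer benchmark setup described in Sec. 6.3, here is a small CPU-only sketch of ours, using SciPy's CSR matrix-vector product as a stand-in for MKL SPBLAS / cuSPARSE CSRMV; the layer size and density are illustrative and absolute timings will differ from the paper's hardware.

import time
import numpy as np
import scipy.sparse as sp

def time_it(fn, repeats=50):
    fn()                                  # warm-up
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

n_out, n_in, density = 4096, 4096, 0.1    # roughly 90%-pruned FC layer (illustrative)
rng = np.random.default_rng(0)
W_dense = rng.normal(size=(n_out, n_in)).astype(np.float32)
mask = rng.random((n_out, n_in)) < density
W_sparse = sp.csr_matrix(np.where(mask, W_dense, 0.0))
x = rng.normal(size=n_in).astype(np.float32)   # batch size = 1: a single activation vector

t_dense = time_it(lambda: W_dense @ x)
t_sparse = time_it(lambda: W_sparse @ x)
print(f"dense {t_dense*1e3:.2f} ms  sparse {t_sparse*1e3:.2f} ms  "
      f"speed-up {t_dense / t_sparse:.1f}x")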
The ratio of memory access to computation is different with and without batching. When the input activations are batched into a matrix, the computation becomes matrix-matrix multiplication, where locality can be improved by blocking. The matrix can be blocked to fit in caches and reused efficiently. In this case, the amount of memory access is O(n^2) and the amount of computation is O(n^3), so the ratio between memory access and computation is on the order of 1/n.
In real-time processing, when batching is not allowed, the input activation is a single vector and the computation is matrix-vector multiplication. In this case, the amount of memory access is O(n^2) and the computation is O(n^2); memory access and computation are of the same magnitude (as opposed to 1/n). That indicates MV is more memory-bound than MM. So reducing the memory footprint is critical for the non-batching case.

Figure 9 illustrates the speedup of pruning on different hardware. There are 6 columns for each benchmark, showing the computation time of CPU / GPU / TK1 on the dense / pruned network. Time is normalized to CPU. When batch size = 1, the pruned network layer obtained 3x to 4x speedup over the dense network on average because it has a smaller memory footprint and alleviates the data-transfer overhead, especially for large matrices that are unable to fit into the caches. For example, VGG-16's FC6 layer, the largest layer in our experiment, contains 400MB of data, which is far beyond the capacity of the L3 cache.

In latency-tolerant applications, batching improves memory locality, where weights can be blocked and reused in matrix-matrix multiplication. In this scenario, the pruned network no longer shows its advantage. We give detailed timing results in Appendix A.

Figure 10 illustrates the energy efficiency of pruning on different hardware. We multiply power consumption by computation time to get energy consumption, then normalize to CPU to get energy efficiency. When batch size = 1, the pruned network layer consumes 3x to 7x less energy than the dense network on average. As reported by nvidia-smi, GPU utilization is 99% for both the dense and sparse cases.

6.4 RATIO OF WEIGHTS, INDEX AND CODEBOOK

Pruning makes the weight matrix sparse, so extra space is needed to store the indexes of the non-zero elements. Quantization adds storage for a codebook. The experiment section has already included these two factors. Figure 11 shows the breakdown of the three components when quantizing four networks. Since on average both the weights and the sparse indexes are encoded with 5 bits, their storage is roughly half and half. The overhead of the codebook is very small and often negligible.
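The half-and-half split claimed above follows directly from the encoding: each retained weight costs 5 bits for its cluster index and 5 bits for its relative position index, while the codebook is a fixed, tiny cost. A back-of-the-envelope sketch (the non-zero count is an illustrative assumption, not a figure from the paper):

    # Storage breakdown under 5-bit quantized weights, 5-bit relative indexes,
    # and a 32-entry fp32 codebook.
    def storage_breakdown(n_nonzero, weight_bits=5, index_bits=5, codebook_entries=32):
        weights = n_nonzero * weight_bits
        index = n_nonzero * index_bits
        codebook = codebook_entries * 32          # fp32 centroids
        total = weights + index + codebook
        return {"weights": weights / total, "index": index / total,
                "codebook": codebook / total}

    print(storage_breakdown(n_nonzero=1_000_000))
    # weights and index each ~50%, codebook ~0.01%, consistent with Figure 11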
Figure 11: Storage ratio of weights, index and codebook.

Table 7: Comparison with other compression methods on AlexNet. (Collins & Kohli, 2014) reduced the parameters by 4x but with inferior accuracy. Deep Fried Convnets (Yang et al., 2014) worked on the fully connected layers and reduced the parameters by less than 4x. SVD saves parameters but suffers from a large accuracy loss of as much as 2%. Network pruning (Han et al., 2015) reduced the parameters by 9x, not including index overhead. On other networks similar to AlexNet, (Denton et al., 2014) exploited the linear structure of convnets and compressed the network by 2.4x to 13.4x layer-wise, with 0.9% accuracy loss on compressing a single layer. (Gong et al., 2014) experimented with vector quantization and compressed the network by 16x to 24x, incurring 1% accuracy loss.
7 RELATED WORK

Neural networks are typically over-parametrized, and there is significant redundancy in deep learning models (Denil et al., 2013). This results in a waste of both computation and memory. There have been various proposals to remove the redundancy: Vanhoucke et al. (2011) explored a fixed-point implementation with 8-bit integer (vs 32-bit floating point) activations. Hwang & Sung (2014) proposed an optimization method for fixed-point networks with ternary weights and 3-bit activations. Anwar et al. (2015) quantized the neural network using L2 error minimization and achieved better accuracy on the MNIST and CIFAR-10 datasets. Denton et al. (2014) exploited the linear structure of the neural network by finding an appropriate low-rank approximation of the parameters while keeping the accuracy within 1% of the original model.
The empirical success in this paper is consistent with the theoretical study of random-like sparse networks with +1/0/-1 weights (Arora et al., 2014), which have been proved to enjoy nice properties (e.g. reversibility) and to allow a provably polynomial-time algorithm for training.
Much work has focused on binning the network parameters into buckets, so that only the values in the buckets need to be stored. HashedNets (Chen et al., 2015) reduces model size by using a hash function to randomly group connection weights, so that all connections within the same hash bucket share a single parameter value. In their method, the weight binning is pre-determined by the hash function instead of being learned through training, which does not capture the nature of images. Gong et al. (2014) compressed deep convnets using vector quantization, which resulted in 1% accuracy loss. Both methods studied only the fully connected layers, ignoring the convolutional layers.
There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling. The Network in Network architecture (Lin et al., 2013) and GoogLeNet (Szegedy et al., 2014) achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This problem was noted by Szegedy et al. (2014) and motivated them to add a linear layer on top of their networks to enable transfer learning.
Network pruning has been used both to reduce network complexity and to reduce over-fitting. An early approach to pruning was biased weight decay (Hanson & Pratt, 1989). Optimal Brain Damage (LeCun et al., 1989) and Optimal Brain Surgeon (Hassibi et al., 1993) prune networks to reduce the number of connections based on the Hessian of the loss function, and suggest that such pruning is more accurate than magnitude-based pruning such as weight decay. A recent work (Han et al., 2015) successfully pruned several state-of-the-art large-scale networks and showed that the number of parameters could be reduced by an order of magnitude. There are also attempts to reduce the number of activations for both compression and acceleration (Van Nguyen et al., 2015).
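To make the contrast with the hashing-trick approaches above concrete, the following is a minimal sketch of the weight-sharing scheme described for HashedNets: every (i, j) connection is mapped by a hash function into one of K buckets, and all connections in a bucket share one trained parameter. The sizes and the cheap hash used here are assumptions for illustration, not the original implementation.

    import numpy as np

    def hashed_layer_weights(n_out, n_in, real_params, seed=0):
        # real_params: the K stored parameters; the virtual n_out x n_in
        # weight matrix is materialized by hashing each (row, col) to a bucket.
        K = real_params.shape[0]
        rows, cols = np.meshgrid(np.arange(n_out), np.arange(n_in), indexing="ij")
        bucket = (rows * 2654435761 + cols * 40503 + seed) % K   # stand-in hash
        return real_params[bucket]

    params = np.random.randn(1024)            # only 1K real parameters stored
    W_virtual = hashed_layer_weights(256, 512, params)
    print(W_virtual.shape)                    # (256, 512): 128K virtual weights

The key difference from pruning and trained quantization is that here the grouping is fixed by the hash before training, whereas Deep Compression learns which weights to keep and which to share.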
8 FUTURE WORK

While the pruned network has been benchmarked on various hardware, the quantized network with weight sharing has not, because off-the-shelf cuSPARSE and MKL SPBLAS libraries do not support indirect matrix entry lookup, nor is the relative index in CSC or CSR format supported. So the full advantage of Deep Compression, fitting the model in cache, is not fully unveiled. A software solution is to write customized GPU kernels that support this. A hardware solution is to build a custom ASIC architecture specialized to traverse the sparse and quantized network structure, which also supports customized quantization bit widths. We expect this architecture to have energy dominated by on-chip SRAM access instead of off-chip DRAM access.

9 CONCLUSION

We have presented "Deep Compression", which compresses neural networks without affecting accuracy. Our method operates by pruning the unimportant connections, quantizing the network using weight sharing, and then applying Huffman coding. We highlight our experiments on AlexNet, which reduced the weight storage by 35x without loss of accuracy. We show similar results for VGG-16 and LeNet networks, compressed by 49x and 39x without loss of accuracy. This leads to a smaller storage requirement for putting convnets into mobile apps. After Deep Compression the sizes of these networks fit into on-chip SRAM cache (5pJ/access) rather than requiring off-chip DRAM memory (640pJ/access). This potentially makes deep neural networks more energy efficient to run on mobile. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained.

REFERENCES
Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convolutional neural networks for object recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 1131-1135. IEEE, 2015.
Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. Provable bounds for learning some deep representations. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pp. 584-592, 2014.
BVLC. Caffe model zoo. URL http://caffe.berkeleyvision.org/model_zoo.
Chen, Wenlin, Wilson, James T., Tyree, Stephen, Weinberger, Kilian Q., and Chen, Yixin. Compressing neural networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.
Collins, Maxwell D and Kohli, Pushmeet. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014.
Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148-2156, 2013.
Denton, Emily L, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269-1277, 2014.
Girshick, Ross. Fast R-CNN. arXiv preprint arXiv:1504.08083, 2015.
Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
Han, Song, Pool, Jeff, Tran, John, and Dally, William J. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, 2015.
Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A, and Dally, William J. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint arXiv:1602.01528, 2016.
Hanson, Stephen José and Pratt, Lorien Y. Comparing biases for minimal network construction with back-propagation. In Advances in Neural Information Processing Systems, pp. 177-185, 1989.
Hassibi, Babak, Stork, David G, et al. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems, pp. 164-164, 1993.
Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1-6. IEEE, 2014.
Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097-1105, 2012.
LeCun, Yann, Denker, John S, Solla, Sara A, Howard, Richard E, and Jackel, Lawrence D. Optimal brain damage. In NIPS, volume 89, 1989.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv:1312.4400, 2013.
NVIDIA. Technical brief: NVIDIA Jetson TK1 development kit bringing GPU-accelerated computing to embedded systems, a. URL http://www.nvidia.com.
NVIDIA. Whitepaper: GPU-based deep learning inference: A performance and power analysis, b. URL http://www.nvidia.com/object/white-papers.html.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Ström, Nikko. Phoneme probability estimation with dynamic sparsely connected artificial neural networks. The Free Speech Journal, 1(5):1-41, 1997.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
Van Leeuwen, Jan. On the construction of Huffman trees. In ICALP, pp. 382-410, 1976.
Van Nguyen, Hien, Zhou, Kevin, and Vemulapalli, Raviteja. Cross-domain synthesis of medical images using efficient location-sensitive deep network. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pp. 677-684. Springer, 2015.
Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. arXiv preprint arXiv:1412.7149, 2014.

A APPENDIX: DETAILED TIMING / POWER REPORTS OF DENSE & SPARSE NETWORK LAYERS

Table 8: Average time on different layers. To avoid variance, we measured the time spent on each layer for 4096 input samples, and averaged the time per input sample. For GPU, the time consumed by cudaMalloc and cudaMemcpy is not counted. For batch size = 1, gemv is used; for batch size = 64, gemm is used.
For the sparse case, csrmv and csrmm are used, respectively.
Table 9: Power consumption of different layers. We measured the Titan X GPU power with nvidia-smi, the Core i7-5930K CPU power with pcm-power, and the Tegra K1 mobile GPU power with an external power meter (scaled to AP+DRAM, see the discussion in the paper). During power measurement, we repeated each computation multiple times in order to get stable numbers. On CPU, dense matrix multiplications consume 2x the energy of sparse ones because they are accelerated with multi-threading.
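For readers combining Table 8 and Table 9: the energy-efficiency numbers behind Figure 10 are obtained as energy = measured power x measured time per layer, normalized to the CPU dense result. A small sketch of that bookkeeping (the power and time values below are placeholders, not measurements from the paper):

    def energy_ratio(power_w, time_s, baseline_energy_j):
        # higher ratio = more energy efficient than the CPU dense baseline
        return baseline_energy_j / (power_w * time_s)

    cpu_baseline = 80.0 * 0.020          # e.g. 80 W for 20 ms on dense CPU GEMV
    measurements = {"GPU dense": (150.0, 0.003), "GPU sparse": (140.0, 0.001)}
    for name, (p, t) in measurements.items():
        print(name, round(energy_ratio(p, t, cpu_baseline), 1), "x vs CPU")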
> +<> <> <> \ No newline at end of file diff --git a/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt b/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt deleted file mode 100644 index 9906917..0000000 --- a/Corpus/Channel Pruning for Accelerating Very Deep Neural Networks - He.txt +++ /dev/null @@ -1,391 +0,0 @@ - Channel Pruning for Accelerating Very Deep Neural Networks - - - Yihui He * Xiangyu Zhang Jian Sun - Xi’an Jiaotong University Megvii Inc. Megvii Inc. - Xi’an, 710049, China Beijing, 100190, China Beijing, 100190, China - heyihui@stu.xjtu.edu.cn zhangxiangyu@megvii.com sunjian@megvii.com - - - - Abstract W1 - - In this paper, we introduce a new channel pruning number of channels - nonlinear method to accelerate very deep convolutional neural net- - works. Given a trained CNN model, we propose an it- - erative two-step algorithm to effectively prune each layer, W2 - by a LASSO regression based channel selection and least nonlinear - square reconstruction. We further generalize this algorithm - to multi-layer and multi-branch cases. Our method re- W3 - duces the accumulated error and enhance the compatibility - with various architectures. Our pruned VGG-16 achieves (a) (b) (c) (d) - the state-of-the-art results by5×speed-up along with only Figure 1. Structured simplification methods that accelerate CNNs: 0.3% increase of error. More importantly, our method is (a) a network with 3 conv layers. (b) sparse connection deacti- - able to accelerate modern networks like ResNet, Xception vates some connections between channels. (c) tensor factorization - and suffers only 1.4%, 1.0% accuracy loss under2×speed- factorizes a convolutional layer into several pieces. (d) channel - up respectively, which is significant. pruning reduces number of channels in each layer (focus of this - paper). - - 1. Introduction a network into thinner one, as shown in Fig.1(d). It is effi- - Recent CNN acceleration works fall into three cate- cient on both CPU and GPU because no special implemen- - gories: optimized implementation (e.g., FFT [47]), quan- tation is required. - tization (e.g., BinaryNet [8]), and structured simplification Pruning channels is simple but challenging because re- - that convert a CNN into compact one [22]. This work fo- moving channels in one layer might dramatically change - cuses on the last one. the input of the following layer. Recently,training-based - Structured simplification mainly involves: tensor fac- channel pruning works [1,48] have focused on imposing - torization [22], sparse connection [17], and channel prun- sparse constrain on weights during training, which could - ing [48]. Tensor factorization factorizes a convolutional adaptively determine hyper-parameters. However, training - layer into several efficient ones (Fig.1(c)). However, fea- from scratch is very costly and results for very deep CNNs - ture map width (number of channels) could not be reduced, on ImageNet have been rarely reported.Inference-timeat- - which makes it difficult to decompose1×1convolutional tempts [31,3] have focused on analysis of the importance - layer favored by modern networks (e.g., GoogleNet [45], of individual weight. The reported speed-up ratio is very - ResNet [18], Xception [7]). This type of method also intro- limited. - duces extra computation overhead. Sparse connection deac- In this paper, we propose a new inference-time approach - tivates connections between neurons or channels (Fig.1(b)). 
for channel pruning, utilizing redundancy inter channels. - Though it is able to achieves high theoretical speed-up ratio, Inspired by tensor factorization improvement by feature - the sparse convolutional layers have an ”irregular” shape maps reconstruction [52], instead of analyzing filter weights - which is not implementation friendly. In contrast, channel [22,31], we fully exploits redundancy within feature maps. - pruning directly reduces feature map width, which shrinks Specifically, given a trained CNN model, pruning each layer - is achieved by minimizing reconstruction error on its output - * This work was done when Yihui He was an intern at Megvii Inc. feature maps, as showned in Fig.2. We solve this mini- - - - - 1389 A B C maps. There are several training-based approaches. [1,48] - W regularize networks to improve accuracy. Channel-wise - SSL [48] reaches high compression ratio for first few conv - layers of LeNet [30] and AlexNet [26]. However,training- kh kc w basedapproaches are more costly, and the effectiveness for - c n very deep networks on large datasets is rarely exploited. nonlinear nonlinear - Figure 2. Channel pruning for accelerating a convolutional layer. Inference-time channel pruning is challenging, as re- - We aim to reduce the width of feature map B, while minimizing ported by previous works [2,39]. Some works [44,34,19] - the reconstruction error on feature map C. Our optimization algo- focus on model size compression, which mainly operate the - rithm (Sec. 3.1) performs within the dotted box, which does not fully connected layers. Data-free approaches [31,3] results - involve nonlinearity. This figure illustrates the situation that two for speed-up ratio (e.g.,5×) have not been reported, and - channels are pruned for feature map B. Thus corresponding chan- requires long retraining procedure. [3] select channels via - nels of filtersWcan be removed. Furthermore, even though not over 100 random trials, however it need long time to evalu- directly optimized by our algorithm, the corresponding filters in ate each trial on a deep network, which makes it infeasible the previous layer can also be removed (marked by dotted filters). to work on very deep models and large datasets. [31] is even c,n: number of channels for feature maps B and C,kh ×kw : worse than naive solution from our observation sometimes kernel size. (Sec.4.1.1). - - mization problem by two alternative steps: channels selec- 3. Approach - tion and feature map reconstruction. In one step, we figure In this section, we first propose a channel pruning al-out the most representative channels, and prune redundant gorithm for a single layer, then generalize this approach toones, based on LASSO regression. In the other step, we multiple layers or the whole model. Furthermore, we dis-reconstruct the outputs with remaining channels with linear cuss variants of our approach for multi-branch networks.least squares. We alternatively take two steps. Further, we - approximate the network layer-by-layer, with accumulated 3.1. Formulation - error accounted. We also discuss methodologies to prune - multi-branch networks (e.g., ResNet [18], Xception [7]). Fig.2illustrates our channel pruning algorithm for a sin- - For VGG-16, we achieve4×acceleration, with only gle convolutional layer. We aim to reduce the width of - 1.0%increase of top-5 error. Combined with tensor factor- feature map B, while maintaining outputs in feature map - ization, we reach5×acceleration but merely suffer0.3% C. 
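For reference, the layer-wise objective that the following formulation develops can be written compactly (a reconstruction from the definitions given in this section, using the paper's notation):

    \min_{\beta,\,W}\ \frac{1}{2N}\Big\| Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \Big\|_F^2
    \quad \text{subject to} \quad \|\beta\|_0 \le c'

where Y is the N x n output matrix, X_i the N x k_h k_w slice of the sampled input volumes for channel i, W_i the corresponding n x k_h k_w filter slice, and beta the length-c channel-selection vector. Relaxing the l0 constraint to an l1 penalty lambda * ||beta||_1 (with ||W_i||_F = 1) gives the LASSO-based selection problem solved in the two-step procedure below.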
Once channels are pruned, we can remove correspond- - increase of error, which outperforms previous state-of-the- ing channels of the filters that take these channels as in- - arts. We further speed up ResNet-50 and Xception-50 by put. Also, filters that produce these channels can also be - 2×with only1.4%, 1.0%accuracy loss respectively. removed. It is clear that channel pruning involves two key - points. The first is channel selection, since we need to select - 2. Related Work most representative channels to maintain as much informa- - tion. The second is reconstruction. We need to reconstruct - There has been a significant amount of work on acceler- the following feature maps using the selected channels. - ating CNNs. Many of them fall into three categories: opti- Motivated by this, we propose an iterative two-step al- - mized implementation [4], quantization [40], and structured gorithm. In one step, we aim to select most representative - simplification [22]. channels. Since an exhaustive search is infeasible even for - Optimized implementation based methods [35,47,27,4] tiny networks, we come up with a LASSO regression based - accelerate convolution, with special convolution algorithms method to figure out representative channels and prune re- - like FFT [47]. Quantization [8,40] reduces floating point dundant ones. In the other step, we reconstruct the outputs - computational complexity. with remaining channels with linear least squares. We alter- - Sparse connection eliminates connections between neu- natively take two steps. - rons [17,32,29,15,14]. [51] prunes connections based on Formally, to prune a feature map withcchannels, we - weights magnitude. [16] could accelerate fully connected consider applyingn×c×kh ×kw convolutional filtersWon - layers up to50×. However, in practice, the actual speed-up N×c×kh ×kw input volumesXsampled from this feature - maybe very related to implementation. map, which producesN×noutput matrixY. Here,Nis - Tensor factorization [22,28,13,24] decompose weights the number of samples,nis the number of output channels, - into several pieces. [50,10,12] accelerate fully connected andkh ,k w are the kernel size. For simple representation, - layers with truncated SVD. [52] factorize a layer into3×3 bias term is not included in our formulation. To prune the - and1×1combination, driven by feature map redundancy. input channels fromcto desiredc′ (0≤c′ ≤c), while - Channel pruning removes redundant channels on feature minimizing reconstruction error, we formulate our problem - - - - 1390 as follow: penalty, andβ =c. We gradually increaseλ. For each 0 change ofλ, we iterate these two steps untilβ is stable. - 1 2 0 c Afterβ ≤c′ satisfies, we obtain the final solutionWarg min Y− β 0i Xi W⊤ i from{ββ,W 2N (1) i Wi }. In practice, we found that the two steps it- i=1 F eration is time consuming. So we apply (i) multiple times,subject toβ ≤c′ - 0 untilβ ≤c′ satisfies. Then apply (ii) just once, to obtain 0 - · is Frobenius norm.X the final result. From our observation, this result is compa- - F i isN×kh kw matrix sliced - fromith channel of input volumesX,i= 1,...,c.W rable with two steps iteration’s. Therefore, in the following i is - n×k experiments, we adopt this approach for efficiency. h kw filter weights sliced fromith channel ofW.βis - coefficient vector of lengthcfor channel selection, andβ Discussion: Some recent works [48,1,17] (though train- i - isith entry ofβ. Notice that, ifβ ing based) also introduceℓ1 -norm or LASSO. 
However, we i = 0,Xi will be no longer - useful, which could be safely pruned from feature map.W must emphasis that we use different formulations. Many of i - could also be removed. them introduced sparsity regularization into training loss, - Optimization instead of explicitly solving LASSO. Other work [1] solved - Solving thisℓ LASSO, while feature maps or data were not considered 0 minimization problem in Eqn.1is NP-hard. - Therefore, we relax theℓ during optimization. Because of these differences, our ap- 0 toℓ1 regularization: proach could be applied at inference time. - 1 c 2 - arg min Y− β 3.2. Whole Model Pruning i Xi W⊤ - i +λβ1β,W 2N (2) i=1 F Inspired by [52], we apply our approach layer by layersubject toβ ≤c′ ,∀iW = 1 0 iF sequentially. For each layer, we obtain input volumes from - the current input feature map, and output volumes from theλis a penalty coefficient. By increasingλ, there will be output feature map of the un-pruned model. This could bemore zero terms inβand one can get higher speed-up ratio. formalized as:We also add a constrain∀iWi = 1to this formulation, F which avoids trivial solution. - Now we solve this problem in two folds. First, we fixW, 1 c 2 - arg min Y′ − βsolveβfor channel selection. Second, we fixβ, solveWto i Xi W⊤ i - β,W 2N (5) - reconstruct error. i=1 F - (i) The subproblem ofβ. In this case,Wis fixed. We subject toβ ≤c′ - 0 - solveβfor channel selection. This problem can be solved Different from Eqn.1,Yis replaced byY′ , which is fromby LASSO regression [46,5], which is widely used for feature map of the original model. Therefore, the accumu-model selection. lated error could be accounted during sequential pruning. 2 c βˆLASSO 1(λ) = argmin Y− β +λβ 3.3. Pruning Multi­Branch Networks - β 2N i Zi 1 - i=1 F The whole model pruning discussed above is enough for - subject toβ ≤c′ - 0 single-branch networks like LeNet [30], AlexNet [26] and(3) VGG Nets [43]. However, it is insufficient for multi-branch HereZi = X i W⊤ i (sizeN×n). We will ignoreith channels networks like GoogLeNet [ 45] and ResNet [18]. We mainlyifβi = 0. focus on pruning the widely used residual structure (e.g.,(ii) The subproblem ofW. In this case,βis fixed. We ResNet [18], Xception [7]). Given a residual block shownutilize the selected channels to minimize reconstruction er- in Fig.3(left), the input bifurcates into shortcut and residualror. We can find optimized solution by least squares: branch. On the residual branch, there are several convolu- - tional layers (e.g., 3 convolutional layers which have spatialarg minY−X′ (W ′ )⊤ 2 (4) F size of1×1,3×3,1×1, Fig.3, left). Other layers ex- W′ cept the first and last layer can be pruned as is described - HereX′ = [β1 X1 β2 X2 ... β i Xi ... β c Xc ](size previously. For the first layer, the challenge is that the large - N×ck h kw ). W′ isn×ck h kw reshapedW,W′ = input feature map width (for ResNet, 4 times of its output) - [W 1 W2 ...Wi ...Wc ]. After obtained resultW′ , it is re- can’t be easily pruned, since it’s shared with shortcut. For - shaped back toW. Then we assignβi ←βi Wi ,W the last layer, accumulated error from the shortcut is hard to F i ← - Wi /Wi . Constrain∀iW be recovered, since there’s no parameter on the shortcut. To F i = 1satisfies. F We alternatively optimize (i) and (ii). In the beginning, address these challenges, we propose several variants of our - Wis initialized from the trained model,λ= 0, namely no approach as follows. - - - - 1391 c ers, which need special library implementation support. 
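To make the two-step procedure above concrete, the following is a minimal sketch, not the authors' code: step (i) runs a LASSO over the per-channel responses Z_i = X_i W_i^T to select channels, and step (ii) solves least squares on the kept channels to reconstruct the outputs Y. The shapes, the single fixed lambda, and the top-k selection rule are simplifying assumptions (the paper instead increases lambda until the desired number of nonzero betas is reached).

    import numpy as np
    from sklearn.linear_model import Lasso

    def select_and_reconstruct(X, W, Y, c_keep, lam=1e-3):
        # X: (N, c, kh*kw) input slices, W: (c, n, kh*kw) filter slices, Y: (N, n)
        N, c, _ = X.shape
        Z = np.stack([X[:, i, :] @ W[i].T for i in range(c)], axis=2)  # (N, n, c)
        # (i) channel selection: LASSO on beta with Z flattened over samples/outputs
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(Z.reshape(N * Z.shape[1], c), Y.reshape(-1))
        keep = np.argsort(-np.abs(lasso.coef_))[:c_keep]          # kept channels
        # (ii) reconstruction: least squares for new filters on kept channels only
        X_keep = X[:, keep, :].reshape(N, -1)                     # (N, c'*kh*kw)
        W_new, *_ = np.linalg.lstsq(X_keep, Y, rcond=None)        # (c'*kh*kw, n)
        return keep, W_new.T.reshape(Y.shape[1], c_keep, -1)

This mirrors the alternation described above at a small scale; the paper applies it layer by layer with volumes sampled from real feature maps and scikit-learn as the solver.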
We - Input (c) sampled (c') 0 do not adopt it in the following experiments. c 0 0 - 0 - channel sampler - sampler 1x1,c c'0 4. Experiment 1 - c 1x1 1 relu c' 3x3,c 1 relu We evaluation our approach for the popular VGG Nets 2 - c 3x3 2 relu [43], ResNet [18], Xception [7] on ImageNet [9], CIFAR- c'1x1 2 relu 10 [25] and PASCAL VOC 2007 [11]. 1x1 For Batch Normalization [21], we first merge it into con- Y2 Y volutional weights, which do not affect the outputs of the Y+Y 1 - 1 2 networks. So that each convolutional layer is followed by - Figure 3. Illustration of multi-branch enhancement for residual ReLU [36]. We use Caffe [23] for deep network evalua- - block.Left: original residual block.Right: pruned residual block tion, and scikit-learn [38] for solvers implementation. For - with enhancement,cx denotes the feature map width. Input chan- channel pruning, we found that it is enough to extract 5000 nels of the first convolutional layer are sampled, so that the large images, and 10 samples per image. On ImageNet, we eval- input feature map width could be reduced. As for the last layer, uate the top-5 accuracy with single view. Images are re- rather than approximateY2 , we try to approximateY1 + Y 2 di- sized such that the shorter side is 256. The testing is on rectly (Sec.3.3Last layer of residual branch). center crop of224×224pixels. We could gain more per- - formance with fine-tuning. We use a batch size of 128 and - learning rate1e−5 . We fine-tune our pruned models for 10Last layer of residual branch: Shown in Fig.3, the epoches. The augmentation for fine-tuning is random cropoutput layer of a residual block consists of two inputs: fea- of224×224and mirror.ture mapY1 andY2 from the shortcut and residual branch. - We aim to recoverY1 + Y 2 for this block. Here,Y1 ,Y2 4.1. Experiments with VGG­16 are the original feature maps before pruning.Y2 could be - approximated as in Eqn.1. However, shortcut branch is VGG-16 [43] is a 16 layers single path convolutional - parameter-free, thenY neural network, with 13 convolutional layers. It is widely 1 could not be recovered directly. To - compensate this error, the optimization goal of the last layer used in recognition, detection and segmentation,etc. Single - is changed fromY view top-5 accuracy for VGG-16 is 89.9% 1 .2 toY1 −Y′ +Y, which does not change 1 2 - our optimization. Here,Y′ is the current feature map after 1 previous layers pruned. When pruning, volumes should be 4.1.1 Single Layer Pruning - sampled correspondingly from these two branches. In this subsection, we evaluate single layer acceleration per-First layer of residual branch: Illustrated in formance using our algorithm in Sec.3.1. For better under-Fig.3(left), the input feature map of the residual block standing, we compare our algorithm with two naive chan-could not be pruned, since it is also shared with the short- nel selection strategies.first kselects the firstkchannels.cut branch. In this condition, we could performfeature max responseselects channels based on corresponding fil-map samplingbefore the first convolution to save compu- ters that have high absolute weights sum [31]. For fair com-tation. We still apply our algorithm as Eqn.1. Differently, parison, we obtain the feature map indexes selected by eachwe sample the selected channels on the shared feature maps of them, then perform reconstruction (Sec. 3.1(ii)). We to construct a new input for the later convolution, shown hope that this could demonstrate the importance of channelin Fig.3(right). 
Computational cost for this operation could selection. Performance is measured by increase of error af-be ignored. More importantly, after introducingfeature map ter a certain layer is pruned without fine-tuning, shown insampling, the convolution is still ”regular”. Fig.4.Filter-wise pruningis another option for the first con- As expected, error increases as speed-up ratio increases.volution on the residual branch. Since the input channels Our approach is consistently better than other approaches inof parameter-free shortcut branch could not be pruned, we different convolutional layers under different speed-up ra-apply our Eqn.1to each filter independently (each fil- tio. Unexpectedly, sometimesmax responseis even worseter chooses its own representative input channels). Under thanfirst k. We argue thatmax responseignores correla-single layer acceleration,filter-wise pruningis more accu- tions between different filters. Filters with large absoluterate than our original one. From our experiments, it im- weight may have strong correlation. Thus selection based proves 0.5% top-5 accuracy for2×ResNet-50 (applied on on filter weights is less meaningful. Correlation on featurethe first layer of each residual branch) without fine-tuning. maps is worth exploiting. We can find that channel selectionHowever, after fine-tuning, there’s no noticeable improve- - ment. In addition, it outputs ”irregular” convolutional lay- 1 http://www.vlfeat.org/matconvnet/pretrained/ - - - - 1392 conv1_1 conv2_1 conv3_1 5 - first k first k first k - max response max response max response 4 ours ours ours - - - - - - increase of error (%) 3 - - 2 - - 1 - - 0 - - conv3_2 conv4_1 conv4_2 5 - first k first k first k - max response max response max response 4 ours ours ours - - - - - - increase of error (%) 3 - - 2 - - 1 - - 01.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 - speed-up ratio speed-up ratio speed-up ratio - Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify - the importance of channel selection refered in Sec.3.1, we considered two naive baselines.first kselects the firstkfeature maps.max - responseselects channels based on absolute sum of corresponding weights filter [31]. Our approach is consistently better (smaller is - better). - - - Increase of top-5 error (1-view, baseline 89.9%) periments above, we pruning more aggressive for shal- - Solution 2× 4× 5× lower layers. Remaining channels ratios for shallow lay- - Jaderberget al. [22] ([52]’s impl.) - 9.7 29.7 ers (conv1_xtoconv3_x) and deep layers (conv4_x) - Asym. [52] 0.28 3.84 - is1 : 1.5.conv5_xare not pruned, since they only con- - Filter pruning [31] tribute 9% computation in total and are not redundant.0.8 8.6 14.6(fine-tuned, our impl.) After fine-tuning, we could reach2×speed-up without - Ours (without fine-tune) 2.7 7.9 22.0 losing accuracy. Under4×, we only suffers 1.0% drops. - Ours (fine-tuned) 0 1.0 1.7 Consistent with single layer analysis, our approach outper- - Table 1. Accelerating the VGG-16 model [43] using a speedup forms previous channel pruning approach (Liet al. [31]) by - ratio of2×,4×, or5×(smaller is better). large margin. This is because we fully exploits channel re- - dundancy within feature maps. Compared with tensor fac- - affects reconstruction error a lot. Therefore, it is important torization algorithms, our approach is better than Jaderberg - for channel pruning. et al. [22], without fine-tuning. 
Though worse than Asym. - Also notice that channel pruning gradually becomes [52], our combined model outperforms its combined Asym. - hard, from shallower to deeper layers. It indicates that shal- 3D (Table2). This may indicate that channel pruning is - lower layers have much more redundancy, which is consis- more challenging than tensor factorization, since removing - tent with [52]. We could prune more aggressively on shal- channels in one layer might dramatically change the input - lower layers in whole model acceleration. of the following layer. However, channel pruning keeps the - original model architecture, do not introduce additional lay- - ers, and the absolute speed-up ratio on GPU is much higher4.1.2 Whole Model Pruning (Table 3). - Shown in Table1, whole model acceleration results under Since our approach exploits a new cardinality, we further - 2×,4×,5×are demonstrated. We adopt whole model combine our channel pruning with spatial factorization [22] - pruning proposed in Sec.3.2. Guided by single layer ex- and channel factorization [52]. Demonstrated in Table2, - - - - 1393 Increase of top-5 error (1-view, 89.9%) scratch. This coincides with architecture design researches - Solution 4× 5× [20,1] that the model could be easier to train if there are - Asym. 3D [52] 0.9 2.0 more channels in shallower layers. However, channel prun- - Asym. 3D (fine-tuned) [52] 0.3 1.0 ing favors shallower layers. - Our 3C 0.7 1.3 For from scratch (uniformed), the filters in each layers - Our 3C (fine-tuned) 0.0 0.3 is reduced by half (eg. reduceconv1_1from 64 to 32). - Table 2. Performance of combined methods on the VGG-16 model We can observe that normal setting networks of the same - [43] using a speed-up ratio of4×or5×. Our 3C solution outper- complexity couldn’t reach same accuracy either. This con- - forms previous approaches (smaller is better). solidates our idea that there’s much redundancy in networks - while training. However, redundancy can be opt out at - inference-time. This maybe an advantage of inference-timeour 3 cardinalities acceleration (spatial, channel factoriza- acceleration approaches over training-based approaches.tion, and channel pruning, denoted by 3C) outperforms pre- Notice that there’s a 0.6% gap between the from scratchvious state-of-the-arts. Asym. 3D [52] (spatial and chan- model and uniformed one, which indicates that there’s roomnel factorization), factorizes a convolutional layer to three for model exploration. Adopting our approach is muchparts:1×3,3×1,1×1. faster than training a model from scratch, even for a thin-We apply spatial factorization, channel factorization, and ner one. Further researches could alleviate our approach to our channel pruning together sequentially layer-by-layer. do thin model exploring.We fine-tune the accelerated models for 20 epoches, since - they are 3 times deeper than the original ones. After fine- - tuning, our4×model suffers no degradation. Clearly, a 4.1.5 Acceleration for Detection - combination of different acceleration techniques is better VGG-16 is popular among object detection tasks [42,41,than any single one. This indicates that a model is redun- 33]. We evaluate transfer learning ability of our2×/4×dant in each cardinality. pruned VGG-16, for Faster R-CNN [42] object detections. - PASCAL VOC 2007 object detection benchmark [11] con- - 4.1.3 Comparisons of Absolute Performance tains 5k trainval images and 5k test images. 
The per- - formance is evaluated by mean Average Precision (mAP).We further evaluate absolute performance of acceleration In our experiments, we first perform channel pruning foron GPU. Results in Table3are obtained under Caffe [23], VGG-16 on the ImageNet. Then we use the pruned modelCUDA8 [37] and cuDNN5 [6], with a mini-batch of 32 as the pre-trained model for Faster R-CNN.on a GPU (GeForce GTX TITAN X). Results are averaged The actual running time of Faster R-CNN is 220ms / im-from 50 runs. Tensor factorization approaches decompose age. The convolutional layers contributes about 64%. Weweights into too many pieces, which heavily increase over- got actual time of 94ms for4×acceleration. From Table5,head. They could not gain much absolute speed-up. Though we observe 0.4% mAP drops of our2×model, which is notour approach also encountered performance decadence, it harmful for practice consideration.generalizes better on GPU than other approaches. Our re- - sults for tensor factorization differ from previous research 4.2. Experiments with Residual Architecture Nets - [52,22], maybe because current library and hardware pre- For Multi-path networks [45,18,7], we further explorefer single large convolution instead of several small ones. the popular ResNet [18] and latest Xception [7], on Ima- - geNet and CIFAR-10. Pruning residual architecture nets is - 4.1.4 Comparisons with Training from Scratch more challenging. These networks are designed for both ef- - ficiency and high accuracy. Tensor factorization algorithmsThough training a compact model from scratch is time- [52,22] have difficult to accelerate these model. Spatially,consuming (usually 120 epoches), it worths comparing our 1×1convolution is favored, which could hardly be factor-approach and from scratch counterparts. To be fair, we eval- ized.uated both from scratch counterpart, and normal setting net- - work that has the same computational complexity and same 4.2.1 ResNet Pruningarchitecture. - Shown in Table4, we observed that it’s difficult for ResNet complexity uniformly drops on each residual block. - from scratch counterparts to reach competitive accuracy. Guided by single layer experiments (Sec. 4.1.1), we still - our model outperforms from scratch one. Our approach prefer reducing shallower layers heavier than deeper ones. - successfully picks out informative channels and constructs Following similar setting as Filter pruning [31], we - highly compact models. We can safely draw the conclu- keep 70% channels for sensitive residual blocks (res5 - sion that the same model is difficult to be obtained from and blocks close to the position where spatial size - - - - 1394 Model Solution Increased err. GPU time/ms - VGG-16 - 0 8.144 - Jaderberget al. [22] ([52]’s impl.) 9.7 8.051(1.01×) - Asym. [52] 3.8 5.244(1.55×) - VGG-16 (4×) Asym. 3D [52] 0.9 8.503(0.96×) - Asym. 3D (fine-tuned) [52] 0.3 8.503(0.96×) - Ours (fine-tuned) 1.0 3.264 (2.50×) - Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is - better). - - - Original (acc. 89.9%) Top-5 err. Increased err. Solution Increased err. - From scratch 11.9 1.8 Filter pruning [31] (our impl.) 92.8 - From scratch (uniformed) 12.5 2.4 Filter pruning [31] 4.3Ours 18.0 7.9 (fine-tuned, our impl.) - Ours (fine-tuned) 11.1 1.0 Ours 2.9 - Table 4. Comparisons with training from scratch, under4×accel- Ours (fine-tuned) 1.0 - eration. Our fine-tuned model outperforms scratch trained coun- Table 7. 
Comparisons for Xception-50, under2×acceleration ra- - terparts (smaller is better). tio. The baseline network’s top-5 accuracy is 92.8%. Our ap- - proach outperforms previous approaches. Most structured sim- - plification methods are not effective on Xception architecture - Speedup mAP ∆mAP (smaller is better). - Baseline 68.7 - - 2× 68.3 0.4 - 4× 66.9 1.8 4.2.2 Xception Pruning - Table 5.2×,4×acceleration for Faster R-CNN detection. - Since computational complexity becomes important in - model design, separable convolution has been payed muchSolution Increased err. attention [49,7]. Xception [7] is already spatially optimizedOurs 8.0 and tensor factorization on1×1convolutional layer is de-Ours 4.0 structive. Thanks to our approach, it could still be acceler-(enhanced) ated with graceful degradation. For the ease of comparison,Ours 1.4 we adopt Xception convolution on ResNet-50, denoted by(enhanced, fine-tuned) Xception-50. Based on ResNet-50, we swap all convolu- Table 6.2×acceleration for ResNet-50 on ImageNet, the base- tional layers with spatial conv blocks. To keep the same line network’s top-5 accuracy is 92.2% (one view). We improve computational complexity, we increase the input channels performance with multi-branch enhancement (Sec.3.3,smaller is of allbranch2blayers by2×. The baseline Xception- better). 50 has a top-5 accuracy of 92.8% and complexity of 4450 - MFLOPs. - We apply multi-branch variants of our approach as de-change, e.g. res3a,res3d). As for other blocks, scribed in Sec.3.3, and adopt the same pruning ratio settingwe keep 30% channels. With multi-branch enhance- as ResNet in previous section. Maybe because of Xcep-ment, we prunebranch2amore aggressively within tion block is unstable, Batch Normalization layers must beeach residual block. The remaining channels ratios for maintained during pruning. Otherwise it becomes nontrivialbranch2a,branch2b,branch2cis2 : 4 : 3(e.g., to fine-tune the pruned model.Given 30%, we keep 40%, 80%, 60% respectively). Shown in Table7, after fine-tuning, we only suffer1.0% - We evaluate performance of multi-branch variants of our increase of error under2×. Filter pruning [31] could also - approach (Sec. 3.3). From Table6, we improve 4.0% apply on Xception, though it is designed for small speed- - with our multi-branch enhancement. This is because we up ratio. Without fine-tuning, top-5 error is 100%. After - accounted the accumulated error from shortcut connection training 20 epochs which is like training from scratch, in- - which could broadcast to every layer after it. And the large creased error reach 4.3%. Our results for Xception-50 are - input feature map width at the entry of each residual block not as graceful as results for VGG-16, since modern net- - is well reduced by ourfeature map sampling. works tend to have less redundancy by design. - - - - 1395 Solution Increased err. [4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. Lcnn: - Filter pruning [31] Lookup-based convolutional neural network.arXiv preprint 1.3(fine-tuned, our impl.) arXiv:1611.06473, 2016.2 - From scratch 1.9 [5] L. Breiman. Better subset regression using the nonnegative - Ours 2.0 garrote.Technometrics, 37(4):373–384, 1995.3 - Ours (fine-tuned) 1.0 [6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, - Table 8.2×speed-up comparisons for ResNet-56 on CIFAR-10, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives - the baseline accuracy is 92.8% (one view). 
We outperforms previ- for deep learning.CoRR, abs/1410.0759, 2014.6 - ous approaches and scratch trained counterpart (smaller is better). [7] F. Chollet. Xception: Deep learning with depthwise separa- - ble convolutions.arXiv preprint arXiv:1610.02357, 2016. 1, - 2,3,4,6,7 - 4.2.3 Experiments on CIFAR-10 [8] M. Courbariaux and Y. Bengio. Binarynet: Training deep - neural networks with weights and activations constrained to+ - Even though our approach is designed for large datasets, it 1 or-1.arXiv preprint arXiv:1602.02830, 2016.1,2 - could generalize well on small datasets. We perform ex- [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- - periments on CIFAR-10 dataset [25], which is favored by Fei. Imagenet: A large-scale hierarchical image database. - many acceleration researches. It consists of 50k images for InComputer Vision and Pattern Recognition, 2009. CVPR - training and 10k for testing in 10 classes. 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 4 - We reproduce ResNet-56, which has accuracy of 92.8% [10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer- - (Serve as a reference, the official ResNet-56 [18] has ac- gus. Exploiting linear structure within convolutional net- - curacy of 93.0%). For2×acceleration, we follow similar works for efficient evaluation. InAdvances in Neural In- - formation Processing Systems, pages 1269–1277, 2014.2 setting as Sec.4.2.1(keep the final stage unchanged, where - the spatial size is8×8). Shown in Table8, our approach [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, - and A. Zisserman. The PASCAL Visual Object Classes is competitive with scratch trained one, without fine-tuning, Challenge 2007 (VOC2007) Results. http://www.pascal- under2×speed-up. After fine-tuning, our result is signif- network.org/challenges/VOC/voc2007/workshop/index.html. icantly better than Filter pruning [31] and scratch trained 4,6 - one. [12] R. Girshick. Fast r-cnn. InProceedings of the IEEE Inter- - national Conference on Computer Vision, pages 1440–1448, - 5. Conclusion 2015.2 - [13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compress- - To conclude, current deep CNNs are accurate with high ing deep convolutional networks using vector quantization. - inference costs. In this paper, we have presented an arXiv preprint arXiv:1412.6115, 2014.2 - inference-time channel pruning method for very deep net- [14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for - works. The reduced CNNs are inference efficient networks efficient dnns. InAdvances In Neural Information Process- - while maintaining accuracy, and only require off-the-shelf ing Systems, pages 1379–1387, 2016.2 - libraries. Compelling speed-ups and accuracy are demon- [15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, - strated for both VGG Net and ResNet-like networks on Im- and W. J. Dally. Eie: efficient inference engine on com- - ageNet, CIFAR-10 and PASCAL VOC. pressed deep neural network. InProceedings of the 43rd - International Symposium on Computer Architecture, pages In the future, we plan to involve our approaches into 243–254. IEEE Press, 2016. 2 training time, instead of inference time only, which may [16] S. Han, H. Mao, and W. J. Dally. Deep compression: Com- also accelerate training procedure. pressing deep neural network with pruning, trained quantiza- - tion and huffman coding.CoRR, abs/1510.00149, 2, 2015. - References 2 - [17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights - [1] J. M. Alvarez and M. Salzmann. 
Learning the number of and connections for efficient neural network. InAdvances in - neurons in deep networks. InAdvances in Neural Informa- Neural Information Processing Systems, pages 1135–1143, - tion Processing Systems, pages 2262–2270, 2016. 1,2,3, 2015.1,2,3 - 6 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- - [2] S. Anwar, K. Hwang, and W. Sung. Structured prun- ing for image recognition.arXiv preprint arXiv:1512.03385, - ing of deep convolutional neural networks. arXiv preprint 2015. 1,2,3,4,6,8 - arXiv:1512.08571, 2015.2 [19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trim- - [3] S. Anwar and W. Sung. Compact deep convolutional ming: A data-driven neuron pruning approach towards effi- - neural networks with coarse pruning. arXiv preprint cient deep architectures. arXiv preprint arXiv:1607.03250, - arXiv:1610.09639, 2016.1,2 2016.2 - - - - - 1396 [20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, [38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, - A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, - Speed/accuracy trade-offs for modern convolutional object V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, - detectors.arXiv preprint arXiv:1611.10012, 2016. 6 M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma- - [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating chine learning in Python.Journal of Machine Learning Re- - deep network training by reducing internal covariate shift. search, 12:2825–2830, 2011.4 - arXiv preprint arXiv:1502.03167, 2015.4 [39] A. Polyak and L. Wolf. Channel-level acceleration of deep - [22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up face representations.IEEE Access, 3:2163–2175, 2015.2 - convolutional neural networks with low rank expansions. [40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor- - arXiv preprint arXiv:1405.3866, 2014.1,2,5,6,7 net: Imagenet classification using binary convolutional neu- - [23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- ral networks. InEuropean Conference on Computer Vision, - shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- pages 525–542. Springer, 2016. 2 - tional architecture for fast feature embedding.arXiv preprint [41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. arXiv:1408.5093, 2014. 4,6 You only look once: Unified, real-time object detection. - [24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. CoRR, abs/1506.02640, 2015. 6 - Compression of deep convolutional neural networks for [42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN:fast and low power mobile applications. arXiv preprint towards real-time object detection with region proposal net-arXiv:1511.06530, 2015.2 works.CoRR, abs/1506.01497, 2015.6 [25] A. Krizhevsky and G. Hinton. Learning multiple layers of [43] K. Simonyan and A. Zisserman. Very deep convolutionalfeatures from tiny images. 2009.4,8 networks for large-scale image recognition. arXiv preprint[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet arXiv:1409.1556, 2014.3,4,5,6classification with deep convolutional neural networks. In [44] S. Srinivas and R. V. Babu. Data-free parameter pruningAdvances in neural information processing systems, pages for deep neural networks.arXiv preprint arXiv:1507.06149,1097–1105, 2012.2,3 2015.2[27] A. Lavin. Fast algorithms for convolutional neural networks. [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. 
Reed,arXiv preprint arXiv:1509.09308, 2015.2 D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and Going deeper with convolutions. InProceedings of the IEEEV. Lempitsky. Speeding-up convolutional neural net- Conference on Computer Vision and Pattern Recognition,works using fine-tuned cp-decomposition. arXiv preprint pages 1–9, 2015.1,3,6arXiv:1412.6553, 2014.2 [46] R. Tibshirani. Regression shrinkage and selection via the[29] V. Lebedev and V. Lempitsky. Fast convnets using group- lasso. Journal of the Royal Statistical Society. Series Bwise brain damage.arXiv preprint arXiv:1506.02515, 2015. (Methodological), pages 267–288, 1996.32 [47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Pi-[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient- antino, and Y. LeCun. Fast convolutional nets withbased learning applied to document recognition. Proceed- fbfft: A gpu performance evaluation. arXiv preprintings of the IEEE, 86(11):2278–2324, 1998.2,3 arXiv:1412.7580, 2014.1,2[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. - Graf. Pruning filters for efficient convnets. arXiv preprint [48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning - arXiv:1608.08710, 2016.1,2,4,5,6,7,8 structured sparsity in deep neural networks. InAdvances In - [32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Neural Information Processing Systems, pages 2074–2082, - Sparse convolutional neural networks. InProceedings of the 2016.1,2,3 - IEEE Conference on Computer Vision and Pattern Recogni- [49] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated´ - tion, pages 806–814, 2015.2 residual transformations for deep neural networks. arXiv - [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, preprint arXiv:1611.05431, 2016.7 - C. Fu, and A. C. Berg. SSD: single shot multibox detector. [50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural - CoRR, abs/1512.02325, 2015.6 network acoustic models with singular value decomposition. - [34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint InINTERSPEECH, pages 2365–2369, 2013.2 - arXiv:1511.05077, 2015.2 [51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy- - [35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training efficient convolutional neural networks using energy-aware - of convolutional networks through ffts. arXiv preprint pruning.arXiv preprint arXiv:1611.05128, 2016.2 - arXiv:1312.5851, 2013.2 [52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very - [36] V. Nair and G. E. Hinton. Rectified linear units improve deep convolutional networks for classification and detection. - restricted boltzmann machines. InProceedings of the 27th IEEE transactions on pattern analysis and machine intelli- - international conference on machine learning (ICML-10), gence, 38(10):1943–1955, 2016.1,2,3,5,6,7 - pages 807–814, 2010.4 - [37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. 
Scalable - parallel programming with CUDA.ACM Queue, 6(2):40–53, - 2008.6 - - - - - 1397 \ No newline at end of file diff --git a/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt b/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt deleted file mode 100644 index a4ec71b..0000000 Binary files a/Corpus/DEEP COMPRESSION_ COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING.txt and /dev/null differ diff --git a/Corpus/convex-neural-networks.txt b/Corpus/convex-neural-networks.txt deleted file mode 100644 index 9097e3b..0000000 Binary files a/Corpus/convex-neural-networks.txt and /dev/null differ