Revised documents for corpus

This commit is contained in:
Eduardo Cueto Mendoza 2020-08-06 20:01:26 -06:00
parent 514f272a6d
commit 8b5f469305
8 changed files with 4603 additions and 2350 deletions


@ -1,555 +0,0 @@
A Survey of Model Compression and Acceleration
for Deep Neural Networks
Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
arXiv:1710.09282v7 [cs.LG] 7 Feb 2019
Abstract—Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey the recently developed techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, and the other techniques are introduced afterwards. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then go through a few very recent, additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating the model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions on this topic.

Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration.

Yu Cheng is a Researcher at Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA. Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China. Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

I. INTRODUCTION

In recent years, deep neural networks have received lots of attention, been applied to many different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. It usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. As another example, the top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model to get reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications such as online learning and incremental learning. In addition, recent years witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle fundamental challenges in deploying deep learning systems to portable devices with limited resources (e.g., memory, CPU, energy, bandwidth). Efficient deep learning methods can have significant impacts on distributed systems, embedded devices, and FPGAs for Artificial Intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95MB of memory for storage and over 3.8 billion floating-point multiplications to process an image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computational time. For devices like cell phones and FPGAs with only several megabytes of resources, how to compact the models used on them is also important.

Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent works on compressing and accelerating deep neural networks, which have attracted a lot of attention from the deep learning community and already achieved considerable progress in the past years.

We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of deep CNNs. The approaches based on transferred/compact convolutional filters design special structural convolutional filters to reduce the parameter space and save storage/computation. The knowledge distillation methods learn a distilled model, training a more compact neural network to reproduce the output of a larger network.

In Table I, we briefly summarize these four types of methods.
TABLE I
SUMMARIZATION OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION.

Theme Name                                | Description                                                                  | Applications                                   | More details
Parameter pruning and sharing             | Reducing redundant parameters which are not sensitive to the performance     | Convolutional layer and fully connected layer  | Robust to various settings, can achieve good performance, can support both train from scratch and pre-trained model
Low-rank factorization                    | Using matrix/tensor decomposition to estimate the informative parameters     | Convolutional layer and fully connected layer  | Standardized pipeline, easily to be implemented, can support both train from scratch and pre-trained model
Transferred/compact convolutional filters | Designing special structural convolutional filters to save parameters        | Convolutional layer only                       | Algorithms are dependent on applications, usually achieve good performance, only support train from scratch
Knowledge distillation                    | Training a compact neural network with distilled knowledge of a large model  | Convolutional layer and fully connected layer  | Model performances are sensitive to applications and network structure, only support train from scratch
Generally, the parameter pruning & sharing, low-rank factorization and knowledge distillation approaches can be used in DNN models with fully connected layers and convolutional layers, achieving comparable performance. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filter based approaches provide an end-to-end pipeline and can be easily implemented in CPU/GPU environments, while parameter pruning & sharing uses different methods such as vector quantization, binary coding and sparse constraints to perform the task, and generally takes several steps to achieve the goal.

Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compressed model.

Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained ones or trained from scratch, while the transferred/compact filter and knowledge distillation models can only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We describe the details of each theme, along with their properties, strengths and drawbacks, in the following sections.

II. PARAMETER PRUNING AND SHARING

Early works showed that network pruning is effective in reducing the network complexity and addressing the over-fitting problem [6]. Pruning was originally introduced to reduce the structure of neural networks and hence improve generalization; it has since been widely studied to compress DNN models by removing parameters which are not crucial to the model performance. These techniques can be further classified into three sub-categories: quantization and binarization, parameter pruning and sharing, and structural matrix.

A. Quantization and Binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.

The method proposed in [10] quantized the link weights using weight sharing and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved state-of-the-art performance among all parameter quantization based methods. It was shown in [11] that Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize Hessian-weighted quantization errors on average when clustering network parameters.

In the extreme case of a 1-bit representation of each weight, that is, binary weight neural networks, there are many works that directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [15] showed that networks trained with back propagation could be resilient to specific weight distortions, including binary weights.

Drawbacks: the accuracy of binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of such binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.
To address this issue, the work in [16] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [17] reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes.
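As a concrete illustration of the scalar-quantization idea above, the following minimal sketch clusters a layer's weights with k-means and stores only a small codebook plus per-weight cluster indices, in the spirit of [6], [7] and the weight-sharing step of [10]. The cluster count, initialization and toy layer are illustrative assumptions, not details taken from any of the cited works.

```python
# Minimal k-means scalar quantization (weight sharing) sketch in plain NumPy.
import numpy as np

def kmeans_quantize(weights: np.ndarray, n_clusters: int = 16, n_iter: int = 20):
    """Cluster the flattened weights; return (codebook, per-weight cluster indices)."""
    flat = weights.ravel()
    # Initialize centroids linearly over the weight range (a common heuristic).
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = flat[idx == k].mean()
    return codebook, idx.reshape(weights.shape)

def dequantize(codebook: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Rebuild an approximate weight tensor from the shared codebook."""
    return codebook[indices]

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)
    codebook, idx = kmeans_quantize(w, n_clusters=16)   # 4-bit indices per weight
    w_hat = dequantize(codebook, idx)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

With 16 clusters each weight index needs only 4 bits instead of 32, so storage drops roughly 8x for this layer even before entropy coding such as Huffman coding is applied.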
B. Pruning and Sharing

Network pruning and sharing has been used both to reduce network complexity and to address the over-fitting issue. An early approach to pruning was the Biased Weight Decay [18]. The Optimal Brain Damage [19] and the Optimal Brain Surgeon [20] methods reduced the number of connections based on the Hessian of the loss function, and their work suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. Those methods were trained from scratch.

A recent trend in this direction is to prune redundant, non-informative weights in a pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among neurons, and proposed a data-free pruning method to remove redundant neurons. Han et al. [22] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [23] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [24], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re-)training procedure. The above pruning schemes typically produce connection pruning in CNNs.

There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are typically introduced in the optimization problem as l0- or l1-norm regularizers. The work in [25] imposed a group sparsity constraint on the convolutional filters to achieve structured Brain Damage, i.e., pruning entries of the convolution kernels in a group-wise fashion. In [26], a group-sparse regularizer on neurons was introduced during the training stage to learn compact CNNs with reduced filters. Wen et al. [27] added a structured sparsity regularizer on each layer to reduce trivial filters, channels or even layers. In the filter-level pruning, all the above works used l2,1-norm regularizers. The work in [28] used the l1-norm to select and prune unimportant filters.

Drawbacks: there are some potential issues with pruning and sharing. First, pruning with l1 or l2 regularization requires more iterations to converge than general training. In addition, all pruning criteria require manual setup of the sensitivity for each layer, which demands fine-tuning of the parameters and could be cumbersome for some applications.
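To make the prune-then-fine-tune recipe above concrete, here is a minimal magnitude-based pruning sketch in PyTorch: weights below a per-tensor percentile are zeroed, and a fixed mask keeps them at zero during fine-tuning. The sparsity level, the point at which the mask is reapplied and the toy model are illustrative assumptions, not details from [22] or the other cited works.

```python
import torch
import torch.nn as nn

def build_prune_masks(model: nn.Module, sparsity: float = 0.9):
    """Return {param_name: 0/1 mask} keeping only the largest-magnitude weights."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:  # prune weight matrices/kernels, leave biases dense
            threshold = torch.quantile(p.detach().abs().flatten(), sparsity)
            masks[name] = (p.detach().abs() > threshold).float()
    return masks

def apply_masks(model: nn.Module, masks: dict):
    """Zero out pruned connections (call after every optimizer step)."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

# Usage: prune a (pre-trained) model, then fine-tune the surviving connections.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = build_prune_masks(model, sparsity=0.9)
apply_masks(model, masks)
# inside the fine-tuning loop: loss.backward(); optimizer.step(); apply_masks(model, masks)
```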
C. Designing Structural Matrix

In architectures that contain fully-connected layers, it is critical to explore the redundancy of parameters in those layers, which are often the bottleneck in terms of memory consumption. These network layers use the nonlinear transform f(x; M) = σ(Mx), where σ(·) is an element-wise nonlinear operator, x is the input vector, and M is the m × n matrix of parameters [29]. When M is a large general dense matrix, the cost of storing mn parameters and computing matrix-vector products in O(mn) time is high. Thus, an intuitive way to prune parameters is to parameterize M as a structured matrix. An m × n matrix that can be described using far fewer than mn parameters is called a structured matrix. Typically, the structure should not only reduce the memory cost, but also dramatically accelerate the inference and training stages via fast matrix-vector multiplication and gradient computations.

Following this direction, the work in [30], [31] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector r = (r_0, r_1, ..., r_{d-1}), a circulant matrix R ∈ R^{d×d} is defined as

R = circ(r) := \begin{bmatrix} r_0 & r_{d-1} & \cdots & r_2 & r_1 \\ r_1 & r_0 & r_{d-1} & & r_2 \\ \vdots & r_1 & r_0 & \ddots & \vdots \\ r_{d-2} & & \ddots & \ddots & r_{d-1} \\ r_{d-1} & r_{d-2} & \cdots & r_1 & r_0 \end{bmatrix},   (1)

so the memory cost becomes O(d) instead of O(d^2). The circulant structure also enables the use of the Fast Fourier Transform (FFT) to speed up the computation: given a d-dimensional vector r, the 1-layer circulant neural network in Eq. (1) has time complexity O(d log d).

In [32], a novel Adaptive Fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform matrix R ∈ R^{n×d} was defined as

R = S H G Π H B,   (2)

where S, G and B are random diagonal matrices, Π ∈ {0, 1}^{d×d} is a random permutation matrix, and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the Adaptive Fastfood transform reduces the storage and the computational costs from O(nd) to O(n) and from O(nd) to O(n log d), respectively.

The work in [29] showed the effectiveness of the new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multi-level Toeplitz-like [33] matrices related to multi-dimensional convolution [34]. Following this idea, [35] proposed a general structured efficient linear layer for CNNs.

Drawbacks: one problem with this kind of approach is that the structural constraint can hurt the performance, since the constraint might introduce bias into the model. On the other hand, finding a proper structural matrix is difficult: there is no theoretical way to derive one.
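The circulant projection of Eq. (1) can be made concrete with a few lines of PyTorch: only the defining vector r is stored, and the product circ(r)·x is computed as a circular convolution via the FFT in O(d log d). This is a minimal sketch under our own naming and initialization choices (the sign flips, bias and random permutation used in [30] are omitted), not the authors' implementation.

```python
import torch
import torch.nn as nn

class CirculantLinear(nn.Module):
    """d-dimensional layer whose weight matrix is circ(r); only r is stored."""
    def __init__(self, d: int):
        super().__init__()
        self.r = nn.Parameter(torch.randn(d) / d ** 0.5)  # defining vector of circ(r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (circ(r) x)_i = sum_j r_{(i-j) mod d} x_j is a circular convolution,
        # so it can be evaluated with FFTs in O(d log d) instead of O(d^2).
        R = torch.fft.rfft(self.r)
        X = torch.fft.rfft(x, dim=-1)
        return torch.fft.irfft(R * X, n=self.r.numel(), dim=-1)

# Sanity check against the explicit dense circulant matrix for a small d.
d = 8
layer = CirculantLinear(d)
x = torch.randn(d)
idx = (torch.arange(d).unsqueeze(1) - torch.arange(d).unsqueeze(0)) % d
dense = layer.r[idx]                       # dense[i, j] = r_{(i-j) mod d}, as in Eq. (1)
print(torch.allclose(layer(x), dense @ x, atol=1e-5))
```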
III. LOW-RANK FACTORIZATION AND SPARSITY
Fig. 2. A typical framework of the low-rank regularization method. The left is the original convolutional layer and the right is the low-rank constrained convolutional layer with rank K.

TABLE II
COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES ON ILSVRC-2012.

Model       | TOP-5 Accuracy | Speed-up | Compression Rate
AlexNet     | 80.03%         | 1.       | 1.
BN Low-rank | 80.56%         | 1.09     | 4.94
CP Low-rank | 79.66%         | 1.82     | 5.
VGG-16      | 90.60%         | 1.       | 1.
BN Low-rank | 90.47%         | 1.53     | 2.72
CP Low-rank | 90.31%         | 2.05     | 2.75
GoogleNet   | 92.21%         | 1.       | 1.
BN Low-rank | 91.88%         | 1.08     | 2.79
CP Low-rank | 91.79%         | 1.20     | 2.84
Convolution operations contribute the bulk of most computations in deep CNNs, so reducing the convolution layers would improve the compression rate as well as the overall speedup. A convolution kernel can be viewed as a 4D tensor. Ideas based on tensor decomposition are driven by the intuition that there is a significant amount of redundancy in the 4D tensor, which makes decomposition a particularly promising way to remove the redundancy. Regarding the fully-connected layers, they can be viewed as 2D matrices, and low-rankness can also help there.

Low-rank filters have long been used to accelerate convolution; for example, high-dimensional DCT (discrete cosine transform) and wavelet systems have been constructed from 1D DCT transforms and 1D wavelets, respectively, using tensor products. Learning separable 1D filters was introduced by Rigamonti et al. [36], following the dictionary learning idea. Regarding some simple DNN models, a few low-rank approximation and clustering schemes for the convolutional kernels were proposed in [37]. They achieved a 2x speedup for a single convolutional layer with a 1% drop in classification accuracy. The work in [38] proposed using different tensor decomposition schemes, reporting a 4.5x speedup with a 1% drop in accuracy in text recognition.

The low-rank approximation was done layer by layer. The parameters of one layer were fixed after it was done, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2D convolutional layers, as described in Figure 2. Following this direction, Canonical Polyadic (CP) decomposition of the kernel tensors was proposed in [39]; that work used nonlinear least squares to compute the CP decomposition. In [40], a new algorithm for computing the low-rank tensor decomposition for training low-rank constrained CNNs from scratch was proposed. It used Batch Normalization (BN) to transform the activations of the internal hidden units. In general, both the CP and the BN decomposition schemes in [40] (BN Low-rank) can be used to train CNNs from scratch. However, there are a few differences between them. For example, finding the best low-rank approximation in CP decomposition is an ill-posed problem, and the best rank-K (K is the rank number) approximation may not always exist, while for the BN scheme the decomposition always exists. We perform a simple comparison of both methods in Table II, using the actual speedup and the compression rates to measure their performance.

As we mentioned before, the fully connected layers can be viewed as 2D matrices and thus the above mentioned methods can also be applied there. There are several classical works on exploiting low-rankness in fully connected layers. For instance, Denil et al. [41] reduced the number of dynamic parameters in deep models using the low-rank method. [42] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling. In [3], Lu et al. adopted truncated SVD (singular value decomposition) to decompose the fully connected layer for designing compact multi-task deep learning architectures.

Drawbacks: low-rank approaches are straightforward for model compression and acceleration. The idea complements recent advances in deep learning, such as dropout, rectified units and maxout. However, the implementation is not that easy since it involves a decomposition operation, which is computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer, and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when compared to the original model.
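As a concrete example of the fully-connected-layer case, the sketch below applies truncated SVD to a linear layer and replaces it with two thinner layers, reducing the parameter count from nd to k(n + d). The rank and the toy layer sizes are illustrative assumptions; the fine-tuning that, as noted above, is usually required afterwards is not shown.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear layer by two low-rank Linear layers via truncated SVD."""
    W = layer.weight.data                        # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :rank] * S[:rank]                 # (out, k), singular values folded in
    V_k = Vh[:rank, :]                           # (k, in)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_k)
    second.weight.data.copy_(U_k)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
compressed = factorize_linear(fc, rank=256)      # roughly 8x fewer parameters in this layer
x = torch.randn(1, 4096)
print((fc(x) - compressed(x)).abs().max())       # approximation error before fine-tuning
```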
IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS

CNNs are parameter-efficient because they exploit the translation-invariant property of the representations of the input image, which is the key to the success of training very deep models without severe over-fitting. Although a strong theory is currently missing, a large amount of empirical evidence supports the notion that both the translation-invariant property and convolutional weight sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent work in [43], which introduced equivariant group theory. Let x be an input, Φ(·) be a network or layer and T(·) be the transform matrix. The concept of equivariance is defined as

T'Φ(x) = Φ(Tx),   (3)

indicating that transforming the input x by the transform T(·) and then passing it through the network or layer Φ(·) should give the same result as first mapping x through the network and then transforming the representation. Note that in Eq. (3) the transforms T(·) and T'(·) are not necessarily the same, as they operate on different objects. According to this theory, it is reasonable to apply a transform to layers or filters Φ(·) to compress the whole network model. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters obtained by applying a certain transform T(·) to a
small set of base filters, since the transform acts as a regularizer for the model.

Following this direction, many recent works have proposed to build a convolutional layer from a set of base filters [43]-[46]. What they have in common is that the transform T(·) lies in the family of functions that only operate in the spatial domain of the convolutional filters. For example, the work in [45] found that the lower convolution layers of CNNs learn redundant filters to extract both positive and negative phase information of an input signal, and defined T(·) to be the simple negation function:

T(W_x) = W_x^-,   (4)

where W_x is the basis convolutional filter and W_x^- is the filter consisting of the shifts whose activation is opposite to that of W_x and selected after the max-pooling operation. By doing this, the work in [45] can easily achieve a 2x compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer that improves the classification accuracy. The intuition is that a learning algorithm with a pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones.

In [46], it was observed that the magnitudes of the responses from convolutional kernels had a wide diversity of pattern representations in the network, and that it was not proper to discard weaker signals with a single threshold. Thus a multi-bias non-linearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform T(·) was defined as

T'Φ(x) = Wx + δ,   (5)

where δ are the multi-bias factors. The work in [47] considered a combination of rotation by a multiple of 90° and horizontal/vertical flipping with

T'Φ(x) = W^{T_θ},   (6)

where W^{T_θ} is the transformation matrix which rotates the original filters by an angle θ ∈ {90°, 180°, 270°}. In [43], the transform was generalized to any angle learned from data, and θ was directly obtained from data. Both works [47] and [43] achieve good classification performance. The work in [44] defined T(·) as the set of translation functions applied to 2D filters:

T'Φ(x) = T(·, x, y),  x, y ∈ {-k, ..., k}, (x, y) ≠ (0, 0),   (7)

where T(·, x, y) denotes the translation of the first operand by (x, y) along its spatial dimensions, with proper zero padding at the borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy as a regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly varying their architectures to compress networks.

Table III briefly compares the performance of different methods with transferred convolutional filters, using VGGNet (16 layers) as the baseline model. The results are reported on the CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is observed that they can achieve a reduction in parameters with little or no drop in classification accuracy.

TABLE III
A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND CIFAR-100.

Model      | CIFAR-100 | CIFAR-10 | Compression Rate
VGG-16     | 34.26%    | 9.85%    | 1.
MBA [46]   | 33.66%    | 9.76%    | 2.
CRELU [45] | 34.57%    | 9.92%    | 2.
CIRC [43]  | 35.15%    | 10.23%   | 4.
DCNN [44]  | 33.57%    | 9.65%    | 1.62

Drawbacks: there are a few issues to be addressed for approaches that apply transform constraints to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not thin/deep ones (like GoogleNet, Residual Net). Secondly, the transfer assumptions are sometimes too strong to guide the learning, making the results unstable in some cases.

Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace loose and over-parametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing a 3x3 convolution into two 1x1 convolutions was used in [48], which achieved significant acceleration on object recognition. SqueezeNet [49] was proposed to replace 3x3 convolutions with 1x1 convolutions, creating a compact neural network with about 50x fewer parameters and comparable accuracy when compared to AlexNet.
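The negation transform of Eq. (4) can be illustrated with a small convolution module that stores only half of its filters and generates the other half as their negations, roughly halving the parameters of the layer. This is a hedged sketch of the idea in [45]: the module name, initialization and the omission of the max-pooling-based selection are our simplifications, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NegationConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        assert out_ch % 2 == 0, "half the filters are generated by negation"
        self.base = nn.Parameter(
            torch.randn(out_ch // 2, in_ch, kernel_size, kernel_size) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full filter bank = base filters plus their negations (the transform T).
        weight = torch.cat([self.base, -self.base], dim=0)
        return F.conv2d(x, weight, padding=self.base.shape[-1] // 2)

layer = NegationConv2d(in_ch=3, out_ch=64)
y = layer(torch.randn(1, 3, 32, 32))
print(y.shape)          # torch.Size([1, 64, 32, 32]); only 32 filters are stored
```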
V. KNOWLEDGE DISTILLATION

To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first proposed by Caruana et al. [50]. They trained a compressed/ensemble model of strong classifiers with pseudo-data labeled by the ensemble, and reproduced the output of the original larger network. However, the work is limited to shallow models. The idea was recently adopted in [51] as knowledge distillation (KD) to compress deep and wide networks into shallower ones, where the compressed model mimics the function learned by the complex model. The main idea of KD based approaches is to shift knowledge from a large teacher model into a small one by learning the class distributions output via softmax.

The work in [52] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm in which the student is penalized according to a softened version of the teacher's output. The framework compressed an ensemble of teacher networks into a student network of similar depth. The student was trained to predict the output as well as the classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [53] aimed to address the network compression problem by taking advantage of network depth. It proposed an approach to train thin but deep networks, called FitNets, to compress wide and shallower (but still deep) networks, extending the idea to allow for thinner and deeper student models. In order to learn from the intermediate representations of the teacher network, FitNet made the student mimic the full feature maps
of the teacher. However, such assumptions are too strict, since the capacities of the teacher and the student may differ greatly.

All the above approaches are validated on the MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets, and experimental results show that these methods match or outperform the teacher's performance, while requiring notably fewer parameters and multiplications.

There are several extensions along this direction of distilling knowledge. The work in [54] trained a parametric student model to approximate a Monte Carlo teacher. The proposed framework used online training, and used deep neural networks for the student model. Different from previous works which represented the knowledge using softened label probabilities, [55] represented the knowledge by using the neurons in the higher hidden layers, which preserve as much information as the label probabilities but are more compact. The work in [56] accelerated the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. The techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [57] proposed Attention Transfer (AT) to relax the assumption of FitNet: they transferred the attention maps, which are summaries of the full activations.

Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One of them is that KD can only be applied to classification tasks with a softmax loss function, which hinders its usage. Another drawback is that the model assumptions are sometimes too strict to make the performance competitive with other types of approaches.
VI. OTHER TYPES OF APPROACHES

We first summarize the works utilizing attention-based methods. Note that the attention-based mechanism [58] can reduce computations significantly by learning to selectively focus or "attend" to a few task-relevant input regions. The work in [59] introduced the dynamic capacity network (DCN) that combined two types of modules: small sub-networks with low capacity, and large ones with high capacity. The low-capacity sub-networks were active on the whole input to first find the task-relevant areas, and then the attention mechanism was used to direct the high-capacity sub-networks to focus on the task-relevant regions. By doing this, the size of the CNN model can be significantly reduced.

Following this direction, the work in [60] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a sparsely-gated mixture-of-experts layer (MoE). The MoE module consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of the experts to process each input. In [61], dynamic deep neural networks (D2NN) were introduced, which are a type of feed-forward deep neural network that selects and executes a subset of D2NN neurons based on the input.

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [44], [62]. Network architectures such as GoogleNet or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this idea. However, these architectures have not fully optimized the utilization of the computing resources inside the network. This problem was noted by Szegedy et al. [62] and motivated them to increase the depth and width of the network while keeping the computational budget constant.

The work in [63] targeted Residual Network based models with a spatially varying computation time, called stochastic depth, which enables the seemingly contradictory setup of training short networks and using deep networks at test time. It starts with very deep networks and, during training, for each mini-batch randomly drops a subset of layers and bypasses them with the identity function. Following this direction, the work in [64] proposed pyramidal residual networks with stochastic depth. In [65], Wu et al. proposed an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation. Veit et al. exploited convolutional networks with adaptive inference graphs to adaptively define their network topology conditioned on the input image [66].

Other approaches to reduce the convolutional overhead include using FFT based convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the pooling operations with a more general stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. These works only aim to speed up the computation, not to reduce the memory storage.
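The stochastic depth idea of [63] summarized above can be sketched for a single residual block as follows: during training the residual branch is randomly skipped and bypassed with the identity, and at test time its output is scaled by the survival probability. The block body and the survival probability are illustrative placeholders, not the configuration used in [63].

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels: int, p_survive: float = 0.8):
        super().__init__()
        self.p_survive = p_survive
        self.body = nn.Sequential(                      # placeholder residual branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return torch.relu(x + self.body(x))     # block kept for this mini-batch
            return x                                    # block skipped: identity bypass
        return torch.relu(x + self.p_survive * self.body(x))  # expected value at test time

block = StochasticDepthBlock(16)
print(block(torch.randn(2, 16, 8, 8)).shape)
```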
VII. BENCHMARKS, EVALUATION AND DATABASES

In the past five years, the deep learning community has made great efforts on benchmark models. One of the most well-known models used in compression and acceleration for CNNs is AlexNet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures are used as baseline models in many works, including network in networks (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly used in several typical compression methods.

The standard criteria to measure the quality of model compression and acceleration are the compression rate and the speedup rate. Assume that a is the number of parameters in the original model M and a* is that of the compressed model M*; then the compression rate α(M, M*) of M* over M is

α(M, M*) = a / a*.   (8)
TABLE IV
SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.

Baseline Models          | Representative Works
Alexnet [1]              | structural matrix [29], [30], [32]; low-rank factorization [40]
Network in network [73]  | low-rank factorization [40]
VGG nets [74]            | transferred filters [44]; low-rank factorization [40]
Residual networks [75]   | compact filters [49], stochastic depth [63]; parameter sharing [24]
All-CNN-nets [72]        | transferred filters [45]
LeNets [71]              | parameter sharing [24]; parameter pruning [20], [22]
Another widely used measurement is the index space saving, defined in several papers [30], [35] as

β(M, M*) = (a - a*) / a*,   (9)

where a and a* are the number of dimensions of the index space in the original model and that of the compressed model, respectively.

Similarly, given the running time s of M and s* of M*, the speedup rate δ(M, M*) is defined as

δ(M, M*) = s / s*.   (10)

Most works use the average training time per epoch to measure the running time, while in [30], [35] the average testing time was used. Generally, the compression rate and the speedup rate are highly correlated, as smaller models often result in faster computation for both the training and the testing stages.

Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computational time. However, for different applications with different CNN designs, the relation between parameter size and computational time may differ. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers, while for image classification tasks the floating-point operations are mainly in the first few convolutional layers, since each filter is convolved with the whole image, which is usually very large at the beginning. Thus, compression and acceleration of the network should focus on different types of layers for different applications.
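A small sketch of how the compression rate of Eq. (8) and the speedup rate of Eq. (10) can be measured in practice for a PyTorch model is given below; the toy models, batch size and timing loop are illustrative assumptions only.

```python
import time
import torch
import torch.nn as nn

def num_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def avg_inference_time(model: nn.Module, x: torch.Tensor, runs: int = 50) -> float:
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs

original = nn.Linear(4096, 4096)
compressed = nn.Sequential(nn.Linear(4096, 256), nn.Linear(256, 4096))
x = torch.randn(8, 4096)

alpha = num_params(original) / num_params(compressed)                          # Eq. (8)
delta = avg_inference_time(original, x) / avg_inference_time(compressed, x)    # Eq. (10)
print(f"compression rate ~ {alpha:.1f}x, speedup ~ {delta:.1f}x")
```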
VIII. DISCUSSION AND CHALLENGES

In this paper, we summarized recent efforts on compressing and accelerating deep neural networks (DNNs). Here we discuss in more detail how to choose among the different compression approaches, and possible challenges/solutions in this area.

A. General Suggestions

There is no golden rule to measure which approach is the best. How to choose the proper method really depends on the applications and requirements. Here is some general guidance we can provide:

- If the applications need compacted models derived from pre-trained models, you can choose either pruning & sharing or low-rank factorization based methods. If you need end-to-end solutions for your problem, the low-rank and transferred convolutional filter approaches could be considered.
- For applications in some specific domains, methods with a human prior (like the transferred convolutional filters and the structural matrix) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters could work well, as medical images (like organs) do have the rotation transformation property.
- Usually the approaches of pruning & sharing can give a reasonable compression rate without hurting the accuracy. Thus, for applications which require stable model accuracy, it is better to utilize pruning & sharing.
- If your problem involves small/medium-size datasets, you can try the knowledge distillation approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust for datasets which are not large.
- As we mentioned before, techniques of the four groups are orthogonal. It is reasonable to combine two or three of them to maximize the performance. For some specific applications, like object detection, which require both convolutional and fully connected layers, you can compress the convolutional layers with a low-rank based method and the fully connected layers with a pruning technique.

B. Technique Challenges

Techniques for deep model compression and acceleration are still at an early stage, and the following challenges still need to be addressed.

- Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, they should provide more plausible ways to configure the compressed models.
- Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging, because removing channels might dramatically change the input of the following layer.
- As we mentioned before, methods of structural matrix and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of such prior knowledge.
- The methods of knowledge distillation provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worthwhile to develop KD-based approaches and to explore how to improve their performance.
- Hardware constraints on various small platforms (e.g., mobile, robotic, self-driving cars) remain a major problem hindering the extension of deep CNNs. How to make full use of the limited computational resources and how to design special compression methods for such platforms are still challenges that need to be addressed.
- Despite the great achievements of these compression approaches, the black-box mechanism is still the key barrier to adoption. Exploring the knowledge interpretability is still an important problem.
C. Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on the recent learning-to-learn strategies [76], [77]. This framework provides a mechanism that allows the algorithm to automatically learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement learning to efficiently sample the design space and improve the model compression has also been tried [78].

Channel pruning provides efficiency benefits on both CPU and GPU because no special implementation is required, but it is also challenging to handle the input configuration. One possible solution is to use training-based channel pruning methods [79], which focus on imposing sparse constraints on weights during training. However, training from scratch for such methods is costly for very deep CNNs. In [80], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.
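To make the channel pruning discussion concrete, the sketch below ranks the filters of a convolutional layer by the l1-norm of their weights (one common criterion, in the spirit of [28]) and drops the weakest ones, slicing the next layer's input channels to match. It is a simplified illustration: the iterative two-step optimization of [80], batch-norm handling and fine-tuning are omitted, and all names are ours.

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, next_conv: nn.Conv2d, keep_ratio: float = 0.5):
    """Keep the filters of `conv` with the largest l1-norm and slice `next_conv` to match."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    scores = conv.weight.data.abs().sum(dim=(1, 2, 3))          # l1-norm per output filter
    keep = torch.argsort(scores, descending=True)[:n_keep]

    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data.copy_(conv.weight.data[keep])
    if conv.bias is not None:
        new_conv.bias.data.copy_(conv.bias.data[keep])

    new_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding, bias=next_conv.bias is not None)
    new_next.weight.data.copy_(next_conv.weight.data[:, keep])  # drop matching input channels
    if next_conv.bias is not None:
        new_next.bias.data.copy_(next_conv.bias.data)
    return new_conv, new_next

c1, c2 = nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
p1, p2 = prune_conv_channels(c1, c2, keep_ratio=0.5)
print(p2(p1(torch.randn(1, 3, 32, 32))).shape)   # torch.Size([1, 128, 32, 32])
```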
Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the knowledge distillation (KD) approaches. Instead of directly reducing and transferring parameters, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task [81], [82]. The intuition is that if a neuron is activated in certain regions or samples, this implies that these regions or samples share some common properties that may relate to the task.

For methods with convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approaches in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied on 2D filters or matrices, and 2) learn the transformation jointly with all the model parameters.

Regarding the use of CNNs on small platforms, proposing some general/unified approaches is one direction. Wang et al. [83] presented a feature map dimensionality reduction method by excavating and removing redundancy in feature maps generated from different filters, which could also preserve the intrinsic information of the original network. The idea can be extended to make CNNs more applicable to different platforms. The work in [84] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning, to make deep CNNs work on mobile devices.

Beyond the classification task, people are also adapting the compacted models to other tasks [85]-[87]. We would like to see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).

IX. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying the paper. This research is supported by National Science Foundation of China with Grant number 61401169.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, "Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification," CoRR, vol. abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in NIPS, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," CoRR, vol. abs/1412.6115, 2014.
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on cpus," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning (ICML'15), 2015, pp. 1737-1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," International Conference on Learning Representations (ICLR), 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," CoRR, vol. abs/1612.01543, 2016.
[12] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015, pp. 3123-3131.
[13] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in ECCV, 2016.
[15] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," CoRR, vol. abs/1606.01981, 2016.
[16] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," CoRR, vol. abs/1611.01600, 2016.
[17] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," CoRR, vol. abs/1510.03009, 2015.
[18] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in Advances in Neural Information Processing Systems 1, 1989, pp. 177-185.
[19] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, 1990, pp. 598-605.
[20] B. Hassibi, D. G. Stork, and S. C. R. Com, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993, pp. 164-171.
[21] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," in Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 2015, pp. 31.1-31.12.
[22] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), 2015.
[23] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," JMLR Workshop and Conference Proceedings, 2015.
[24] K. Ullrich, E. Meeds, and M. Welling, "Soft weight-sharing for neural network compression," CoRR, vol. abs/1702.04008, 2017.
[25] V. Lebedev and V. S. Lempitsky, "Fast convnets using group-wise brain damage," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2554-2564.
[26] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact cnns," in European Conference on Computer Vision, Amsterdam, the Netherlands, 2016, pp. 662-677.
[27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems 29, 2016, pp. 2074-2082.
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," CoRR, vol. abs/1608.08710, 2016.
[29] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Advances in Neural Information Processing Systems 28, 2015, pp. 3088-3096.
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in International Conference on Computer Vision (ICCV), 2015.
[31] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, "Fast neural networks with circulant projections," CoRR, vol. abs/1502.03436, 2015.
[32] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, "Deep fried convnets," in International Conference on Computer Vision (ICCV), 2015.
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991, pp. 215-236.
[34] M. V. Rakhuba and I. V. Oseledets, "Fast multidimensional convolution in low-rank tensor formats via cross approximation," SIAM J. Scientific Computing, vol. 37, no. 2, 2015.
[35] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "Acdc: A structured efficient linear layer," in International Conference on Learning Representations (ICLR), 2016.
[36] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 2013, pp. 2754-2761.
[37] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Advances in Neural Information Processing Systems 27, 2014, pp. 1269-1277.
[38] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[39] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned cp-decomposition," CoRR, vol. abs/1412.6553, 2014.
[40] C. Tai, T. Xiao, X. Wang, and W. E, "Convolutional neural networks with low-rank regularization," vol. abs/1511.06067, 2015.
[41] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems 26, 2013, pp. 2148-2156. [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper files/nips26/1053.pdf
[42] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2013.
[43] T. S. Cohen and M. Welling, "Group equivariant convolutional networks," arXiv preprint arXiv:1602.07576, 2016.
[44] S. Zhai, Y. Cheng, and Z. M. Zhang, "Doubly convolutional neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 1082-1090.
[45] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv preprint arXiv:1603.05201, 2016.
[46] H. Li, W. Ouyang, and X. Wang, "Multi-bias non-linear activation in deep neural networks," arXiv preprint arXiv:1604.00676, 2016.
[47] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, "Exploiting cyclic symmetry in convolutional neural networks," in Proceedings of the 33rd International Conference on Machine Learning (ICML'16), 2016.
[48] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, inception-resnet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016.
[49] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," CoRR, vol. abs/1612.01051, 2016.
[50] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), 2006, pp. 535-541.
[51] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Advances in Neural Information Processing Systems 27, 2014, pp. 2654-2662.
[52] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," CoRR, vol. abs/1503.02531, 2015.
[53] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," CoRR, vol. abs/1412.6550, 2014.
[54] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," in Advances in Neural Information Processing Systems 28, 2015, pp. 3420-3428.
[55] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, "Face model compression by distilling knowledge from neurons," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 3560-3566.
[56] T. Chen, I. J. Goodfellow, and J. Shlens, "Net2net: Accelerating learning via knowledge transfer," CoRR, vol. abs/1511.05641, 2015.
[57] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," CoRR, vol. abs/1612.03928, 2016.
[58] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[59] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, "Dynamic capacity networks," in Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York City, NY, USA, 2016, pp. 2549-2558.
[60] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," 2017.
[61] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583-1597, 2016.
[62] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Computer Vision and Pattern Recognition (CVPR), 2015.
[63] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep Networks with Stochastic Depth, 2016.
[64] Y. Yamada, M. Iwamura, and K. Kise, "Deep pyramidal residual networks with separated stochastic depth," CoRR, vol. abs/1612.01230, 2016.
[65] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, "Blockdrop: Dynamic inference paths in residual networks," in CVPR, 2018.
[66] A. Veit and S. Belongie, "Convolutional networks with adaptive inference graphs," 2018.
[67] M. Mathieu, M. Henaff, and Y. Lecun, Fast Training of Convolutional Networks through FFTs, 2014.
[68] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 4013-4021.
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, [89]L. Cao, S.-F. Chang, N. Codella, C. V. Cotton, D. Ellis, L. Gong,
pp. 40134021. M. Hill, G. Hua, J. Kender, M. Merler, Y. Mu, J. R. Smith, and F. X.
[69]S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Yu, “Ibm research and columbia university trecvid-2012 multimedia
Feris, “S3pool: Pooling with stochastic spatial sampling,”CoRR, vol. event detection (med), multimedia event recounting (mer), and semantic
abs/1611.05138, 2016. indexing (sin) systems,” 2012.
[70]F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving
pooling in deep networks,” inProceedings of the IEEE Conference on
Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he was a Research Staff Member at the IBM T.J. Watson Research Center. Yu got his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in deep generative models, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. His current research interests are in deep learning, particularly few-shot learning and deep generative models. He also works on many applications in computer vision and robotics vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of Electronic Information and Communications, Wuhan, China. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interests include big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University. He serves as the Associate Dean of the School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@ -1,391 +0,0 @@
Channel Pruning for Accelerating Very Deep Neural Networks
Yihui He * Xiangyu Zhang Jian Sun
Xi'an Jiaotong University Megvii Inc. Megvii Inc.
Xi'an, 710049, China Beijing, 100190, China Beijing, 100190, China
heyihui@stu.xjtu.edu.cn zhangxiangyu@megvii.com sunjian@megvii.com
Abstract

In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5× speed-up along with only a 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet and Xception and suffers only 1.4% and 1.0% accuracy loss under 2× speed-up respectively, which is significant.

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).
1. Introduction

Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one.

Structured simplification mainly involves tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, the feature map width (number of channels) cannot be reduced, which makes it difficult to decompose the 1×1 convolutional layers favored by modern networks (e.g., GoogLeNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it can achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces the feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.

Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which can adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights. The reported speed-up ratios are very limited.

In this paper, we propose a new inference-time approach for channel pruning, utilizing redundancy between channels. Inspired by the tensor factorization improvement through feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing the reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternating steps: channel selection and feature map reconstruction. In one step, we figure out the most representative channels and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels by linear least squares. We take the two steps alternately. Further, we approximate the network layer-by-layer, with the accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]).

For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms previous state-of-the-art results. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4% and 1.0% accuracy loss respectively.

* This work was done when Yihui He was an intern at Megvii Inc.

Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation that two channels are pruned for feature map B. Thus the corresponding channels of the filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C; kh × kw: kernel size.

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22].

Optimized implementation based methods [35, 47, 27, 4] accelerate convolution with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity.

Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weight magnitude. [16] can accelerate fully connected layers by up to 50×. However, in practice, the actual speed-up may be very dependent on the implementation.

Tensor factorization [22, 28, 13, 24] decomposes weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into a 3×3 and 1×1 combination, driven by feature map redundancy.

Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely exploited.

Inference-time channel pruning is challenging, as reported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, which mainly operates on the fully connected layers. For data-free approaches [31, 3], results at speed-up ratios like 5× have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible for very deep models and large datasets. [31] is sometimes even worse than the naive solution from our observation (Sec. 4.1.1).

3. Approach

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.

3.1. Formulation

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining the outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, the filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction: we need to reconstruct the following feature maps using the selected channels.

Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels by linear least squares. We take the two steps alternately.

Formally, to prune a feature map with c channels, we consider applying n×c×kh×kw convolutional filters W on N×c×kh×kw input volumes X sampled from this feature map, which produces an N×n output matrix Y. Here, N is the number of samples, n is the number of output channels, and kh, kw are the kernel size. For simple representation, the bias term is not included in our formulation. To prune the input channels from c to a desired c' (0 ≤ c' ≤ c), while minimizing the reconstruction error, we formulate our problem as follows:

$$\arg\min_{\beta, W}\ \frac{1}{2N}\left\| Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\|_F^2 \quad \text{subject to}\ \|\beta\|_0 \le c' \tag{1}$$

Here $\|\cdot\|_F$ is the Frobenius norm, $X_i$ is the $N \times k_h k_w$ matrix sliced from the $i$-th channel of the input volumes $X$ ($i = 1, \dots, c$), and $W_i$ is the $n \times k_h k_w$ filter weight matrix sliced from the $i$-th channel of $W$. $\beta$ is a coefficient vector of length $c$ for channel selection, and $\beta_i$ is its $i$-th entry. Notice that if $\beta_i = 0$, then $X_i$ is no longer useful and can be safely pruned from the feature map; $W_i$ can also be removed.

Optimization. Solving the $\ell_0$ minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the $\ell_0$ to $\ell_1$ regularization:

$$\arg\min_{\beta, W}\ \frac{1}{2N}\left\| Y - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\|_F^2 + \lambda \|\beta\|_1 \quad \text{subject to}\ \|\beta\|_0 \le c',\ \forall i\ \|W_i\|_F = 1 \tag{2}$$

$\lambda$ is a penalty coefficient. By increasing $\lambda$, there will be more zero terms in $\beta$ and one can get a higher speed-up ratio. We also add the constraint $\forall i\ \|W_i\|_F = 1$ to this formulation, which avoids the trivial solution.

Now we solve this problem in two folds. First, we fix $W$ and solve $\beta$ for channel selection. Second, we fix $\beta$ and solve $W$ to minimize the reconstruction error.

(i) The subproblem of $\beta$. In this case, $W$ is fixed and we solve $\beta$ for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection:

$$\hat{\beta}^{\mathrm{LASSO}}(\lambda) = \arg\min_{\beta}\ \frac{1}{2N}\left\| Y - \sum_{i=1}^{c} \beta_i Z_i \right\|_F^2 + \lambda \|\beta\|_1 \quad \text{subject to}\ \|\beta\|_0 \le c' \tag{3}$$

Here $Z_i = X_i W_i^{\top}$ (size $N \times n$). We will ignore the $i$-th channel if $\beta_i = 0$.

(ii) The subproblem of $W$. In this case, $\beta$ is fixed. We utilize the selected channels to minimize the reconstruction error. We can find the optimized solution by least squares:

$$\arg\min_{W'}\ \left\| Y - X' (W')^{\top} \right\|_F^2 \tag{4}$$

Here $X' = [\beta_1 X_1\ \ \beta_2 X_2\ \ \dots\ \ \beta_i X_i\ \ \dots\ \ \beta_c X_c]$ (size $N \times c k_h k_w$), and $W'$ is the $n \times c k_h k_w$ reshaped $W$, $W' = [W_1\ W_2\ \dots\ W_i\ \dots\ W_c]$. After obtaining the result $W'$, it is reshaped back to $W$. Then we assign $\beta_i \leftarrow \beta_i \|W_i\|_F$ and $W_i \leftarrow W_i / \|W_i\|_F$, so that the constraint $\forall i\ \|W_i\|_F = 1$ is satisfied.

We alternately optimize (i) and (ii). In the beginning, $W$ is initialized from the trained model and $\lambda = 0$, namely no penalty, so $\|\beta\|_0 = c$. We gradually increase $\lambda$. For each change of $\lambda$, we iterate these two steps until $\|\beta\|_0$ is stable. After $\|\beta\|_0 \le c'$ is satisfied, we obtain the final solution $W$ from $\{\beta_i W_i\}$. In practice, we found that the two-step iteration is time consuming. So we apply (i) multiple times, until $\|\beta\|_0 \le c'$ is satisfied, and then apply (ii) just once to obtain the final result. From our observation, this result is comparable with that of the two-step iterations. Therefore, in the following experiments, we adopt this approach for efficiency.

Discussion: Some recent works [48, 1, 17] (though training based) also introduce the $\ell_1$-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduce sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.

3.2. Whole Model Pruning

Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This can be formalized as:

$$\arg\min_{\beta, W}\ \frac{1}{2N}\left\| Y' - \sum_{i=1}^{c} \beta_i X_i W_i^{\top} \right\|_F^2 \quad \text{subject to}\ \|\beta\|_0 \le c' \tag{5}$$

Different from Eqn. 1, $Y$ is replaced by $Y'$, which comes from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.

3.3. Pruning Multi-Branch Networks

The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1×1, 3×3, 1×1; Fig. 3, left). Layers other than the first and last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) cannot be easily pruned, since it is shared with the shortcut. For the last layer, the accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows.
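Before turning to those variants, the two-step procedure of Secs. 3.1-3.2 can be made concrete in code. The following is a minimal sketch only: it assumes NumPy arrays for the sampled volumes and uses scikit-learn's Lasso as the l1 solver (the experiments do state that scikit-learn provides the solvers, but the function name prune_layer_channels, the λ schedule, and the exact array layout below are our illustrative assumptions, not the authors' released implementation).

```python
import numpy as np
from sklearn.linear_model import Lasso

def prune_layer_channels(X, W, Y, c_prime, lam=1e-4, lam_growth=1.5):
    """Sketch of the two-step single-layer pruning of Sec. 3.1.

    X: (N, c, kh, kw) input volumes sampled from the layer's input feature map
       (e.g. gathered with an im2col-style extraction at sampled locations).
    W: (n, c, kh, kw) filters of the layer.
    Y: (N, n) responses of the unpruned layer at the same locations.
    Keeps at most c_prime input channels and refits the remaining filters.
    """
    N, c, kh, kw = X.shape
    n = W.shape[0]

    # Per-channel responses Z_i = X_i W_i^T, arranged into an (N*n, c) design
    # matrix so that Y ~ sum_i beta_i * Z_i becomes an ordinary Lasso problem.
    Z = np.einsum('nchw,ochw->nco', X, W)        # (N, c, n)
    A = Z.transpose(0, 2, 1).reshape(N * n, c)   # column i holds vec(Z_i)
    y = Y.reshape(N * n)

    # Step (i): grow the l1 penalty until no more than c_prime channels keep
    # a nonzero coefficient (the paper applies step (i) repeatedly).
    beta = np.ones(c)
    while np.count_nonzero(beta) > c_prime:
        beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A, y).coef_
        lam *= lam_growth
    keep = np.flatnonzero(beta)

    # Step (ii): least-squares refit of the kept channels against Y, which
    # absorbs the beta scaling into the new filter weights.
    X_keep = X[:, keep].reshape(N, -1)                   # (N, c'*kh*kw)
    W_flat, *_ = np.linalg.lstsq(X_keep, Y, rcond=None)  # (c'*kh*kw, n)
    W_new = W_flat.T.reshape(n, len(keep), kh, kw)
    return keep, W_new
```

Consistent with the efficiency note above, step (i) is applied repeatedly with a growing penalty and step (ii) only once at the end; the unit-norm constraint on each W_i is handled implicitly here because the final least-squares refit absorbs the β scaling into the weights.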
Figure 3. Illustration of the multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement, where c_x denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width can be reduced. As for the last layer, rather than approximating Y2, we try to approximate Y1 + Y2 directly (Sec. 3.3, Last layer of residual branch).

Last layer of residual branch: Shown in Fig. 3, the output layer of a residual block consists of two inputs: feature maps Y1 and Y2 from the shortcut and the residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1 and Y2 are the original feature maps before pruning. Y2 can be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 cannot be recovered directly. To compensate for this error, the optimization goal of the last layer is changed from Y2 to Y1 - Y1' + Y2, which does not change our optimization. Here, Y1' is the current feature map after the previous layers have been pruned. When pruning, volumes should be sampled correspondingly from these two branches.

First layer of residual branch: Illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. Differently, we sample the selected channels on the shared feature maps to construct a new input for the later convolution, as shown in Fig. 3 (right). The computational cost of this operation can be ignored. More importantly, after introducing feature map sampling, the convolution is still "regular".

Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original one. From our experiments, it improves top-5 accuracy by 0.5% for 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it produces "irregular" convolutional layers, which need special library implementation support. We do not adopt it in the following experiments.

4. Experiment

We evaluate our approach on the popular VGG Nets [43], ResNet [18], and Xception [7], on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].

For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solvers' implementation. For channel pruning, we found that it is enough to extract 5000 images, with 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224×224 pixels. We could gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e-5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random 224×224 crops and mirroring.

4.1. Experiments with VGG-16

VGG-16 [43] is a 16-layer single-path convolutional neural network with 13 convolutional layers. It is widely used in recognition, detection and segmentation, etc. Single-view top-5 accuracy for VGG-16 is 89.9% (reference model from http://www.vlfeat.org/matconvnet/pretrained/).

4.1.1 Single Layer Pruning

In this subsection, we evaluate single layer acceleration performance using our algorithm in Sec. 3.1. For better understanding, we compare our algorithm with two naive channel selection strategies. first k selects the first k channels. max response selects channels based on the corresponding filters that have high absolute weight sums [31]. For fair comparison, we obtain the feature map indexes selected by each of them, then perform reconstruction (Sec. 3.1 (ii)). We hope that this demonstrates the importance of channel selection. Performance is measured by the increase of error after a certain layer is pruned, without fine-tuning, as shown in Fig. 4.
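For reference, the two naive baselines are easy to state in code. The helpers below are a hedged sketch (the function names are ours): first_k simply keeps the leading k channel indexes, while max_response scores each input channel by the absolute weight sum of the corresponding filter slices, one reasonable reading of the criterion attributed to [31].

```python
import numpy as np

def first_k(num_channels, k):
    """Baseline 'first k': keep the first k channel indexes."""
    return np.arange(k)

def max_response(W, k):
    """Baseline 'max response': keep the k input channels whose filter slices
    of the (n, c, kh, kw) filter bank W have the largest absolute weight sum."""
    scores = np.abs(W).sum(axis=(0, 2, 3))   # one score per input channel
    return np.argsort(scores)[::-1][:k]
```

Both return channel indexes that can be fed to the same least-squares reconstruction of Sec. 3.1 (ii), which is how the baselines are evaluated in this comparison.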
Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. Each panel corresponds to one VGG-16 layer (conv1_1, conv2_1, conv3_1, conv3_2, conv4_1, conv4_2), plotting increase of error (%) against speed-up ratio for first k, max response, and ours. To verify the importance of channel selection referred to in Sec. 3.1, we considered two naive baselines. first k selects the first k feature maps. max response selects channels based on the absolute sum of the corresponding weight filters [31]. Our approach is consistently better (smaller is better).
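The speed-up ratio on the x-axis of Fig. 4 is the usual FLOPs ratio of a convolutional layer before and after its input channels are pruned. A small helper (our naming, assuming the spatial output size and output channel count are unchanged) makes the bookkeeping explicit:

```python
def conv_flops(n_out, c_in, kh, kw, h_out, w_out):
    """Multiply-accumulate count of one convolutional layer."""
    return n_out * c_in * kh * kw * h_out * w_out

def layer_speedup(c_in, c_kept):
    """Theoretical speed-up when only input channels are reduced: FLOPs scale
    linearly in c_in, so the ratio is simply c_in / c_kept."""
    return c_in / c_kept

# Example: a 3x3 layer on a 56x56 output, pruned from 256 to 64 input channels.
full = conv_flops(512, 256, 3, 3, 56, 56)
pruned = conv_flops(512, 64, 3, 3, 56, 56)
assert full / pruned == layer_speedup(256, 64) == 4.0
```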
As expected, the error increases as the speed-up ratio increases. Our approach is consistently better than the other approaches in different convolutional layers under different speed-up ratios. Unexpectedly, max response is sometimes even worse than first k. We argue that max response ignores correlations between different filters. Filters with large absolute weights may have strong correlation, so selection based on filter weights is less meaningful. Correlation on feature maps is worth exploiting. We can find that channel selection affects the reconstruction error a lot. Therefore, it is important for channel pruning.

Also notice that channel pruning gradually becomes harder from shallower to deeper layers. It indicates that shallower layers have much more redundancy, which is consistent with [52]. We could prune more aggressively on shallower layers in whole model acceleration.

Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better). Entries are the increase of top-5 error (1-view, baseline 89.9%).

Solution | 2× | 4× | 5×
Jaderberg et al. [22] ([52]'s impl.) | - | 9.7 | 29.7
Asym. [52] | 0.28 | 3.84 | -
Filter pruning [31] (fine-tuned, our impl.) | 0.8 | 8.6 | 14.6
Ours (without fine-tune) | 2.7 | 7.9 | 22.0
Ours (fine-tuned) | 0 | 1.0 | 1.7

4.1.2 Whole Model Pruning

Whole model acceleration results under 2×, 4× and 5× are shown in Table 1. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively on shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1 : 1.5. conv5_x is not pruned, since these layers contribute only 9% of the computation in total and are not redundant.

After fine-tuning, we could reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22] without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and its absolute speed-up ratio on GPU is much higher (Table 3).
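As a rough illustration of how the whole-model procedure of Sec. 3.2 and the layer-wise keep ratios above fit together, the loop below reuses the prune_layer_channels sketch from Sec. 3.1. It is only a sketch: the layers/keep_ratios/sample_inputs containers and the 'extract' callable are illustrative stand-ins for however a framework exposes sampled volumes, not an interface defined by the paper.

```python
import numpy as np

def prune_whole_model(layers, keep_ratios, sample_inputs):
    """Sequential whole-model pruning in the spirit of Sec. 3.2 (a sketch).

    layers: list of dicts with 'W' ((n, c, kh, kw) filters) and a callable
            'extract' that gathers (X, Y') volumes -- X from the *current*
            pruned network, Y' from the *original* unpruned network (Eqn. 5),
            so the accumulated error is accounted for.
    keep_ratios: per-layer fraction of channels to keep, e.g. smaller for
            conv1_x..conv3_x than for conv4_x (the 1 : 1.5 rule above).
    """
    for layer, ratio in zip(layers, keep_ratios):
        X, Y_orig = layer['extract'](sample_inputs)
        c = layer['W'].shape[1]
        c_prime = max(1, int(round(ratio * c)))
        keep, W_new = prune_layer_channels(X, layer['W'], Y_orig, c_prime)
        layer['W'] = W_new       # thinner layer replaces the original
        layer['keep'] = keep     # producers of the dropped channels can go too
```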
Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52]. Demonstrated in Table 2, our three-cardinality acceleration (spatial factorization, channel factorization, and channel pruning, denoted by 3C) outperforms previous state-of-the-art results. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: 1×3, 3×1, 1×1.

Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better). Entries are the increase of top-5 error (1-view, baseline 89.9%).

Solution | 4× | 5×
Asym. 3D [52] | 0.9 | 2.0
Asym. 3D (fine-tuned) [52] | 0.3 | 1.0
Our 3C | 0.7 | 1.3
Our 3C (fine-tuned) | 0.0 | 0.3

We apply spatial factorization, channel factorization, and our channel pruning together sequentially, layer by layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality.

4.1.3 Comparisons of Absolute Performance

We further evaluate the absolute performance of acceleration on GPU. The results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN 5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged over 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead; they could not gain much absolute speed-up. Though our approach also encounters a performance decrease, it generalizes better on GPU than the other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because the current library and hardware prefer a single large convolution over several small ones.

4.1.4 Comparisons with Training from Scratch

Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with from-scratch counterparts. To be fair, we evaluated both the from-scratch counterpart and a normal-setting network that has the same computational complexity and the same architecture.

Shown in Table 4, we observed that it is difficult for from-scratch counterparts to reach competitive accuracy; our model outperforms the from-scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] finding that a model can be easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers.

For from scratch (uniformed), the filters in each layer are reduced by half (e.g., conv1_1 is reduced from 64 to 32). We can observe that normal-setting networks of the same complexity cannot reach the same accuracy either. This consolidates our idea that there is much redundancy in networks while training. However, redundancy can be opted out of at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches.

Notice that there is a 0.6% gap between the from-scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even a thinner one. Further research could adapt our approach to thin model exploration.

4.1.5 Acceleration for Detection

VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16 for Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainval images and 5k test images. The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN.

The actual running time of Faster R-CNN is 220 ms/image, of which the convolutional layers contribute about 64%. We got an actual time of 94 ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop for our 2× model, which is not harmful for practical consideration.

4.2. Experiments with Residual Architecture Nets

For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, 1×1 convolution is favored, which can hardly be factorized.
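Pruning these residual architectures relies on the Sec. 3.3 variants, which in practice amount to two small helpers around each block: sample the selected channels from the shared input before the first convolution, and aim the last layer's reconstruction at Y1 - Y1' + Y2 instead of Y2. The sketch below is illustrative only (our naming, assuming NumPy feature maps in NCHW layout):

```python
import numpy as np

def sample_input_channels(shared_input, keep):
    """Feature map sampling before the first conv of a residual branch: the
    shared input itself stays intact (the shortcut still needs all channels),
    only the residual branch reads the selected channel indexes."""
    return shared_input[:, keep]        # (N, c', H, W) view for the branch

def last_layer_target(Y1, Y1_pruned, Y2):
    """Reconstruction target for the last conv of a residual branch: Y1 - Y1' + Y2,
    so that the error accumulated on the parameter-free shortcut is compensated
    by the residual branch (Sec. 3.3, Last layer of residual branch)."""
    return Y1 - Y1_pruned + Y2
```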
Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better).

Model | Solution | Increased err. | GPU time/ms
VGG-16 | - | 0 | 8.144
VGG-16 (4×) | Jaderberg et al. [22] ([52]'s impl.) | 9.7 | 8.051 (1.01×)
VGG-16 (4×) | Asym. [52] | 3.8 | 5.244 (1.55×)
VGG-16 (4×) | Asym. 3D [52] | 0.9 | 8.503 (0.96×)
VGG-16 (4×) | Asym. 3D (fine-tuned) [52] | 0.3 | 8.503 (0.96×)
VGG-16 (4×) | Ours (fine-tuned) | 1.0 | 3.264 (2.50×)

Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms the scratch-trained counterparts (smaller is better). Original top-5 accuracy: 89.9%.

Solution | Top-5 err. | Increased err.
From scratch | 11.9 | 1.8
From scratch (uniformed) | 12.5 | 2.4
Ours | 18.0 | 7.9
Ours (fine-tuned) | 11.1 | 1.0

Table 5. 2× and 4× acceleration for Faster R-CNN detection.

Speedup | mAP | ∆mAP
Baseline | 68.7 | -
2× | 68.3 | 0.4
4× | 66.9 | 1.8

4.2.1 ResNet Pruning

ResNet complexity drops uniformly on each residual block. Guided by the single layer experiments (Sec. 4.1.1), we still prefer reducing shallower layers more heavily than deeper ones. Following a similar setting to Filter pruning [31], we keep 70% of the channels for sensitive residual blocks (res5 and blocks close to the positions where the spatial size changes, e.g., res3a, res3d). As for the other blocks, we keep 30% of the channels. With the multi-branch enhancement, we prune branch2a more aggressively within each residual block. The remaining channel ratios for branch2a, branch2b, and branch2c are 2 : 4 : 3 (e.g., given 30%, we keep 40%, 80%, and 60% respectively).

We evaluate the performance of the multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve by 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which can broadcast to every layer after it. In addition, the large input feature map width at the entry of each residual block is well reduced by our feature map sampling.

Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with the multi-branch enhancement (Sec. 3.3, smaller is better).

Solution | Increased err.
Ours | 8.0
Ours (enhanced) | 4.0
Ours (enhanced, fine-tuned) | 1.4

4.2.2 Xception Pruning

Since computational complexity has become important in model design, separable convolution has received much attention [49, 7]. Xception [7] is already spatially optimized, and tensor factorization on its 1×1 convolutional layers is destructive. Thanks to our approach, it can still be accelerated with graceful degradation. For ease of comparison, we adopt the Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and a complexity of 4450 MFLOPs.

We apply the multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Maybe because the Xception block is unstable, Batch Normalization layers must be maintained during pruning; otherwise it becomes nontrivial to fine-tune the pruned model.

Table 7. Comparisons for Xception-50, under a 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches. Most structured simplification methods are not effective on the Xception architecture (smaller is better).

Solution | Increased err.
Filter pruning [31] (our impl.) | 92.8
Filter pruning [31] (fine-tuned, our impl.) | 4.3
Ours | 2.9
Ours (fine-tuned) | 1.0

Shown in Table 7, after fine-tuning we only suffer a 1.0% increase of error under 2×. Filter pruning [31] can also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%. After training for 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as the results for VGG-16, since modern networks tend to have less redundancy by design.
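The experimental setup (Sec. 4) folds Batch Normalization into the preceding convolution before pruning, whereas for Xception the BN layers are kept; the standard folding identity is easy to write down. A minimal sketch, assuming inference-time BN with stored running statistics (the function and parameter names are ours):

```python
import numpy as np

def fold_batchnorm(conv_w, conv_b, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold a BatchNorm layer that follows a convolution into the conv weights.

    conv_w: (n, c, kh, kw) filters; conv_b: (n,) bias (zeros if absent).
    gamma, beta, running_mean, running_var: (n,) BN parameters and statistics.
    Returns (w_folded, b_folded) such that conv+BN equals the folded conv.
    """
    scale = gamma / np.sqrt(running_var + eps)          # per-output-channel scale
    w_folded = conv_w * scale[:, None, None, None]      # scale each filter
    b_folded = (conv_b - running_mean) * scale + beta   # shift the bias
    return w_folded, b_folded
```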
4.2.3 Experiments on CIFAR-10

Even though our approach is designed for large datasets, it also generalizes well on small datasets. We perform experiments on the CIFAR-10 dataset [25], which is favored by many acceleration studies. It consists of 50k images for training and 10k for testing, in 10 classes.

We reproduce ResNet-56, which has an accuracy of 92.8% (as a reference, the official ResNet-56 [18] has an accuracy of 93.0%). For 2× acceleration, we follow a similar setting to Sec. 4.2.1 (keeping the final stage unchanged, where the spatial size is 8×8). Shown in Table 8, our approach is competitive with the scratch-trained one even without fine-tuning, under 2× speed-up. After fine-tuning, our result is significantly better than both Filter pruning [31] and the scratch-trained one.

Table 8. 2× speed-up comparisons for ResNet-56 on CIFAR-10; the baseline accuracy is 92.8% (one view). We outperform previous approaches and the scratch-trained counterpart (smaller is better).

Solution | Increased err.
Filter pruning [31] (fine-tuned, our impl.) | 1.3
From scratch | 1.9
Ours | 2.0
Ours (fine-tuned) | 1.0

5. Conclusion

To conclude, current deep CNNs are accurate but have high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep networks. The reduced CNNs are inference-efficient networks that maintain accuracy, and they only require off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on ImageNet, CIFAR-10 and PASCAL VOC.

In the future, we plan to involve our approach at training time, instead of inference time only, which may also accelerate the training procedure.

References

[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262-2270, 2016.
[2] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.
[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. LCNN: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373-384, 1995.
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248-255, 2009.
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269-1277, 2014.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[12] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379-1387, 2016.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243-254. IEEE Press, 2016.
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[29] V. Lebedev and V. Lempitsky. Fast ConvNets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806-814, 2015.
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015.
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40-53, 2008.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163-2175, 2015.
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074-2082, 2016.
[49] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365-2369, 2013.
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943-1955, 2016.

Binary file not shown.