Received July 24, 2019, accepted July 30, 2019, date of publication August 5, 2019, date of current version August 15, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2933032
Structured Pruning of Convolutional Neural
Networks via L1 Regularization
CHEN YANG 1,2 , ZHENGHONG YANG 1,2 , ABDUL MATEEN KHATTAK 2,3 , LIU YANG 1,2 ,
WENXIN ZHANG 1,2 , WANLIN GAO 1,2 , AND MINJUAN WANG 1,2
1 Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
2 College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
3 Department of Horticulture, The University of Agriculture, Peshawar 25120, Pakistan
Corresponding authors: Wanlin Gao (wanlin_cau@163.com) and Minjuan Wang (minjuan@cau.edu.cn)
This work was supported by the Project of Scientific Operating Expenses from Ministry of Education of China under Grant 2017PT19.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Shen.
ABSTRACT Deep learning architecture has achieved amazing success in many areas with the recent advancements in convolutional neural networks (CNNs). However, real-time applications of CNNs are seriously hindered by their significant storage and computational costs. Structured pruning is a promising method to compress and accelerate CNNs, and it does not need special hardware or software for auxiliary calculation. Here, a simple structured pruning strategy is proposed to crop unimportant filters or neurons automatically during the training stage. The proposed method introduces a mask for all filters or neurons to evaluate their importance, so that the filters or neurons with zero masks can be removed. To achieve this, the proposed method adopts L1 regularization to zero out filters or neurons of CNNs. Experiments were conducted to assess the validity of this technique. They showed that the proposed approach could crop 90.4%, 95.6% and 34.04% of the parameters on LeNet-5, VGG-16, and ResNet-32, respectively, with a negligible loss of accuracy.

INDEX TERMS Convolutional neural networks, regularization, structured pruning.
I. INTRODUCTION
During recent years, convolutional neural networks (CNNs) [1] have accomplished successful applications in many areas such as image classification [2], object detection [3], neural style transfer [4], identity authentication [5], information security [6], speech recognition and natural language processing. However, these achievements were made by leveraging large-scale networks, which possess millions or even billions of parameters. Those large-scale networks rely heavily on GPUs to accelerate computation. Moreover, devices with limited resources, such as mobile, FPGA or embedded devices, have difficulties deploying CNNs in actual applications. Thus, it is critical to accelerate the inference of CNNs and reduce their storage for a wide range of applications [7].

According to the studies done so far, the major approaches for compressing deep neural networks can be categorized into four groups, i.e. low-rank decomposition [8], parameter quantization [9], knowledge distillation [10]-[13], and network pruning [14]. For a deep neural network (DNN) that has been trained, low-rank decomposition technology decomposes and approximates a tensor at a smaller scale to achieve compression. Low-rank decomposition achieves efficient speedup because it reduces the number of elements in the matrix. However, it can only decompose or approximate tensors one by one within every layer, and cannot discover the redundant parameters of a DNN. Besides, much research has been focused on network module designs that are smaller, more efficient and more sophisticated. These models, such as SqueezeNet [15], MobileNet [16] and ShuffleNet [17], are basically made up of low-resolution convolutions with fewer parameters and better performance.

At present, network pruning is a major focus of research, which not only accelerates DNNs but also reduces redundant parameters. Actually, using a large-scale network directly may provide state-of-the-art performance, so learning a large-scale network is needed. However, the optimum network architecture may not be known, and thus a massive redundancy exists in large neural networks. To combat this problem, network pruning is useful to remove redundant parameters, filters, channels or neurons, and to address the over-fitting issue.
FIGURE 1. The architecture of the layer with the mask. (a) The architecture of a convolutional layer with the mask. (b) The architecture of a fully-connected layer with the mask. The proposed approach chooses the unimportant filters and neurons (highlighted in yellow) by the order of magnitude of the mask value.
Network pruning techniques can be broadly categorized as structured pruning and non-structured pruning. Non-structured pruning aims to remove single parameters that have little influence on the accuracy of the network, and it is efficient and effective for compacting networks. Nonetheless, non-structured pruning is difficult to use widely in practical applications. Actually, the operation of convolution is reformulated as a matrix-by-matrix multiplication in many prevalent deep learning frameworks, which requires additional information to represent the pruned locations of a non-structured pruning method. Therefore, special hardware or software is needed to assist with the calculation, which may increase computation time. Instead, structured pruning directly removes entire filters, channels or neurons, so the remaining network architecture can be used directly by existing hardware. For example, Anwar et al. [18] employed particle filtering to structure the sparsity of convolutional neural networks at channel-wise, kernel-wise, and intra-kernel stride levels. At present, several structured pruning methods [24], [25], [27] are mainly based on the statistical information of parameters or activation outputs. These methods do not consider the loss and are unable to remove parameters during training. In addition, some methods, such as those mentioned by [19], [20], require layer-by-layer iterative pruning and accuracy recovery, which involves enormous calculations. On the contrary, the proposed approach links pruning with the minimization of the loss and can be implemented during training.

It is inspiring that the filters whose weights are all zero can be safely removed, because, whatever the input, they would not extract any features. This study presents a scheme to prune filters or neurons of fully-connected layers based on L1 regularization [21] to zero out the weights of some filters or neurons. Similar to this method, Wen et al. [31] adopted group LASSO regularization [40] to zero out filters. However, all the weights are required to compute an extra gradient, which is computationally expensive for a large-scale network. Contrarily, in the proposed method, a mask is introduced to address this issue and the regularization term is only the l1-norm of the mask, whose gradients are easy to calculate. In this method, the parameters of filters or neurons are multiplied by a mask to pick unimportant filters or neurons, and once the mask is zero the corresponding filter or neuron is removed. Here, though a mask is introduced for filters or neurons, the method does not change the architecture of the network. This allows other compression methods to be used together with the proposed technique. Similar to the proposed method, Lin et al. [32] also adopted a mask to identify unimportant filters or neurons, but the value of the mask could not be changed by training. In addition, removing unimportant filters or neurons may temporarily degrade accuracy, but the network can be retrained to recover performance. FIGURE 1 shows the framework of the proposed method.

In this article, a structured pruning technology is presented, which allows for simultaneously learning and removing unimportant filters or neurons of CNNs. The main contributions are as follows:
- A simple yet effective method based on L1 regularization is presented to compress CNN models during the training stage.
- A threshold is adopted to solve the optimization problem of the l1-norm. In this approach, only some mask values are required to be near zero, though not completely zero.
The details are provided in the following sections.

II. PREVIOUS WORK
The importance of compressing deep learning models before application is self-evident, especially for expanding the application scenarios of deep learning [11]. For example, a compressed deep learning model can be combined with edge computing [12] to enable Internet of Things devices to understand data. In this section, we review the contributions of others.
LeCun et al. [14] first proposed a saliency measurement method called Optimal Brain Damage (OBD) to selectively delete weights using second-derivative information of the error function. Later, Hassibi and Stork [22] proposed the Optimal Brain Surgeon (OBS) algorithm based on OBD. OBS not only removed unimportant weights but also automatically adjusted the remaining weights, which improved accuracy and generalization ability. All these methods are based on Taylor expansion (OBD and OBS are even required to compute the Hessian matrix), which may be computationally intensive, especially for large networks. In addition, they use a criterion of minimal increase in error on the training data. Guo et al. [23] introduced a binary matrix to dynamically choose important weights. Han et al. [24], [25] directly removed weights with values lower than a predefined threshold to compress networks, followed by retraining to recover accuracy. Considering that most filters in CNNs tend to be smooth in the spatial domain, Liu et al. [26] extended Guo's work to the frequency domain by applying the Discrete Cosine Transform (DCT) to filters in the spatial domain. However, these non-structured pruning technologies are hard to use in real applications, because extra software or hardware is required for the calculation.

Directly cropping a trained model by the value of its weights is a widely used method. Normally it relies on finding an effective evaluation to judge the importance of the weights and cutting the unimportant connections or filters to reduce the redundancy of a model. Hu et al. [27] observed that the activation outputs of a significant portion of neurons in a large network are zero, whatever inputs the network receives. These zero-activation neurons are unimportant, so they defined the Average Percentage of Zeros (APoZ) to measure the percentage of activations of a neuron and cropped the neurons with fewer activations. Li et al. [28] introduced a structured pruning method that measures the norm of filters to remove unimportant ones. Luo et al. [29] took advantage of a subset of input channels to approximate the output for compressing convolutional layers. Changpinyo et al. [30] proposed a random method to compress CNNs, randomly connecting the output channels to a small subset of the input channels. Though successful to an extent, their method did not directly relate to the loss, hence it was necessary to retrain the network to recover accuracy. On the other hand, such a scheme can only be used layer by layer, so it is essential to iterate over and over to prune, which results in massive computation costs.

Ding et al. [37] applied a customized L2 regularization to remove unimportant filters and simultaneously stimulate important filters to grow stronger. Lin et al. [32] proposed a Global & Dynamic Filter Pruning (GDP) method, which could dynamically recover previously removed filters. Liu et al. [33] enforced channel-level sparsity in the network to compress DNNs in the training phase. In addition, Gordon et al. [39] iteratively shrank and expanded a network targeting the reduction of particular resources (e.g. FLOPs, or the number of parameters).

III. THE APPROACH OF STRUCTURED PRUNING FOR CNNs
A. NOTATIONS
First of all, the notations are clarified in this section. A CNN is a multi-layer deep feed-forward neural network, which is composed of a stack of convolutional layers, pooling layers, and fully-connected layers. In an l-layer CNN model, W_l^k ∈ R^{d×d×C_{l-1}} represents the k-th filter of layer l, C_{l-1} denotes the number of feature maps in layer l-1, and d indicates the kernel size. Let us denote the feature maps in layer l by Z_l ∈ R^{H_l×W_l×C_l}, where H_l×W_l is their size, C_l is the number of channels, and Z_l is the output of layer l. In addition, Z_l^k represents the k-th feature map of layer l. The output feature map Z_l^k can be computed as:

Z_l^k = f(Z_{l-1} * W_l^k + b_l^k),    (1)

where f(·) is a non-linear activation function, * is the convolutional operation and b_l^k is the bias. D = {X = {x_1, x_2, ..., x_N}, Y = {y_1, y_2, ..., y_N}} represents the training set, where x_i and y_i represent the training sample and its label respectively, and N indicates the number of samples.

B. THE PROPOSED SCHEME
The goal of pruning is to remove those redundant filters or neurons which are unimportant or useless for the performance of the network. Essentially, the main role of the convolutional layer filters is to extract local features. However, once all the parameters of a filter are zeroed, the filter is confirmed unimportant: whatever the inputs to the filter, the outputs are always zero, and under this circumstance the filter is unable to extract any information. When a filter is multiplied by zero, all the parameters of the filter become zero. Based on this observation, a mask is introduced for every filter to estimate its importance. This can be formulated as:

Z_l^k = f(Z_{l-1} * (W_l^k · m_l^k) + b_l^k),    (2)

where m_l^k represents the k-th mask of layer l. Therefore, the problem of zeroing out the values of some filters can be transformed into zeroing some masks. For this purpose, the following optimization problem is proposed:

min_W L(Y, F(X; W, m))   s.t.   ||m||_0 ≤ C,    (3)

where L(·) is a loss function, such as the cross-entropy loss, F(·) is the output of the CNN and C is a hyper-parameter that controls the number of pruned filters. Equation (3) is the core of the proposed method. Once the optimal solution of the equation is obtained, the pruning is achieved.
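To make the masked formulation of equation (2) concrete, the following is a minimal sketch in TensorFlow/Keras (the library the experiments in Section IV use); the class and variable names are illustrative assumptions, not the authors' released code. Each filter of the convolution is scaled by a trainable scalar mask, so driving a mask to zero silences the whole filter.

import tensorflow as tf

class MaskedConv2D(tf.keras.layers.Layer):
    """Convolutional layer whose k-th filter is scaled by a trainable scalar mask (Eq. (2))."""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.kernel_size = kernel_size

    def build(self, input_shape):
        c_in = int(input_shape[-1])
        k = self.kernel_size
        # W_l: d x d x C_{l-1} x C_l kernel and one bias per filter.
        self.kernel = self.add_weight(name="kernel", shape=(k, k, c_in, self.filters),
                                      initializer="glorot_uniform", trainable=True)
        self.bias = self.add_weight(name="bias", shape=(self.filters,),
                                    initializer="zeros", trainable=True)
        # One scalar mask per filter, initialized to 1 as in Algorithm 1.
        self.mask = self.add_weight(name="mask", shape=(self.filters,),
                                    initializer="ones", trainable=True)

    def call(self, x):
        masked_kernel = self.kernel * self.mask        # broadcasts over the filter axis
        y = tf.nn.conv2d(x, masked_kernel, strides=1, padding="SAME") + self.bias
        return tf.nn.relu(y)                           # f(.) in Eq. (2)

A fully-connected analogue for equation (5) below follows the same pattern, with the mask vector multiplying the output units of the weight matrix.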
In addition, this method can also remove redundant neurons in a fully-connected layer. The inference of a fully-connected layer can be represented by:

Z_l = f(W_l Z_{l-1} + b_l),    (4)
where W_l ∈ R^{m×n} is a weight matrix and Z_{l-1} ∈ R^{n×1} is the input of the l-th layer. When a mask is introduced for fully-connected layers, their inference can be reformulated as:

Z_l = f(W_l Z_{l-1} ⊙ m_l + b_l),    (5)

where m_l ∈ R^m is a mask vector and ⊙ is the Hadamard product operator.

Equation (3) can be transformed into the following form based on the Lagrange multiplier:

min_W L(Y, f(X; W, m)) + λ||m||_0,    (6)

where λ is a coefficient associated with C. Equation (6) is an NP-hard problem because of the zero norm, so it is quite difficult to obtain an optimal solution of equation (6). Therefore, the l1-norm is adopted to replace the l0-norm, as:

min_W L(Y, f(X; W, m)) + λ||m||_1.    (7)

Equation (7) can be solved by SGD in practical applications, so the proposed method is simple and easy to implement. We just need to introduce a mask for each layer and train the network. Though the proposed method introduces masks, the network topology is preserved because the mask can be absorbed into the weights.
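One possible way to realize the objective of equation (7) is sketched below, under the assumption that every mask variable is created with "mask" in its name (as in the layer sketch above); the value of λ is illustrative.

import tensorflow as tf

LAMBDA = 0.005  # illustrative value of the coefficient lambda in Eq. (7)

def pruning_objective(model, x, y):
    """Cross-entropy loss plus LAMBDA * ||m||_1 over every mask variable (Eq. (7))."""
    cross_entropy = tf.keras.losses.sparse_categorical_crossentropy(y, model(x, training=True))
    masks = [w for w in model.trainable_weights if "mask" in w.name]
    l1_of_masks = tf.add_n([tf.reduce_sum(tf.abs(m)) for m in masks])
    return tf.reduce_mean(cross_entropy) + LAMBDA * l1_of_masks

Equivalently, attaching tf.keras.regularizers.l1(LAMBDA) to each mask variable when it is created lets a standard compile/fit loop with SGD minimize the same objective.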
C. THRESHOLD
L1 regularization is a widely used sparsity technology, which pushes the coefficients of uninformative features towards zero, so a sparse network is achieved by solving equation (7). However, there is a problem in solving equation (7): the mask values cannot be completely zeroed in practical applications, because the objective function (7) is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of a mask value is small enough, it can be considered almost zero. Thus, to decide whether a mask is zero, a threshold is introduced. However, considering only the value of the mask is meaningless if the mask is not completely zero, because there is a linear transformation between the mask and the convolution: one can shrink the masks while expanding the weights to keep their product the same. Hence, considering the mask and the weights simultaneously is necessary. The average value of the product of the mask and the weights is used to determine whether the mask is set exactly to zero or not. The specific definition can be presented as:

m_l^k = m_l^k   if |E(m_l^k · w_l^k)| ≥ ε,
m_l^k = 0       if |E(m_l^k · w_l^k)| < ε,    (8)

where ε is a pre-defined threshold and E(·) is the average operation. This strategy is efficient and reasonable, which is proved by the results of the experiments.
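As a small illustration of equation (8), the NumPy sketch below zeroes the masks of filters whose average masked weight is below the threshold; the kernel layout (k, k, C_{l-1}, C_l) and the helper name are assumptions.

import numpy as np

EPSILON = 0.01  # the threshold value used for LeNet-5 and VGG-16 in Section IV

def threshold_masks(kernel, mask, eps=EPSILON):
    """Eq. (8): zero a filter's mask when |E(m_k * w_k)| falls below the threshold.
    kernel is assumed to have shape (k, k, C_in, C_out) and mask shape (C_out,)."""
    scores = np.abs(np.mean(kernel * mask, axis=(0, 1, 2)))  # |E(m_k * w_k)| per filter
    new_mask = mask.copy()
    new_mask[scores < eps] = 0.0
    return new_mask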
Algorithm 1 The Proposed Pruning Approach
Input: Training data D, a CNN model, threshold ε, penalty factor C, mask m.
DO:
1. Initialize the weights W and the mask m = 1.
2. Train the CNN with the mask, for a suitable C.
3. Prune the filters or neurons based on the value of the mask.
4. Fine-tune the network by retraining.
End
Merge the weights and masks and then remove the mask layer.
Return the pruned network architecture and the preserved weights.

D. FINE-TUNING AND OTHER REGULARIZATION STRATEGIES
Pruning may temporarily lead to a degradation in accuracy, so fine-tuning is necessary to improve accuracy. Furthermore, the proposed method can be employed iteratively to obtain a narrower architecture. Actually, a single iteration of the proposed method is enough to yield noticeable compaction. The method is elaborated in Algorithm 1.
Essentially, the purpose of this approach is to adjust some masks to an adequately small order of magnitude. Therefore, L2 regularization can also serve as a regularization strategy in this approach.
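The final step of Algorithm 1, merging the weights and masks and then removing the mask layer, might look like the following sketch (an assumption about array layout, not the authors' code): the mask is folded into the kernel and the filters whose mask is zero are dropped.

import numpy as np

def merge_and_prune(kernel, bias, mask):
    """Absorb the mask into the weights (W <- W * m) and drop zero-mask filters,
    mirroring the final step of Algorithm 1. Array layouts are assumptions."""
    folded = kernel * mask                     # mask broadcasts over the filter axis
    keep = np.flatnonzero(mask != 0.0)         # indices of surviving filters
    # The same 'keep' indices must also slice the input-channel axis (axis 2)
    # of the next layer's kernel so that the shapes stay consistent.
    return folded[..., keep], bias[keep], keep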
IV. EXPERIMENTS
The approach was primarily evaluated with three networks: LeNet-5 on the MNIST dataset, VGG-16 on the CIFAR-10 dataset and ResNet-32 on the CIFAR-10 dataset. The implementation of this approach was accomplished with the standard Keras library. All experiments were conducted on an Intel E5-2630 V4 CPU and an NVIDIA 1080Ti GPU.

A. DATASETS
1) MNIST
The MNIST dataset of handwritten digits from 0 to 9 is widely used to evaluate machine learning models. This dataset has 60000 training samples and 10000 test samples.

2) CIFAR-10
The CIFAR-10 dataset [41] has a total of 60000 images consisting of 10 classes, each having 6000 images with 32x32 resolution. There are 50000 training images and 10000 test images. During training, a data augmentation scheme was adopted, which contained random horizontal flips, rotations, and translations. The input data were normalized using the means and standard deviations.
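The augmentation and normalization described above could be reproduced with standard Keras utilities roughly as follows; the rotation and shift amounts are illustrative assumptions, since the paper does not report them.

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train.astype("float32"), x_test.astype("float32")
mean, std = x_train.mean(axis=(0, 1, 2)), x_train.std(axis=(0, 1, 2))
x_train, x_test = (x_train - mean) / std, (x_test - mean) / std   # per-channel normalization

augmenter = ImageDataGenerator(horizontal_flip=True,     # random horizontal flip
                               rotation_range=15,        # rotation (degrees, assumed)
                               width_shift_range=0.1,    # translation (fractions, assumed)
                               height_shift_range=0.1)
# train_iter = augmenter.flow(x_train, y_train, batch_size=128)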
B. NETWORK MODELS
1) LENET-5
LeNet-5 is a convolutional neural network designed by LeCun et al. [34]. It has two convolutional and two fully-connected layers, with 44.2K learnable parameters. In this network, dropout is used in the fully-connected layers.
TABLE 1. The results of LeNet-5 on MNIST.
2) VGG-16
The original VGG-16 [35] has thirteen convolutional and two fully-connected layers, with 130M learnable parameters. However, VGG-16 is very complex for the CIFAR-10 dataset, so the fully-connected layers were removed. Moreover, Batch Normalization was used after each convolution operation. The modified model has 14.7M learnable parameters.

3) RESNET-32
The deep residual network (ResNet) [42] is a state-of-the-art CNN architecture. In this paper, ResNet-32 was implemented to evaluate the proposed method. The ResNet-32 used here had the same architecture as described in [42], which contained three stages of convolutions, one global average pooling after the last convolutional layer and one fully-connected layer. In addition, when the dimensions increased, a 1x1 convolution was adopted as the identity mapping to match the dimensions. This network has 0.47M learnable parameters.

C. THE DETAILS OF TRAINING, PRUNING, AND FINE-TUNING
To obtain the baseline accuracy in the experiments, we trained LeNet-5 on MNIST, VGG-16 on CIFAR-10, and ResNet-32 on CIFAR-10 from scratch. Then, pruning was performed on the basis of the trained network, with L1 regularization chosen as the regularization strategy and the mask initialized to 1. Finally, we retrained the pruned network to recover accuracy.

1) LENET-5 ON MNIST
The original network was normally trained from scratch, for a total of 30 epochs, by Adam [43] with a batch size of 128. The learning rate was initialized to 0.001, the weight decay was set to 0.0005, the momentum was set to 0.9, and the dropout rate was set to 0.5 for the fully-connected layer. While implementing the pruning training, only the number of epochs was modified: the epochs were set to 10 and the threshold mentioned above to select pruned filters was set to 0.01. The pruned network was then retrained to compensate for the loss of accuracy, adopting the same hyper-parameter settings as in normal training.

2) VGG-16 ON CIFAR-10
To get the baseline accuracy, the network was normally trained from scratch by SGD with a batch size of 128. The total epochs were set to 60. The initial learning rate was set to 0.01 and then scaled by 0.1 every 20 epochs. The weight decay was set to 0.0005 and the momentum to 0.9. While implementing the pruning training, the epochs were set to 30, the learning rate was scaled by 0.1 every 10 epochs and the other settings remained the same, while the threshold was set to 0.01. Finally, the pruned model was retrained following the same pre-processing and hyper-parameter settings as the normal training.

3) RESNET-32 ON CIFAR-10
Generally, the network was trained from scratch by SGD as the baseline with a batch size of 128. The weight decay was set to 0.0001, the epochs were set to 120, and the momentum was set to 0.9. The initial learning rate was set to 0.1 and then scaled by 0.1 at 60 and 100 epochs. Here, for the pruning training, the epochs were set to 30, the learning rate was scaled by 0.1 every 10 epochs and the other settings remained the same. After pruning, the network was retrained from scratch. The epochs were modified to 60 and the learning rate was scaled by 0.1 every 20 epochs.
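The step schedule used for the VGG-16 baseline (initial rate 0.01, scaled by 0.1 every 20 epochs, momentum 0.9) can be expressed with a Keras callback as sketched below; attaching the 0.0005 weight decay as an L2 kernel regularizer on each layer is an implementation assumption, not something the paper specifies.

import tensorflow as tf

def step_decay(epoch, lr=None):
    """Initial rate 0.01, multiplied by 0.1 every 20 epochs (the VGG-16 baseline schedule)."""
    return 0.01 * (0.1 ** (epoch // 20))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=60, callbacks=[lr_schedule])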
D. RESULTS OF THE EXPERIMENTS
1) LENET-5 ON MNIST
As per the results in TABLE 1, 88.84% of the parameters were removed without any impact on performance. Based on the proposed method, 95.46% of the parameters could also be discarded with an accuracy loss of only 0.57%.
TABLE 2. Results of VGG-16 on the CIFAR-10 dataset.
TABLE 1 also reveals that there was enormous redundancy in the fully-connected layers, because at least 90% of the parameters of the fully-connected layers could easily be dropped. According to the table, the proposed method may indeed find the important connections. The reasons can be summarized in two points. First, when 83.83% of the parameters are removed, the accuracy does not change. This indicates that the pruned parameters are unimportant for maintaining the accuracy of the network. Second, it is difficult to remove some filters or neurons, especially the neurons of the fully-connected layers, when the pruning rate gradually increases. So the remaining connections are crucial.

In addition, the convolutional layers, especially the first one, are hard to prune in comparison with the later layers. A possible explanation is that the proposed method automatically selects the unimportant filters through the backpropagation algorithm. However, backpropagation causes the earlier layers to suffer from the vanishing gradient problem. That is why the former layers are hard to prune compared to the later ones.

2) VGG-16 ON CIFAR-10
As depicted in TABLE 2, over 94.4% of the parameters could be removed with a negligible accuracy loss of 0.51%. It can also be observed that the loss of accuracy was only 2.04% when 97.76% of the parameters were pruned. The proposed method proved to be effective again in reducing redundancy.

In fact, preserving the remaining architecture without retaining the parameters (training the pruned network from scratch) is also a strategy to fine-tune the network. This strategy was adopted here to retrain the network and the results were promising, as shown in TABLE 2. The results reveal that a better effect can be achieved by directly retraining the pruned network from scratch. Perhaps the significance of the proposed method is that it furnishes the facility to discover excellent architectures, as mentioned by Liu et al. [36] as well. Nevertheless, training a pruned network from scratch is expensive in terms of computation cost, especially in the case of large-scale datasets and networks.

FIGURE 2. Comparison of L1 regularization and L2 regularization. The accuracy loss represents the difference in accuracy between the pruned CNN and the original CNN. A positive value indicates an improvement of network accuracy after pruning, while a negative value indicates a decrease in accuracy.

3) RESNET-32 ON CIFAR-10
Pruning ResNet-32 based on the order of magnitude of the mask may result in different output map dimensions in the residual module, so a 1x1 convolution is needed as the identity mapping to match dimensions. However, this operation brings extra parameters and computation. To avoid this problem, a percentile was defined to remove filters of the same proportion in every convolutional layer. TABLE 3 shows that the proposed method removed 34% of the parameters with an accuracy loss of 0.65%. Moreover, over 62.3% of the parameters could also be discarded with an accuracy loss of 1.76%. Thus, it was confirmed that the proposed method could reduce the redundancy of a complex network, i.e. ResNet.
FIGURE 3. The comparison of pruned and reserved filters. (a) The comparison of the order of magnitude of the parameters of pruned and reserved filters. The x-axis represents the distribution interval and the y-axis represents the percentage of the parameters in the interval. (b) The comparison of non-zero activations. The left bar represents the average non-zero activation percentage, and the right bar represents the average non-zero activation value.
TABLE 3. Results of ResNet-32 on the CIFAR-10 dataset.

V. ANALYSIS
A. L2 REGULARIZATION
L2 regularization was also explored as a regularization strategy in this study. As shown in FIGURE 2, LeNet-5 can also be compressed without degrading accuracy based on L2 regularization. Nevertheless, there is some difference between L1 regularization and L2 regularization. Both L1 and L2 regularization can improve accuracy when the pruning rate is less than 84%, but the effect of L2 regularization is better. The main reason is that regularization techniques can prevent overfitting and improve generalization ability. Moreover, as the pruning rate increases, L1 regularization achieves a greater compression effect at the same accuracy. As per Han et al. [24], L1 regularization pushes more parameters closer to zero, so it can prune more parameters. Having studied the difference between L1 regularization and L2 regularization, the inclination is more towards L1 regularization from the perspective of the compression and accuracy trade-off.

B. THE EFFECT OF PRUNING
To better describe the effect of the proposed method, a comparison was made between the pruned filters and the reserved filters. The CONV3-1 layer of VGG-16, which has 256 filters, was chosen, with ε set to 0.008. Based on this setting, 125 filters of the CONV3-1 layer could be removed. Empirically, a weak filter or neuron always has lower activation outputs, lower activation frequency, and lower weight values. Hence, weight values and activation outputs were chosen here to evaluate the difference between pruned and preserved filters.

As shown in FIGURE 3(a), the bulk of the values of the pruned parameters, 96.9%, are less than 10^-6 in terms of absolute weight value. However, most of the values of the reserved parameters, 94.5%, are greater than 0.001. The results indicate an enormous distribution difference between the values of the pruned and the reserved parameters. Therefore, the present approach can effectively reduce the order of magnitude of the pruned parameters.

In addition, the test set was chosen as a sample to calculate the average non-zero activation values and percentages of CONV3-1. As is obvious from FIGURE 3(b), both the average percentage of non-zero activations and the average value of non-zero activations of the pruned filters were much lower than those of the reserved filters. From the activation perspective, the pruned filters were weak, because the output and weight values of the pruned filters were negligible compared with those of the reserved filters and could be completely ignored. Thus, using the order of magnitude of the mask to determine pruned filters or neurons was reasonable.
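The activation statistics reported here (average non-zero activation percentage and value for CONV3-1) can be computed along the following lines, assuming a functional Keras model and an illustrative layer name; this is a sketch, not the authors' evaluation script.

import numpy as np
import tensorflow as tf

def nonzero_activation_stats(model, layer_name, x):
    """Per-filter average non-zero activation percentage and value for one layer,
    computed over a batch of inputs (a functional Keras model is assumed)."""
    probe = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    acts = probe.predict(x, verbose=0)                     # shape (N, H, W, C)
    nonzero = acts > 0
    pct = nonzero.mean(axis=(0, 1, 2))                     # average non-zero percentage
    mean_val = acts.sum(axis=(0, 1, 2)) / np.maximum(nonzero.sum(axis=(0, 1, 2)), 1)
    return pct, mean_val

# Example (layer name is an assumption): pct, val = nonzero_activation_stats(vgg, "conv3_1", x_test)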
C. COMPARISON WITH OTHER METHODS
In this section, two classical structured pruning methods were compared with the proposed method. First, for LeNet-5 on the MNIST dataset, the proposed method was compared with that of Wen et al. [31]. In this experiment, both the proposed method and that of Wen et al. [31] adopted the same coefficient of sparsity regularization (λ = 0.03). The results (TABLE 5) show that both methods were analogous in terms of accuracy and compression effect. However, the proposed method is simpler and costs less computation in practice. Further, the proposed method was also compared with that of Liu et al. [33] for VGG-16 on CIFAR-10.
TABLE 4. Comparison of VGG-16 on CIFAR-10.

TABLE 5. Comparison of LeNet-5 on MNIST.

Again, the same sparsity regularization coefficient (λ = 0.005) was adopted for both methods. However, Liu et al. [33] adopted a fixed percentage threshold setting, whereas the threshold setting scheme of the proposed method is different. The results (in TABLE 4) reveal that the proposed method was superior in terms of compression efficiency, although there was a slight loss of accuracy. In general, the proposed method can not only generate sparsity but also achieve a better pruning effect with its improved threshold.

Nevertheless, some shortcomings were also observed with this approach. One is that, though this approach does not change the existing CNN architecture, the added mask layer essentially increases the number of layers in the network, which may increase the optimization difficulty. However, this problem can be solved by Batch Normalization (BN [38]). The other is that, as this method introduces a threshold, the pruning effect may not be smooth. The pruning rate may change drastically with small changes in ε, which is not conducive to finding the best ε.

VI. CONCLUSION
In this article, a structured pruning technology is proposed to automatically tailor redundant filters or neurons based on regularization. A mask is introduced to remove unimportant filters or neurons by zeroing the values of some masks during training. In addition, to deal with the problem that the mask cannot be completely zeroed in practice, a threshold is designed to zero the mask. Experimentation with multiple datasets has proved that the proposed method can effectively remove parameters with a negligible loss of accuracy. In the future, establishing a relation between the hyper-parameter and the pruning rate will be considered, to facilitate the adjustment of the hyper-parameter.

ACKNOWLEDGMENT
All the mentioned support is gratefully acknowledged.

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097-1105.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680.
[5] C. Shen, Y. Li, Y. Chen, X. Guan, and R. Maxion, "Performance analysis of multi-motion sensor behavior for active smartphone authentication," IEEE Trans. Inf. Forensics Security, vol. 13, no. 1, pp. 48-62, Jan. 2018.
[6] C. Shen, Y. Chen, X. Guan, and R. Maxion, "Pattern-growth based mining mouse-interaction behavior for an active user authentication system," IEEE Trans. Dependable Secure Comput., to be published.
[7] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," 2017, arXiv:1710.09282. [Online]. Available: https://arxiv.org/abs/1710.09282
[8] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," 2015, arXiv:1511.06067. [Online]. Available: https://arxiv.org/abs/1511.06067
[9] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2285-2294.
[10] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," 2014, arXiv:1412.6115. [Online]. Available: https://arxiv.org/abs/1412.6115
[11] Z. Tian, S. Su, W. Shi, X. Du, M. Guizani, and X. Yu, "A data-driven method for future Internet route decision modeling," Future Gener. Comput. Syst., vol. 95, pp. 212-220, Jun. 2018.
[12] Z. Tian, W. Shi, Y. Wang, C. Zhu, X. Du, S. Su, Y. Sun, and N. Guizani, "Real-time lateral movement detection based on evidence reasoning network for edge computing environment," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4285-4294, Jul. 2019.
[13] R. Liu, N. Fusi, and L. Mackey, "Teacher-student compression with generative adversarial networks," 2018, arXiv:1812.02271. [Online]. Available: https://arxiv.org/abs/1812.02271
[14] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598-605.
[15] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360. [Online]. Available: https://arxiv.org/abs/1602.07360
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861
[17] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848-6856.
[18] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, p. 32, 2017.
[19] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 1389-1397.
[20] J.-H. Luo and J. Wu, "An entropy-based pruning method for CNN compression," 2017, arXiv:1706.05791. [Online]. Available: https://arxiv.org/abs/1706.05791
[21] R. Tibshirani, "Regression selection and shrinkage via the lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[22] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164-171.
[23] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379-1387.
[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135-1143.
[25] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: https://arxiv.org/abs/1510.00149
[26] Z. Liu, J. Xu, X. Peng, and R. Xiong, "Frequency-domain dynamic pruning for convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1043-1053.
[27] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," 2016, arXiv:1607.03250. [Online]. Available: https://arxiv.org/abs/1607.03250
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," 2016, arXiv:1608.08710. [Online]. Available: https://arxiv.org/abs/1608.08710
[29] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 5058-5066.
[30] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The power of sparsity in convolutional neural networks," 2017, arXiv:1702.06257. [Online]. Available: https://arxiv.org/abs/1702.06257
[31] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074-2082.
[32] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in Proc. IJCAI, 2018, pp. 2425-2432.
[33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 2736-2744.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[36] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," 2018, arXiv:1810.05270. [Online]. Available: https://arxiv.org/abs/1810.05270
[37] X. Ding, G. Ding, J. Han, and S. Tang, "Auto-balanced filter pruning for efficient convolutional neural networks," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 6797-6804.
[38] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: https://arxiv.org/abs/1502.03167
[39] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1586-1595.
[40] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. B (Statist. Methodol.), vol. 68, no. 1, pp. 49-67, 2006.
[41] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770-778.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980

CHEN YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. His research concerns general deep learning and machine learning, and his main research interest is deep model compression.

ZHENGHONG YANG received the master's and Ph.D. degrees from Beijing Normal University, in 1990 and 2001, respectively. He is currently a Professor with the College of Science, China Agricultural University. He has presided over two projects of the National Natural Science Foundation. He has written two teaching and research books and has published more than 40 academic papers in domestic and foreign journals, among them, about 30 are cited by SCI/EI/ISTP. His major research interests include matrix theory, numerical algebra, image processing, and so on. He is a member of the Beijing and Chinese Society of Computational Mathematics.

ABDUL MATEEN KHATTAK received the Ph.D. degree in horticulture and landscape from the University of Reading, U.K., in 1999. He was a Research Scientist in different agriculture research organizations before joining the University of Agriculture, Peshawar, Pakistan, where he is currently a Professor with the Department of Horticulture. He has conducted academic and applied research on different aspects of tropical fruits, vegetables, and ornamental plants. He has also worked for Alberta Agriculture and Forestry, Canada, as a Research Associate, and the Organic Agriculture Centre of Canada as a Research and Extension Coordinator for Alberta province. There he helped in developing organic standards for greenhouse production and energy saving technologies for Alberta greenhouses. He is a Professor with considerable experience in teaching and research. He is currently a Visiting Professor with the College of Information and Electrical Engineering, China Agricultural University, Beijing. He has published 59 research articles in scientific journals of international repute. He has also attended and presented in several international scientific conferences. His research interests include greenhouse production; medicinal, aromatic and ornamental plants; light quality; supplemental lighting; temperature effects on greenhouse crops; aquaponics; and organic production.

LIU YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interests include the application of image recognition and intelligent robots in the field of agriculture.

WENXIN ZHANG is currently pursuing the master's degree with the School of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interest includes pose estimation methods for pigs based on deep learning, for timely access to pig information.

WANLIN GAO received the B.S., M.S., and Ph.D. degrees from China Agricultural University, in 1990, 2000, and 2010, respectively. He is currently the Dean of the College of Information and Electrical Engineering, China Agricultural University. He has been the principal investigator (PI) of over 20 national plans and projects. He has published 90 academic papers in domestic and foreign journals, among them, over 40 are cited by SCI/EI/ISTP. He has written two teaching materials, which are supported by the National Key Technology Research and Development Program of China during the 11th Five-Year Plan Period, and five monographs. He holds 101 software copyrights, 11 patents for inventions, and eight patents for new practical inventions. His major research interests include the informationization of new rural areas, intelligent agriculture, and the service for rural comprehensive information. He is a member of the Science and Technology Committee of the Ministry of Agriculture, a member of the Agriculture and Forestry Committee of Computer Basic Education in colleges and universities, and a Senior Member of the Society of Chinese Agricultural Engineering, etc.

MINJUAN WANG received the Ph.D. degree from the School of Biological Science and Medical Engineering, Beihang University, under the supervision of Prof. Hong Liu, in June 2017. She was a Visiting Scholar with the School of Environmental Science, Ontario Agriculture College, University of Guelph, from October 2015 to May 2017. She is currently a Postdoctoral Fellow with the College of Information and Electrical Engineering, China Agricultural University. Her research interests mainly include bioinformatics and the Internet of Things key technologies.