Received July 24, 2019, accepted July 30, 2019, date of publication August 5, 2019, date of current version August 15, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2933032

Structured Pruning of Convolutional Neural Networks via L1 Regularization

CHEN YANG 1,2, ZHENGHONG YANG 1,2, ABDUL MATEEN KHATTAK 2,3, LIU YANG 1,2, WENXIN ZHANG 1,2, WANLIN GAO 1,2, AND MINJUAN WANG 1,2
1 Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
2 College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
3 Department of Horticulture, The University of Agriculture, Peshawar 25120, Pakistan
Corresponding authors: Wanlin Gao (wanlin_cau@163.com) and Minjuan Wang (minjuan@cau.edu.cn)
This work was supported by the Project of Scientific Operating Expenses from the Ministry of Education of China under Grant 2017PT19.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Shen.

ABSTRACT Deep learning architectures have achieved remarkable success in many areas owing to recent advances in convolutional neural networks (CNNs). However, real-time applications of CNNs are seriously hindered by their substantial storage and computational costs. Structured pruning is a promising way to compress and accelerate CNNs, and it needs no special hardware or software for auxiliary calculation. Here, a simple structured pruning strategy is proposed that crops unimportant filters or neurons automatically during the training stage. The proposed method introduces a mask for every filter or neuron to evaluate its importance; the filters or neurons whose masks become zero are removed. To achieve this, L1 regularization is applied to zero out filters or neurons of CNNs. Experiments were conducted to assess the validity of this technique. They show that the proposed approach can crop 90.4%, 95.6%, and 34.04% of the parameters of LeNet-5, VGG-16, and ResNet-32, respectively, with a negligible loss of accuracy.

INDEX TERMS Convolutional neural networks, regularization, structured pruning.

I. INTRODUCTION
In recent years, convolutional neural networks (CNNs) [1] have been applied successfully in many areas, such as image classification [2], object detection [3], neural style transfer [4], identity authentication [5], information security [6], speech recognition, and natural language processing. However, these achievements were obtained by leveraging large-scale networks with millions or even billions of parameters, which rely heavily on GPUs to accelerate computation. Devices with limited resources, such as mobile phones, FPGAs, or embedded devices, therefore have difficulty deploying CNNs in practical applications. Thus, it is critical to accelerate the inference of CNNs and to reduce their storage for a wide range of applications [7].

According to the studies done so far, the major approaches to compressing deep neural networks can be categorized into four groups: low-rank decomposition [8], parameter quantization [9], knowledge distillation [10]-[13], and network pruning [14]. For a deep neural network (DNN) that has already been trained, low-rank decomposition factorizes and approximates tensors with smaller ones to achieve compression; it yields an efficient speedup because it reduces the number of matrix elements. However, it can only decompose or approximate tensors layer by layer and cannot discover the redundant parameters of a DNN. In addition, much research has focused on compact network modules, which are smaller, more efficient, and more sophisticated. These models, such as SqueezeNet [15], MobileNet [16], and ShuffleNet [17], are basically built from low-resolution convolutions with fewer parameters and better performance.
At present, network pruning is a major focus of research because it not only accelerates DNNs but also reduces redundant parameters. In practice, directly using a large-scale network may provide state-of-the-art performance, so a large-scale network is learned first; however, the optimal network architecture may not be known, and massive redundancy therefore exists in large neural networks. To combat this problem, network pruning is useful for removing redundant parameters, filters, channels, or neurons, and for addressing the over-fitting issue.

FIGURE 1. The architecture of a layer with the mask. (a) The architecture of a convolutional layer with the mask. (b) The architecture of a fully-connected layer with the mask. The proposed approach chooses the unimportant filters and neurons (highlighted in yellow) by the order of magnitude of the mask value.

Network pruning techniques can be broadly categorized as structured pruning and non-structured pruning. Non-structured pruning aims to remove individual parameters that have little influence on the accuracy of the network, and it is efficient and effective for compacting networks. Nonetheless, non-structured pruning is difficult to apply widely in practice. In many prevalent deep learning frameworks the convolution operation is reformulated as a matrix-by-matrix multiplication, so non-structured pruning requires additional information to represent the pruned locations, and special hardware or software is needed to assist with the calculation, which may increase computation time. Instead, structured pruning directly removes entire filters, channels, or neurons, so the remaining network architecture can be used directly by existing hardware. For example, Anwar et al. [18] employed particle filtering to impose structured sparsity on convolutional neural networks at the channel-wise, kernel-wise, and intra-kernel stride levels.
At present, several structured pruning methods [24], [25], [27] are based mainly on the statistical information of parameters or activation outputs. These methods do not take the loss into account and are unable to remove parameters during training. In addition, some methods, such as [19], [20], require layer-by-layer iterative pruning and accuracy recovery, which involves enormous calculation. On the contrary, the proposed approach links pruning with the minimization of the loss and can be carried out during training.

It is inspiring that filters whose weights are all zero can be removed safely, because, whatever the input, they would not extract any features. This study presents a scheme that prunes filters, or neurons of fully-connected layers, based on L1 regularization [21] to zero out the weights of some filters or neurons. Similar to this method, Wen et al. [31] adopted group LASSO regularization [40] to zero out filters; however, in that approach all the weights are required to compute an extra gradient, which is computationally expensive for a large-scale network. Contrarily, in the proposed method a mask is introduced to address this issue, and the regularization term is only the l1-norm of the mask, whose gradient is easy to calculate. In this method, the parameters of filters or neurons are multiplied by a mask to pick out unimportant filters or neurons, and once the mask is zero the corresponding filter or neuron is removed. Although a mask is introduced for filters or neurons, the method does not change the architecture of the network, which allows other compression methods to be combined with the proposed technique. Similar to the proposed method, Lin et al. [32] also adopted a mask to identify unimportant filters or neurons, but the value of their mask could not be changed by training. In addition, removing unimportant filters or neurons may temporarily degrade accuracy, but the network can be retrained to recover its performance. FIGURE 1 shows the framework of the proposed method.

In this article, a structured pruning technique is presented that simultaneously learns and removes unimportant filters or neurons of CNNs. The main contributions are as follows:
- A simple yet effective method based on L1 regularization is presented to compress CNN models during the training stage.
- A threshold is adopted to solve the optimization problem of the l1-norm; in this approach, some mask values are only required to be near zero rather than exactly zero.
The details are provided in the following sections.

II. PREVIOUS WORK
The importance of compressing deep learning models before application is self-evident, especially for expanding the application scenarios of deep learning [11]. For example, a compressed deep learning model can be combined with edge computing [12] to enable Internet of Things devices to understand data. In this section, we review the contributions of others.

Le Cun et al. [14] first proposed a saliency measurement method called Optimal Brain Damage (OBD) to selectively delete weights using second-derivative information of the error function. Later, Hassibi and Stork [22] proposed the Optimal Brain Surgeon (OBS) algorithm based on OBD. OBS not only removes unimportant weights but also automatically adjusts the remaining weights, which improves accuracy and generalization ability. These methods are based on a Taylor expansion (OBD and OBS even require computing the Hessian matrix), which may be computationally intensive, especially for large networks, and they use a criterion of minimal increase in the error on the training data. Guo et al. [23] introduced a binary matrix to dynamically choose important weights.
Han et al. [24], [25] directly removed weights whose values were lower than a predefined threshold to compress networks, followed by retraining to recover accuracy. Considering that most filters in CNNs tend to be smooth in the spatial domain, Liu et al. [26] extended Guo's work to the frequency domain by applying the Discrete Cosine Transform (DCT) to filters in the spatial domain. However, these non-structured pruning technologies are hard to use in real applications, because extra software or hardware is required for the calculation.

Directly cropping a trained model by the value of its weights is a common method: an effective criterion is sought to judge the importance of weights, and unimportant connections or filters are cut to reduce the redundancy of the model. Hu et al. [27] observed that the activation outputs of a significant portion of neurons are zero in a large network, whatever inputs the network receives. These zero-activation neurons are unimportant, so they defined the Average Percentage of Zeros (APoZ) to measure the activation percentage of a neuron and cropped the neurons with fewer activations. Li et al. [28] introduced a structured pruning method that measures the norm of filters to remove unimportant ones. Luo et al. [29] took advantage of a subset of input channels to approximate the output for compressing convolutional layers. Changpinyo et al. [30] proposed a random method to compress CNNs: they randomly connected each output channel to a small subset of input channels. Though successful to an extent, their method does not relate directly to the loss, so the network must be retrained to recover accuracy; moreover, such a scheme can only be applied layer by layer, so the pruning must be iterated over and over, which results in massive computation costs.

Ding et al. [37] applied a customized L2 regularization to remove unimportant filters and simultaneously stimulate important filters to grow stronger. Lin et al. [32] proposed a Global & Dynamic Filter Pruning (GDP) method, which can dynamically recover previously removed filters. Liu et al. [33] enforced channel-level sparsity in the network to compress DNNs in the training phase. In addition, Gordon et al. [39] iteratively shrank and expanded a network targeting the reduction of particular resources (e.g., FLOPs or the number of parameters).

III. THE APPROACH OF STRUCTURED PRUNING FOR CNNs
A. NOTATIONS
First of all, the notations are clarified. A CNN is a multi-layer deep feed-forward neural network composed of a stack of convolutional layers, pooling layers, and fully-connected layers. In an l-layer CNN model, W_l^k \in R^{d \times d \times C_{l-1}} represents the k-th filter of layer l, where C_{l-1} denotes the number of feature maps in layer l-1 and d indicates the kernel size. Let the feature maps of layer l be Z_l \in R^{H_l \times W_l \times C_l}, where H_l \times W_l is their size, C_l is the number of channels, and Z_{l-1} is the output of layer l-1. In addition, Z_l^k represents the k-th feature map of layer l. The output feature map Z_l^k can be computed as

Z_l^k = f(Z_{l-1} \ast W_l^k + b_l^k),  (1)

where f is a non-linear activation function, \ast is the convolution operation, and b_l^k is the bias. D = \{X = \{x_1, x_2, \ldots, x_N\}, Y = \{y_1, y_2, \ldots, y_N\}\} represents the training set, where x_i and y_i denote a training sample and its label, respectively, and N indicates the number of samples.

B. THE PROPOSED SCHEME
The goal of pruning is to remove redundant filters or neurons that are unimportant or useless for the performance of the network. Essentially, the main role of convolutional-layer filters is to extract local features. However, once all the parameters of a filter are zero, the filter is confirmed to be unimportant: whatever the inputs to the filter, its outputs are always zero, so it is unable to extract any information. When a filter is multiplied by zero, all of its parameters become zero. Based on this observation, a mask is introduced for every filter to estimate its importance. This can be formulated as

Z_l^k = f(Z_{l-1} \ast (W_l^k m_l^k) + b_l^k),  (2)

where m_l^k represents the k-th mask of layer l. Therefore, the problem of zeroing out the values of some filters can be transformed into zeroing some masks. For this purpose, the following optimization problem is proposed:

\min_W L(Y, F(X; W, m))  s.t.  \|m\|_0 \le C,  (3)

where L(\cdot) is a loss function, such as the cross-entropy loss, F(\cdot) is the output of the CNN, and C is a hyper-parameter that controls the number of pruned filters.
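To make the role of the mask in (2) concrete, the following is a minimal sketch of a convolutional layer that carries one trainable mask per filter, written against the Keras API on which the experiments in Section IV are based. The layer name `MaskedConv2D` and its arguments are our own illustrative assumptions, not the authors' released code.

```python
import tensorflow as tf

class MaskedConv2D(tf.keras.layers.Layer):
    """Convolutional layer whose k-th filter is scaled by a trainable mask m_l^k, as in Eq. (2)."""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.kernel_size = kernel_size

    def build(self, input_shape):
        in_channels = int(input_shape[-1])
        # W_l: d x d x C_{l-1} x C_l
        self.kernel = self.add_weight(
            name="kernel",
            shape=(self.kernel_size, self.kernel_size, in_channels, self.filters),
            initializer="he_normal",
            trainable=True)
        self.bias = self.add_weight(
            name="bias", shape=(self.filters,), initializer="zeros", trainable=True)
        # One mask per filter, initialized to 1 (the value used in the experiments).
        self.mask = self.add_weight(
            name="mask", shape=(self.filters,), initializer="ones", trainable=True)

    def call(self, inputs):
        # Scaling the k-th filter by m_l^k; the mask broadcasts over the output-filter axis.
        masked_kernel = self.kernel * self.mask
        z = tf.nn.conv2d(inputs, masked_kernel, strides=1, padding="SAME")
        return tf.nn.relu(z + self.bias)
```

A fully-connected layer with one mask per output neuron can be sketched in the same way, by scaling the columns of the weight matrix before the matrix multiplication.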
Equation (3) is the core of the proposed method: once its optimal solution is obtained, the pruning is achieved.

In addition, this method can also remove redundant neurons in fully-connected layers. The inference of a fully-connected layer can be represented by

Z_l = f(W_l Z_{l-1} + b_l),  (4)

where W_l \in R^{m \times n} is a weight matrix and Z_{l-1} \in R^{n \times 1} is the input of layer l. When a mask is introduced for fully-connected layers, their inference can be reformulated as

Z_l = f((W_l Z_{l-1}) \odot m_l + b_l),  (5)

where m_l \in R^{m \times 1} is a mask vector and \odot is the Hadamard product operator.

Equation (3) can be transformed into the following form using a Lagrange multiplier:

\min_W L(Y, F(X; W, m)) + \lambda \|m\|_0,  (6)

where \lambda is a coefficient associated with C. Equation (6) is an NP-hard problem because of the zero norm, so it is quite difficult to obtain its optimal solution. Therefore, the l1-norm is adopted to replace the l0-norm:

\min_W L(Y, F(X; W, m)) + \lambda \|m\|_1.  (7)

Equation (7) can be solved by SGD in practical applications, so the proposed method is simple and easy to implement: we only need to introduce a mask for each layer and train the network. Although the proposed method introduces masks, the network topology is preserved, because each mask can be absorbed into the corresponding weights.
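As a rough illustration of how the relaxed objective (7) can be optimized with SGD, the sketch below adds the l1-norm of every mask variable to the task loss. It assumes the `MaskedConv2D` layer sketched above and a hypothetical penalty coefficient `lam`; it is one plausible realization, not the authors' implementation.

```python
import tensorflow as tf

def l1_mask_penalty(model, lam):
    """lambda * sum of |m| over every mask variable: the second term of Eq. (7)."""
    masks = [v for v in model.trainable_variables
             if v.name.split("/")[-1].startswith("mask")]
    return lam * tf.add_n([tf.reduce_sum(tf.abs(m)) for m in masks])

@tf.function
def train_step(model, optimizer, images, labels, lam=0.005):
    """One SGD step on the relaxed objective of Eq. (7)."""
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        task_loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, logits, from_logits=True))
        loss = task_loss + l1_mask_penalty(model, lam)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Because only the masks (one scalar per filter or neuron) enter the penalty, the extra gradient is cheap compared with penalizing all the weights, which is the point made above about group-LASSO-style approaches.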
C. THRESHOLD
L1 regularization is a widely used sparsity-inducing technique that pushes the coefficients of uninformative features toward zero, so a sparse network is obtained by solving equation (7). However, there is a problem in solving equation (7): in practice the mask values cannot be made exactly zero, because the objective (7) is non-convex and the global optimal solution may not be reached. A strategy is therefore adopted in the proposed method to solve this problem. If the order of magnitude of a mask value is small enough, the mask can be treated as zero, so a threshold is introduced to decide whether a mask is zero. However, considering only the value of the mask is meaningless when the mask is not exactly zero, because there is a linear transformation between the mask and the convolution: one can shrink the masks while expanding the weights and keep their product unchanged. Hence the mask and the weights must be considered simultaneously, and the average value of the product of the mask and the weights is used to determine whether a mask is set exactly to zero. The specific definition is

m_l^k = \begin{cases} m_l^k, & \text{if } |E(m_l^k w_l^k)| \ge \tau \\ 0, & \text{if } |E(m_l^k w_l^k)| < \tau, \end{cases}  (8)

where \tau is a pre-defined threshold and E(\cdot) is the average operation. This strategy is efficient and reasonable, as borne out by the results of the experiments.

Algorithm 1 The Proposed Pruning Approach
Input: training data D, CNN model, threshold \tau, penalty factor C, mask m.
Do:
1. Initialize the weights W and the masks m = 1.
2. Train the CNN with the masks for a suitable C.
3. Prune the filters or neurons based on the values of the masks.
4. Fine-tune the network by retraining.
End
Merge the weights and masks, then remove the mask layers.
Return the pruned network architecture and the preserved weights.

D. FINE-TUNING AND OTHER REGULARIZATION STRATEGIES
Pruning may temporarily lead to a degradation in accuracy, so fine-tuning is necessary to recover it. Furthermore, the proposed method can be employed iteratively to obtain a narrower architecture, although a single iteration is enough to yield noticeable compaction. The method is elaborated in Algorithm 1.

Essentially, the purpose of this approach is to push some masks to a sufficiently small order of magnitude. Therefore, L2 regularization can also serve as the regularization strategy in this approach.
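The pruning step of Algorithm 1 and the final merging step might look as follows for a single `MaskedConv2D` layer as sketched earlier. The rule of Eq. (8) zeroes a filter's mask when the mean absolute value of mask-times-weights falls below \tau, after which the mask can be folded into the kernel so that the extra mask layer disappears. The function name and weight ordering are assumptions about one plausible implementation, not the authors' code.

```python
import numpy as np

def prune_and_merge(layer, tau):
    """Apply the threshold rule of Eq. (8) to one masked conv layer,
    then fold the surviving masks into the kernel (the merging step of Algorithm 1)."""
    kernel, bias, mask = [w.numpy() for w in layer.weights]   # kernel: (d, d, C_in, C_out)
    # |E(m_l^k * w_l^k)| for every filter k.
    scores = np.abs(np.mean(kernel * mask, axis=(0, 1, 2)))
    mask = np.where(scores >= tau, mask, 0.0)                 # Eq. (8)
    keep = np.flatnonzero(mask)                               # indices of the preserved filters
    # The network only ever sees the product W * m, so the mask can be absorbed into the kernel.
    merged_kernel = (kernel * mask)[..., keep]
    merged_bias = bias[keep]
    return merged_kernel, merged_bias, keep
```

Building the narrower network also requires dropping the corresponding input channels of the following layer (and the matching inputs of the first fully-connected layer), which is omitted here for brevity; that knock-on reduction is what ultimately yields the parameter savings reported in Section IV.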
IV. EXPERIMENTS
The approach was evaluated primarily on three networks: LeNet-5 on the MNIST dataset, VGG-16 on the CIFAR-10 dataset, and ResNet-32 on the CIFAR-10 dataset. The implementation was based on the standard Keras library, and all experiments were conducted on an Intel E5-2630 V4 CPU and an NVIDIA 1080Ti GPU.

A. DATASETS
1) MNIST
The MNIST dataset of handwritten digits from 0 to 9 is widely used to evaluate machine learning models. It contains 60,000 training samples and 10,000 test samples.

2) CIFAR-10
The CIFAR-10 dataset [41] has a total of 60,000 images in 10 classes, with 6,000 images of 32x32 resolution per class. There are 50,000 training images and 10,000 test images. During training, a data augmentation scheme consisting of random horizontal flips, rotations, and translations was adopted, and the input data were normalized using the means and standard deviations.

B. NETWORK MODELS
1) LENET-5
LeNet-5 is a convolutional neural network designed by LeCun et al. [34]. It has two convolutional and two fully-connected layers, with 44.2K learnable parameters. In this network, dropout is used in the fully-connected layers.

2) VGG-16
The original VGG-16 [35] has thirteen convolutional and two fully-connected layers and 130M learnable parameters. However, VGG-16 is overly complex for the CIFAR-10 dataset, so the fully-connected layers were removed and Batch Normalization was used after each convolution operation. The modified model has 14.7M learnable parameters.

3) RESNET-32
The deep residual network (ResNet) [42] is a state-of-the-art CNN architecture. In this paper, ResNet-32 was implemented to evaluate the proposed method. The ResNet-32 used here has the same architecture as described in [42]: three stages of convolutional layers, one global average pooling layer after the last convolutional layer, and one fully-connected layer. When the dimensions increase, a 1x1 convolution is adopted as the identity mapping to match the dimensions. This network has 0.47M learnable parameters.

C. THE DETAIL OF TRAINING, PRUNING, AND FINE-TUNING
To obtain the accuracy baselines, we trained LeNet-5 on MNIST, VGG-16 on CIFAR-10, and ResNet-32 on CIFAR-10 from scratch. Pruning was then performed on the trained networks, with L1 regularization chosen as the regularization strategy and the masks initialized to 1. Finally, the pruned networks were retrained to recover accuracy.

1) LENET-5 ON MNIST
The original network was trained from scratch for a total of 30 epochs by Adam [43] with a batch size of 128. The learning rate was initialized to 0.001, the weight decay was set to 0.0005, the momentum was set to 0.9, and the dropout rate was set to 0.5 for the fully-connected layer. For the pruning training, only the number of epochs was modified: the epochs were set to 10, and the threshold mentioned above for selecting pruned filters was set to 0.01. The pruned network was then retrained to compensate for the loss of accuracy, with the same hyper-parameter settings as in normal training.

2) VGG-16 ON CIFAR-10
To get the baseline accuracy, the network was trained from scratch by SGD with a batch size of 128. The total number of epochs was set to 60. The initial learning rate was set to 0.01 and then scaled by 0.1 every 20 epochs. The weight decay was set to 0.0005 and the momentum to 0.9. For the pruning training, the epochs were set to 30, the learning rate was scaled by 0.1 every 10 epochs, the threshold was set to 0.01, and the other settings remained the same. Finally, the pruned model was retrained following the same pre-processing and hyper-parameter settings as the normal training.

3) RESNET-32 ON CIFAR-10
The network was trained from scratch by SGD as the baseline with a batch size of 128. The weight decay was set to 0.0001, the number of epochs to 120, and the momentum to 0.9. The initial learning rate was set to 0.1 and then scaled by 0.1 at 60 and 100 epochs. For the pruning training, the epochs were set to 30, the learning rate was scaled by 0.1 every 10 epochs, and the other settings remained the same. After pruning, the network was retrained from scratch, with the epochs modified to 60 and the learning rate scaled by 0.1 every 20 epochs.
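The "scale by 0.1 every N epochs" schedules above amount to a simple step decay. The sketch below shows one way to express this with a standard Keras callback; it reflects the reported hyper-parameters but is only an assumed reconstruction of the training setup, not the authors' script.

```python
import tensorflow as tf

def step_decay(initial_lr, drop=0.1, every=20):
    """Scale the learning rate by `drop` every `every` epochs."""
    def schedule(epoch, lr=None):
        return initial_lr * (drop ** (epoch // every))
    return tf.keras.callbacks.LearningRateScheduler(schedule)

# Hypothetical VGG-16 baseline setup (SGD, batch size 128, 60 epochs):
# model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=60, callbacks=[step_decay(0.01)])
```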
D. RESULTS OF THE EXPERIMENTS
1) LENET-5 ON MNIST

TABLE 1. The results of LeNet-5 on MNIST.

As per the results in TABLE 1, 88.84% of the parameters were removed without any impact on performance, and 95.46% of the parameters could be discarded with an accuracy loss of only 0.57%.

TABLE 1 also reveals that there is enormous redundancy in the fully-connected layers, because at least 90% of their parameters can easily be dropped. Accordingly, the proposed method does appear to find the important connections, for two reasons. First, when 83.83% of the parameters are removed, the accuracy does not change, which indicates that the pruned parameters are unimportant for maintaining the accuracy of the network. Second, it becomes difficult to remove further filters or neurons, especially the neurons of fully-connected layers, as the pruning rate increases, so the remaining connections are crucial. In addition, the convolutional layers, especially the first one, are harder to prune than the later layers. A possible explanation is that the proposed method selects the unimportant filters automatically through the backpropagation algorithm, and backpropagation causes the earlier layers to suffer from the vanishing-gradient problem; that is why the former layers are harder to prune than the later ones.

TABLE 2. Results of VGG-16 on the CIFAR-10 dataset.

2) VGG-16 ON CIFAR-10
As depicted in TABLE 2, over 94.4% of the parameters could be removed with a negligible accuracy loss of 0.51%, and the loss of accuracy was only 2.04% when 97.76% of the parameters were pruned. The proposed method thus again proved effective at reducing redundancy.

In fact, preserving the remaining architecture without retaining the parameters (training the pruned network from scratch) is another strategy for fine-tuning the network. This strategy was also adopted here to retrain the network, and the results, shown in TABLE 2, are promising: a better effect can be achieved by directly retraining the pruned network from scratch. Perhaps the significance of the proposed method is that it provides a way to discover excellent architectures, as mentioned by Liu et al. [36] as well. Nevertheless, training a pruned network from scratch is expensive in terms of computation cost, especially for large-scale datasets and networks.

TABLE 3. Results of ResNet-32 on the CIFAR-10 dataset.

3) RESNET-32 ON CIFAR-10
Pruning ResNet-32 based on the order of magnitude of the mask may result in different output map dimensions within a residual module, so a 1x1 convolution would be needed as the identity mapping to match the dimensions; however, this operation brings extra parameters and computation. To avoid this problem, a percentile was defined to remove the same proportion of filters in every convolutional layer. TABLE 3 shows that the proposed method removed 34% of the parameters with an accuracy loss of 0.65%; moreover, over 62.3% of the parameters could be discarded with an accuracy loss of 1.76%. Thus, it is confirmed that the proposed method can reduce the redundancy of a complex network such as ResNet.
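One plausible way to implement the per-layer percentile rule used for ResNet-32 is sketched below: every layer keeps the same fraction of its filters, ranked by its own mask-times-weight scores, so the channel counts inside each residual stage stay aligned. The function name and the default `prune_ratio` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def per_layer_keep_indices(score_list, prune_ratio=0.3):
    """score_list: one array of per-filter scores |E(m * w)| for each conv layer.
    Returns, for every layer, the indices of the filters to keep."""
    keep = []
    for scores in score_list:                       # one score array per layer
        cutoff = np.percentile(scores, 100.0 * prune_ratio)
        keep.append(np.flatnonzero(scores > cutoff))
    return keep
```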
V. ANALYSIS
A. L2 REGULARIZATION
L2 regularization was also explored as a regularization strategy in this study. As shown in FIGURE 2, LeNet-5 can also be compressed without degrading accuracy using L2 regularization. Nevertheless, there are differences between L1 and L2 regularization. Both can improve accuracy when the pruning rate is less than 84%, with L2 regularization giving the better effect; the main reason is that regularization techniques prevent over-fitting and improve the generalization ability. However, as the pruning rate increases, L1 regularization achieves a greater compression effect at the same accuracy. As noted by Han et al. [24], L1 regularization pushes more parameters closer to zero, so it can prune more parameters. Having studied the difference between L1 and L2 regularization, the inclination here is towards L1 regularization from the perspective of the compression-accuracy trade-off.

FIGURE 2. Comparison of L1 regularization and L2 regularization. "Accuracy loss" represents the difference in accuracy between the pruned CNN and the original CNN: a positive value indicates an improvement in network accuracy after pruning, while a negative value indicates a decrease.

B. THE EFFECT OF PRUNING
To better describe the effect of the proposed method, a comparison was made between the pruned filters and the reserved filters. The CONV3-1 layer of VGG-16, which owns 256 filters, was chosen, with \tau set to 0.008; under this setting, 125 filters of the CONV3-1 layer could be removed. Empirically, a weak filter or neuron always has lower activation outputs, lower activation frequency, and lower weight values, so weight values and activation outputs were chosen here to evaluate the difference between the pruned and the preserved filters.

FIGURE 3. The comparison of pruned and reserved filters. (a) The comparison of the parameters' order of magnitude between pruned and reserved filters; the x-axis represents the distribution interval and the y-axis the percentage of parameters in the interval. (b) The comparison of non-zero activations; the left bar represents the average non-zero activation percentage, and the right bar the average non-zero activation value.

As shown in FIGURE 3(a), the bulk of the pruned parameters (96.9%) have absolute weight values of less than 10^-6, whereas most of the reserved parameters (94.5%) are greater than 0.001. These results indicate an enormous difference between the value distributions of the pruned and the reserved parameters; the present approach can therefore effectively reduce the order of magnitude of the pruned parameters. In addition, the test set was used as a sample to calculate the average non-zero activation values and percentages of CONV3-1. As is obvious from FIGURE 3(b), both the average percentage of non-zero activations and the average value of the non-zero activations of the pruned filters are much lower than those of the reserved filters. From the activation perspective, the pruned filters are weak: their outputs and weight values are negligible compared with those of the reserved filters and can be ignored completely. Thus, using the order of magnitude of the mask to determine the pruned filters or neurons is reasonable.
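The activation statistics behind FIGURE 3(b) can be reproduced in outline as follows. Here `feature_maps` is assumed to be the CONV3-1 output collected over the test set, and `pruned_idx` / `kept_idx` are the filter index sets produced by the thresholding step; this is our reconstruction of the analysis, not code from the paper.

```python
import numpy as np

def activation_stats(feature_maps, filter_idx):
    """Average non-zero activation percentage and value for the selected filters.

    feature_maps: CONV3-1 outputs over the test set, shape (N, H, W, C).
    filter_idx:   indices of either the pruned or the reserved filters.
    """
    acts = feature_maps[..., filter_idx]
    nonzero = acts > 0
    pct_nonzero = float(nonzero.mean())
    mean_nonzero = float(acts[nonzero].mean()) if nonzero.any() else 0.0
    return pct_nonzero, mean_nonzero

# e.g. compare activation_stats(conv3_1_outputs, pruned_idx)
#      with    activation_stats(conv3_1_outputs, kept_idx)
```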
C. COMPARISON WITH OTHER METHODS
In this section, two classical structured pruning methods are compared with the proposed method. First, on LeNet-5 with the MNIST dataset, the proposed method was compared with that of Wen et al. [31]; both methods adopted the same sparsity regularization coefficient (\lambda = 0.03). The results (TABLE 5) show that the two methods are analogous in terms of accuracy and compression effect, but the proposed method is simpler and costs less computation in practice.

TABLE 5. Comparison of LeNet-5 on MNIST.

Further, the proposed method was compared with that of Liu et al. [33] on VGG-16 with CIFAR-10. Again, the same sparsity regularization coefficient (\lambda = 0.005) was adopted for both methods; however, Liu et al. [33] adopted a fixed-percentage threshold, whereas the threshold scheme of the proposed method is different. The results (TABLE 4) reveal that the proposed method is superior in terms of compression efficiency, although with a slight loss of accuracy. In general, the proposed method can not only generate sparsity but also achieve a better pruning effect with its improved threshold.

TABLE 4. Comparison of VGG-16 on CIFAR-10.

Nevertheless, some shortcomings were also observed with this approach. One is that, although the approach does not change the existing CNN architecture, the added mask layers essentially increase the number of layers in the network, which may increase the optimization difficulty; this problem can, however, be alleviated by Batch Normalization (BN) [38]. The other is that, since the method introduces a threshold, the pruning effect may not be smooth: the pruning rate may change drastically with small changes in \tau, which is not conducive to finding the best \tau.

VI. CONCLUSION
In this article, a structured pruning technique is proposed to automatically tailor redundant filters or neurons based on regularization. A mask is introduced to remove unimportant filters or neurons by zeroing the values of some masks during training. In addition, to deal with the problem that the masks cannot be made exactly zero in practice, a threshold is designed to zero the masks. Experiments on multiple datasets have shown that the proposed method can effectively remove parameters with a negligible loss of accuracy. In the future, establishing the relation between the hyper-parameter and the pruning rate will be considered, to facilitate the adjustment of the hyper-parameter.

ACKNOWLEDGMENT
All the mentioned support is gratefully acknowledged.

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097-1105.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680.
[5] C. Shen, Y. Li, Y. Chen, X. Guan, and R. Maxion, "Performance analysis of multi-motion sensor behavior for active smartphone authentication," IEEE Trans. Inf. Forensics Security, vol. 13, no. 1, pp. 48-62, Jan. 2018.
[6] C. Shen, Y. Chen, X. Guan, and R. Maxion, "Pattern-growth based mining mouse-interaction behavior for an active user authentication system," IEEE Trans. Dependable Secure Comput., to be published.
[7] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," 2017, arXiv:1710.09282. [Online]. Available: https://arxiv.org/abs/1710.09282
[8] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," 2015, arXiv:1511.06067. [Online]. Available: https://arxiv.org/abs/1511.06067
[9] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2285-2294.
[10] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," 2014, arXiv:1412.6115. [Online]. Available: https://arxiv.org/abs/1412.6115
[11] Z. Tian, S. Su, W. Shi, X. Du, M. Guizani, and X. Yu, "A data-driven method for future Internet route decision modeling," Future Gener. Comput. Syst., vol. 95, pp. 212-220, Jun. 2018.
[12] Z. Tian, W. Shi, Y. Wang, C. Zhu, X. Du, S. Su, Y. Sun, and N. Guizani, "Real-time lateral movement detection based on evidence reasoning network for edge computing environment," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4285-4294, Jul. 2019.
[13] R. Liu, N. Fusi, and L. Mackey, "Teacher-student compression with generative adversarial networks," 2018, arXiv:1812.02271. [Online]. Available: https://arxiv.org/abs/1812.02271
[14] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598-605.
[15] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360. [Online]. Available: https://arxiv.org/abs/1602.07360
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861
[17] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848-6856.
[18] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, p. 32, 2017.
[19] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 1389-1397.
[20] J.-H. Luo and J. Wu, "An entropy-based pruning method for CNN compression," 2017, arXiv:1706.05791. [Online]. Available: https://arxiv.org/abs/1706.05791
[21] R. Tibshirani, "Regression selection and shrinkage via the lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[22] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164-171.
[23] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379-1387.
[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135-1143.
[25] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: https://arxiv.org/abs/1510.00149
[26] Z. Liu, J. Xu, X. Peng, and R. Xiong, "Frequency-domain dynamic pruning for convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1043-1053.
[27] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," 2016, arXiv:1607.03250. [Online]. Available: https://arxiv.org/abs/1607.03250
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," 2016, arXiv:1608.08710. [Online]. Available: https://arxiv.org/abs/1608.08710
[29] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 5058-5066.
[30] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The power of sparsity in convolutional neural networks," 2017, arXiv:1702.06257. [Online]. Available: https://arxiv.org/abs/1702.06257
[31] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074-2082.
[32] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in Proc. IJCAI, 2018, pp. 2425-2432.
[33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 2736-2744.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[36] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," 2018, arXiv:1810.05270. [Online]. Available: https://arxiv.org/abs/1810.05270
[37] X. Ding, G. Ding, J. Han, and S. Tang, "Auto-balanced filter pruning for efficient convolutional neural networks," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 6797-6804.
[38] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: https://arxiv.org/abs/1502.03167
[39] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1586-1595.
[40] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. B (Statist. Methodol.), vol. 68, no. 1, pp. 49-67, 2006.
[41] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770-778.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980
CHEN YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. His research concerns general deep learning and machine learning, and his main research interest is deep model compression.

ZHENGHONG YANG received the master's and Ph.D. degrees from Beijing Normal University, in 1990 and 2001, respectively. He is currently a Professor with the College of Science, China Agricultural University. He has presided over two projects of the National Natural Science Foundation. He has written two teaching and research books and has published more than 40 academic papers in domestic and foreign journals, about 30 of which are indexed by SCI/EI/ISTP. His major research interests include matrix theory, numerical algebra, and image processing. He is a member of the Beijing and Chinese Societies of Computational Mathematics.

ABDUL MATEEN KHATTAK received the Ph.D. degree in horticulture and landscape from the University of Reading, U.K., in 1999. He was a Research Scientist in different agriculture research organizations before joining the University of Agriculture, Peshawar, Pakistan, where he is currently a Professor with the Department of Horticulture. He has conducted academic and applied research on different aspects of tropical fruits, vegetables, and ornamental plants. He has also worked for Alberta Agriculture and Forestry, Canada, as a Research Associate, and for the Organic Agriculture Centre of Canada as a Research and Extension Coordinator for the Alberta province, where he helped develop organic standards for greenhouse production and energy-saving technologies for Alberta greenhouses. He has considerable experience in teaching and research, is currently a Visiting Professor with the College of Information and Electrical Engineering, China Agricultural University, Beijing, has published 59 research articles in scientific journals of international repute, and has presented at several international scientific conferences. His research interests include greenhouse production; medicinal, aromatic, and ornamental plants; light quality; supplemental lighting; temperature effects on greenhouse crops; aquaponics; and organic production.

LIU YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interests include the application of image recognition and intelligent robots in the field of agriculture.

WENXIN ZHANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interest is pose estimation methods for pigs based on deep learning, for timely access to pig information.

WANLIN GAO received the B.S., M.S., and Ph.D. degrees from China Agricultural University, in 1990, 2000, and 2010, respectively. He is currently the Dean of the College of Information and Electrical Engineering, China Agricultural University. He has been the principal investigator (PI) of over 20 national plans and projects. He has published 90 academic papers in domestic and foreign journals, over 40 of which are indexed by SCI/EI/ISTP. He has written two teaching materials, supported by the National Key Technology Research and Development Program of China during the 11th Five-Year Plan Period, and five monographs. He holds 101 software copyrights, 11 patents for inventions, and eight patents for new practical inventions. His major research interests include the informationization of new rural areas, intelligent agriculture, and services for rural comprehensive information. He is a member of the Science and Technology Committee of the Ministry of Agriculture, a member of the Agriculture and Forestry Committee of Computer Basic Education in colleges and universities, and a Senior Member of the Society of Chinese Agricultural Engineering.

MINJUAN WANG received the Ph.D. degree from the School of Biological Science and Medical Engineering, Beihang University, under the supervision of Prof. Hong Liu, in June 2017. She was a Visiting Scholar with the School of Environmental Science, Ontario Agriculture College, University of Guelph, from October 2015 to May 2017. She is currently a Postdoctoral Fellow with the College of Information and Electrical Engineering, China Agricultural University. Her research interests mainly include bioinformatics and Internet of Things key technologies.