Received July 24, 2019, accepted July 30, 2019, date of publication August 5, 2019, date of current version August 15, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2933032
Structured Pruning of Convolutional Neural
Networks via L1 Regularization
CHEN YANG 1,2 , ZHENGHONG YANG 1,2 , ABDUL MATEEN KHATTAK 2,3 , LIU YANG 1,2 ,
WENXIN ZHANG 1,2 , WANLIN GAO 1,2 , AND MINJUAN WANG 1,2
1 Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
2 College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
3 Department of Horticulture, The University of Agriculture, Peshawar 25120, Pakistan
Corresponding authors: Wanlin Gao (wanlin_cau@163.com) and Minjuan Wang (minjuan@cau.edu.cn)
This work was supported by the Project of Scientific Operating Expenses from Ministry of Education of China under Grant 2017PT19.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Shen.
ABSTRACT Deep learning architecture has achieved amazing success in many areas with the recent advancements in convolutional neural networks (CNNs). However, real-time applications of CNNs are seriously hindered by their significant storage and computational costs. Structured pruning is a promising method to compress and accelerate CNNs, and it does not need special hardware or software for auxiliary calculation. Here, a simple structured pruning strategy is proposed to crop unimportant filters or neurons automatically during the training stage. The proposed method introduces a mask for all filters or neurons to evaluate their importance, so that the filters or neurons with zero masks can be removed. To achieve this, the proposed method adopts L1 regularization to zero out filters or neurons of CNNs. Experiments were conducted to assess the validity of this technique. They showed that the proposed approach could crop 90.4%, 95.6% and 34.04% of the parameters on LeNet-5, VGG-16, and ResNet-32, respectively, with a negligible loss of accuracy.

INDEX TERMS Convolutional neural networks, regularization, structured pruning.
I. INTRODUCTION
During recent years, convolutional neural networks (CNNs) [1] have accomplished successful applications in many areas such as image classification [2], object detection [3], neural style transfer [4], identity authentication [5], information security [6], speech recognition and natural language processing. However, these achievements were made by leveraging large-scale networks, which possess millions or even billions of parameters. Those large-scale networks rely heavily on GPUs to accelerate computation. Moreover, devices with limited resources, such as mobile, FPGA or embedded devices, have difficulties deploying CNNs in actual applications. Thus, it is critical to accelerate the inference of CNNs and reduce their storage for a wide range of applications [7].

According to the studies done so far, the major approaches for compressing deep neural networks can be categorized into four groups, i.e. low-rank decomposition [8], parameter quantization [9], knowledge distillation [10]-[13], and network pruning [14]. For a deep neural network (DNN) that has been trained, low-rank decomposition technology decomposes and approximates a tensor at a smaller scale to achieve compression. Low-rank decomposition achieves efficient speedup because it reduces the number of elements in the matrix. However, it can only decompose or approximate tensors one by one within every layer, and cannot discover the redundant parameters of a DNN. Besides, much research has been focused on network module designs that are smaller, more efficient and more sophisticated. These models, such as SqueezeNet [15], MobileNet [16] and ShuffleNet [17], are basically made up of low-resolution convolutions with fewer parameters and better performance.

At present, network pruning is a major focus of research, which not only accelerates DNNs but also reduces redundant parameters. Actually, using a large-scale network directly may provide state-of-the-art performance, so learning a large-scale network is needed. However, the optimum network architecture may not be known, and thus a massive redundancy exists in large neural networks. To combat this problem, network pruning is useful to remove redundant parameters, filters, channels or neurons, and to address the over-fitting issue.
FIGURE 1. The architecture of the layer with the mask. (a) The architecture of a convolutional layer with the mask. (b) The architecture of a fully-connected layer with the mask. The proposed approach chooses the unimportant filters and neurons (highlighted in yellow) by the order of magnitude of the mask value.
Network pruning techniques can be broadly categorized as structured pruning and non-structured pruning. Non-structured pruning aims to remove single parameters that have little influence on the accuracy of the network, and it is efficient and effective for compacting networks. Nonetheless, non-structured pruning is difficult to use widely in practical applications. Actually, the operation of convolution is reformulated as a matrix-by-matrix multiplication in many prevalent deep learning frameworks, which requires additional information to represent the pruned locations of a non-structured pruning method. Therefore, special hardware or software is needed to assist with the calculation, which may increase computation time. Instead, structured pruning directly removes entire filters, channels or neurons, so the remaining network architecture can be used directly by existing hardware. For example, Anwar et al. [18] employed particle filtering to structure the sparsity of convolutional neural networks at channel-wise, kernel-wise, and intra-kernel stride levels. At present, several structured pruning methods [24], [25], [27] are mainly based on the statistical information of parameters or activation outputs. These methods do not consider the loss and are unable to remove parameters during training. In addition, some methods, such as those mentioned by [19], [20], require layer-by-layer iterative pruning and accuracy recovery, which involves enormous calculations. On the contrary, the proposed approach links pruning with the minimization of the loss and can be implemented during training.

It is inspiring that the filters whose weights are all zero can be safely removed, because, whatever the input, they would not extract any features. This study presents a scheme to prune filters or neurons of fully-connected layers based on L1 regularization [21] to zero out the weights of some filters or neurons. Similar to this method, Wen et al. [31] adopted group LASSO regularization [40] to zero out filters. However, all the weights are required to compute an extra gradient, which is computationally expensive for a large-scale network. Contrarily, in the proposed method, a mask is introduced to address this issue and the regularization term is only the l1-norm of the mask, whose gradients are easy to calculate. In this method, the parameters of filters or neurons are multiplied by a mask to pick unimportant filters or neurons, and once the mask is zero the corresponding filter or neuron is removed. Here, though a mask is introduced for filters or neurons, the method does not change the architecture of the network. This allows other compression methods to be used together with the proposed technique. Similar to the proposed method, Lin et al. [32] also adopted a mask to identify unimportant filters or neurons, but the value of the mask could not be changed by training. In addition, removing unimportant filters or neurons may temporarily degrade accuracy, but the network can be retrained to recover performance. FIGURE 1 shows the framework of the proposed method.

In this article, a structured pruning technology is presented, which allows for simultaneously learning and removing unimportant filters or neurons of CNNs. The main contributions are as follows:
- A simple yet effective method based on L1 regularization is presented to compress CNN models during the training stage.
- A threshold is adopted to solve the optimization problem of the l1-norm. In this approach, only some mask values are required to be near zero, though not completely zero.
The details are provided in the following sections.

II. PREVIOUS WORK
The importance of compressing deep learning models before application is self-evident, especially for expanding the application scenarios of deep learning [11]. For example, a compressed deep learning model can be combined with edge computing [12] to enable Internet of Things devices to understand data. In this section, we review the contributions of others.
LeCun et al. [14] first proposed a saliency measurement method called Optimal Brain Damage (OBD) to selectively delete weights using second-derivative information of the error function. Later, Hassibi and Stork [22] proposed the Optimal Brain Surgeon (OBS) algorithm based on OBD. OBS not only removed unimportant weights but also automatically adjusted the remaining weights, which improved accuracy and generalization ability. All these methods are based on Taylor expansion (OBD and OBS are even required to compute the Hessian matrix), which may be computationally intensive, especially for large networks. In addition, they use a criterion of minimal increase in error on the training data. Guo et al. [23] introduced a binary matrix to dynamically choose important weights. Han et al. [24], [25] directly removed weights with values lower than a predefined threshold to compress networks, followed by retraining to recover accuracy. Considering that most filters in CNNs tend to be smooth in the spatial domain, Liu et al. [26] extended Guo's work to the frequency domain by applying the Discrete Cosine Transform (DCT) to filters in the spatial domain. However, these non-structured pruning technologies are hard to use in real applications, because extra software or hardware is required for the calculation.

Directly cropping a trained model by the value of its weights is a widely used method. Normally it relies on finding an effective evaluation to judge the importance of the weights and cutting the unimportant connections or filters to reduce the redundancy of a model. Hu et al. [27] observed that the activation outputs of a significant portion of neurons in a large network are zero, whatever inputs the network receives. These zero-activation neurons are unimportant, so they defined the Average Percentage of Zeros (APoZ) to measure the percentage of activations of a neuron and cropped the neurons with fewer activations. Li et al. [28] introduced a structured pruning method that measures the norm of filters to remove unimportant ones. Luo et al. [29] took advantage of a subset of input channels to approximate the output for compressing convolutional layers. Changpinyo et al. [30] proposed a random method to compress CNNs, randomly connecting the output channels to a small subset of the input channels. Though successful to an extent, their method did not directly relate to the loss, hence it was necessary to retrain the network to recover accuracy. On the other hand, such a scheme can only be used layer by layer, so it is essential to iterate over and over to prune, which results in massive computation costs.

Ding et al. [37] applied a customized L2 regularization to remove unimportant filters and simultaneously stimulate important filters to grow stronger. Lin et al. [32] proposed a Global & Dynamic Filter Pruning (GDP) method, which could dynamically recover previously removed filters. Liu et al. [33] enforced channel-level sparsity in the network to compress DNNs in the training phase. In addition, Gordon et al. [39] iteratively shrank and expanded a network targeting the reduction of particular resources (e.g. FLOPs, or the number of parameters).

III. THE APPROACH OF STRUCTURED PRUNING FOR CNNs
A. NOTATIONS
First of all, the notations are clarified in this section. A CNN is a multi-layer deep feed-forward neural network, which is composed of a stack of convolutional layers, pooling layers, and fully-connected layers. In an l-layer CNN model, W_l^k ∈ R^{d×d×C_{l-1}} represents the k-th filter of layer l, C_{l-1} denotes the number of feature maps in layer l-1, and d indicates the kernel size. Let us denote the feature maps in layer l by Z_l ∈ R^{H_l×W_l×C_l}, where H_l×W_l is their size, C_l is the number of channels, and Z_l is the output of layer l. In addition, Z_l^k represents the k-th feature map of layer l. The output feature map Z_l^k can be computed as:

Z_l^k = f(Z_{l-1} * W_l^k + b_l^k),    (1)

where f(·) is a non-linear activation function, * is the convolutional operation and b_l^k is the bias. D = {X = {x_1, x_2, ..., x_N}, Y = {y_1, y_2, ..., y_N}} represents the training set, where x_i and y_i represent the training sample and its label respectively, and N indicates the number of samples.

B. THE PROPOSED SCHEME
The goal of pruning is to remove those redundant filters or neurons which are unimportant or useless for the performance of the network. Essentially, the main role of the convolutional layer filters is to extract local features. However, once all the parameters of a filter are zeroed, the filter is confirmed unimportant: whatever the inputs to the filter, the outputs are always zero, and under this circumstance the filter is unable to extract any information. When a filter is multiplied by zero, all the parameters of the filter become zero. Based on this observation, a mask is introduced for every filter to estimate its importance. This can be formulated as:

Z_l^k = f(Z_{l-1} * (W_l^k · m_l^k) + b_l^k),    (2)

where m_l^k represents the k-th mask of layer l. Therefore, the problem of zeroing out the values of some filters can be transformed into zeroing some masks. For this purpose, the following optimization problem is proposed:

min_W L(Y, F(X; W, m))   s.t.   ||m||_0 ≤ C,    (3)

where L(·) is a loss function, such as the cross-entropy loss, F(·) is the output of the CNN and C is a hyper-parameter that controls the number of pruned filters. Equation (3) is the core of the proposed method. Once the optimal solution of the equation is obtained, the pruning is achieved.
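To make the masked formulation of equation (2) concrete, the following is a minimal sketch in TensorFlow/Keras (the library the experiments in Section IV use); the class and variable names are illustrative assumptions, not the authors' released code. Each filter of the convolution is scaled by a trainable scalar mask, so driving a mask to zero silences the whole filter.

import tensorflow as tf

class MaskedConv2D(tf.keras.layers.Layer):
    """Convolutional layer whose k-th filter is scaled by a trainable scalar mask (Eq. (2))."""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.kernel_size = kernel_size

    def build(self, input_shape):
        c_in = int(input_shape[-1])
        k = self.kernel_size
        # W_l: d x d x C_{l-1} x C_l kernel and one bias per filter.
        self.kernel = self.add_weight(name="kernel", shape=(k, k, c_in, self.filters),
                                      initializer="glorot_uniform", trainable=True)
        self.bias = self.add_weight(name="bias", shape=(self.filters,),
                                    initializer="zeros", trainable=True)
        # One scalar mask per filter, initialized to 1 as in Algorithm 1.
        self.mask = self.add_weight(name="mask", shape=(self.filters,),
                                    initializer="ones", trainable=True)

    def call(self, x):
        masked_kernel = self.kernel * self.mask        # broadcasts over the filter axis
        y = tf.nn.conv2d(x, masked_kernel, strides=1, padding="SAME") + self.bias
        return tf.nn.relu(y)                           # f(.) in Eq. (2)

A fully-connected analogue for equation (5) below follows the same pattern, with the mask vector multiplying the output units of the weight matrix.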
In addition, this method can also remove redundant neurons in a fully-connected layer. The inference of a fully-connected layer can be represented by:

Z_l = f(W_l Z_{l-1} + b_l),    (4)
where W_l ∈ R^{m×n} is a weight matrix and Z_{l-1} ∈ R^{n×1} is the input of the l-th layer. When a mask is introduced for fully-connected layers, their inference can be reformulated as:

Z_l = f(W_l Z_{l-1} ⊙ m_l + b_l),    (5)

where m_l ∈ R^m is a mask vector and ⊙ is the Hadamard product operator.

Equation (3) can be transformed into the following form based on the Lagrange multiplier:

min_W L(Y, f(X; W, m)) + λ||m||_0,    (6)

where λ is a coefficient associated with C. Equation (6) is an NP-hard problem because of the zero norm, so it is quite difficult to obtain an optimal solution of equation (6). Therefore, the l1-norm is adopted to replace the l0-norm, as:

min_W L(Y, f(X; W, m)) + λ||m||_1.    (7)

Equation (7) can be solved by SGD in practical applications, so the proposed method is simple and easy to implement. We just need to introduce a mask for each layer and train the network. Though the proposed method introduces masks, the network topology is preserved because the mask can be absorbed into the weights.
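One possible way to realize the objective of equation (7) is sketched below, under the assumption that every mask variable is created with "mask" in its name (as in the layer sketch above); the value of λ is illustrative.

import tensorflow as tf

LAMBDA = 0.005  # illustrative value of the coefficient lambda in Eq. (7)

def pruning_objective(model, x, y):
    """Cross-entropy loss plus LAMBDA * ||m||_1 over every mask variable (Eq. (7))."""
    cross_entropy = tf.keras.losses.sparse_categorical_crossentropy(y, model(x, training=True))
    masks = [w for w in model.trainable_weights if "mask" in w.name]
    l1_of_masks = tf.add_n([tf.reduce_sum(tf.abs(m)) for m in masks])
    return tf.reduce_mean(cross_entropy) + LAMBDA * l1_of_masks

Equivalently, attaching tf.keras.regularizers.l1(LAMBDA) to each mask variable when it is created lets a standard compile/fit loop with SGD minimize the same objective.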
C. THRESHOLD
L1 regularization is a widely used sparsity technology, which pushes the coefficients of uninformative features towards zero, so a sparse network is achieved by solving equation (7). However, there is a problem in solving equation (7): the mask values cannot be completely zeroed in practical applications, because the objective function (7) is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of a mask value is small enough, it can be considered almost zero. Thus, to decide whether a mask is zero, a threshold is introduced. However, considering only the value of the mask is meaningless if the mask is not completely zero, because there is a linear transformation between the mask and the convolution: one can shrink the masks while expanding the weights to keep their product the same. Hence, considering the mask and the weights simultaneously is necessary. The average value of the product of the mask and the weights is used to determine whether the mask is set exactly to zero or not. The specific definition can be presented as:

m_l^k = m_l^k   if |E(m_l^k · w_l^k)| ≥ ε,
m_l^k = 0       if |E(m_l^k · w_l^k)| < ε,    (8)

where ε is a pre-defined threshold and E(·) is the average operation. This strategy is efficient and reasonable, which is proved by the results of the experiments.
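As a small illustration of equation (8), the NumPy sketch below zeroes the masks of filters whose average masked weight is below the threshold; the kernel layout (k, k, C_{l-1}, C_l) and the helper name are assumptions.

import numpy as np

EPSILON = 0.01  # the threshold value used for LeNet-5 and VGG-16 in Section IV

def threshold_masks(kernel, mask, eps=EPSILON):
    """Eq. (8): zero a filter's mask when |E(m_k * w_k)| falls below the threshold.
    kernel is assumed to have shape (k, k, C_in, C_out) and mask shape (C_out,)."""
    scores = np.abs(np.mean(kernel * mask, axis=(0, 1, 2)))  # |E(m_k * w_k)| per filter
    new_mask = mask.copy()
    new_mask[scores < eps] = 0.0
    return new_mask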
Algorithm 1 The Proposed Pruning Approach
Input: Training data D, a CNN model, threshold ε, penalty factor C, mask m.
DO:
1. Initialize the weights W and the mask m = 1.
2. Train the CNN with the mask, for a suitable C.
3. Prune the filters or neurons based on the value of the mask.
4. Fine-tune the network by retraining.
End
Merge the weights and masks and then remove the mask layer.
Return the pruned network architecture and the preserved weights.

D. FINE-TUNING AND OTHER REGULARIZATION STRATEGIES
Pruning may temporarily lead to a degradation in accuracy, so fine-tuning is necessary to improve accuracy. Furthermore, the proposed method can be employed iteratively to obtain a narrower architecture. Actually, a single iteration of the proposed method is enough to yield noticeable compaction. The method is elaborated in Algorithm 1.
Essentially, the purpose of this approach is to adjust some masks to an adequately small order of magnitude. Therefore, L2 regularization can also serve as a regularization strategy in this approach.
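The final step of Algorithm 1, merging the weights and masks and then removing the mask layer, might look like the following sketch (an assumption about array layout, not the authors' code): the mask is folded into the kernel and the filters whose mask is zero are dropped.

import numpy as np

def merge_and_prune(kernel, bias, mask):
    """Absorb the mask into the weights (W <- W * m) and drop zero-mask filters,
    mirroring the final step of Algorithm 1. Array layouts are assumptions."""
    folded = kernel * mask                     # mask broadcasts over the filter axis
    keep = np.flatnonzero(mask != 0.0)         # indices of surviving filters
    # The same 'keep' indices must also slice the input-channel axis (axis 2)
    # of the next layer's kernel so that the shapes stay consistent.
    return folded[..., keep], bias[keep], keep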
IV. EXPERIMENTS
The approach was primarily evaluated with three networks: LeNet-5 on the MNIST dataset, VGG-16 on the CIFAR-10 dataset and ResNet-32 on the CIFAR-10 dataset. The implementation of this approach was accomplished with the standard Keras library. All experiments were conducted on an Intel E5-2630 V4 CPU and an NVIDIA 1080Ti GPU.

A. DATASETS
1) MNIST
The MNIST dataset of handwritten digits from 0 to 9 is widely used to evaluate machine learning models. This dataset has 60000 training samples and 10000 test samples.

2) CIFAR-10
The CIFAR-10 dataset [41] has a total of 60000 images consisting of 10 classes, each having 6000 images with 32x32 resolution. There are 50000 training images and 10000 test images. During training, a data augmentation scheme was adopted, which contained random horizontal flips, rotations, and translations. The input data were normalized using the means and standard deviations.
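The augmentation and normalization described above could be reproduced with standard Keras utilities roughly as follows; the rotation and shift amounts are illustrative assumptions, since the paper does not report them.

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train.astype("float32"), x_test.astype("float32")
mean, std = x_train.mean(axis=(0, 1, 2)), x_train.std(axis=(0, 1, 2))
x_train, x_test = (x_train - mean) / std, (x_test - mean) / std   # per-channel normalization

augmenter = ImageDataGenerator(horizontal_flip=True,     # random horizontal flip
                               rotation_range=15,        # rotation (degrees, assumed)
                               width_shift_range=0.1,    # translation (fractions, assumed)
                               height_shift_range=0.1)
# train_iter = augmenter.flow(x_train, y_train, batch_size=128)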
B. NETWORK MODELS
1) LENET-5
LeNet-5 is a convolutional neural network designed by LeCun et al. [34]. It has two convolutional and two fully-connected layers, with 44.2K learnable parameters. In this network, dropout is used in the fully-connected layers.
TABLE 1. The results of LeNet-5 on MNIST.
2) VGG-16
The original VGG-16 [35] has thirteen convolutional and two fully-connected layers, with 130M learnable parameters. However, VGG-16 is very complex for the CIFAR-10 dataset, so the fully-connected layers were removed. Moreover, Batch Normalization was used after each convolution operation. The modified model has 14.7M learnable parameters.

3) RESNET-32
The deep residual network (ResNet) [42] is a state-of-the-art CNN architecture. In this paper, ResNet-32 was implemented to evaluate the proposed method. The ResNet-32 used here had the same architecture as described in [42], which contained three stages of convolutions, one global average pooling after the last convolutional layer and one fully-connected layer. In addition, when the dimensions increased, a 1x1 convolution was adopted as the identity mapping to match the dimensions. This network has 0.47M learnable parameters.

C. THE DETAILS OF TRAINING, PRUNING, AND FINE-TUNING
To obtain the baseline accuracy in the experiments, we trained LeNet-5 on MNIST, VGG-16 on CIFAR-10, and ResNet-32 on CIFAR-10 from scratch. Then, pruning was performed on the basis of the trained network, with L1 regularization chosen as the regularization strategy and the mask initialized to 1. Finally, we retrained the pruned network to recover accuracy.

1) LENET-5 ON MNIST
The original network was normally trained from scratch, for a total of 30 epochs, by Adam [43] with a batch size of 128. The learning rate was initialized to 0.001, the weight decay was set to 0.0005, the momentum was set to 0.9, and the dropout rate was set to 0.5 for the fully-connected layer. While implementing the pruning training, only the number of epochs was modified: the epochs were set to 10 and the threshold mentioned above to select pruned filters was set to 0.01. The pruned network was then retrained to compensate for the loss of accuracy, adopting the same hyper-parameter settings as in normal training.

2) VGG-16 ON CIFAR-10
To get the baseline accuracy, the network was normally trained from scratch by SGD with a batch size of 128. The total epochs were set to 60. The initial learning rate was set to 0.01 and then scaled by 0.1 every 20 epochs. The weight decay was set to 0.0005 and the momentum to 0.9. While implementing the pruning training, the epochs were set to 30, the learning rate was scaled by 0.1 every 10 epochs and the other settings remained the same, while the threshold was set to 0.01. Finally, the pruned model was retrained following the same pre-processing and hyper-parameter settings as the normal training.

3) RESNET-32 ON CIFAR-10
Generally, the network was trained from scratch by SGD as the baseline with a batch size of 128. The weight decay was set to 0.0001, the epochs were set to 120, and the momentum was set to 0.9. The initial learning rate was set to 0.1 and then scaled by 0.1 at 60 and 100 epochs. Here, for the pruning training, the epochs were set to 30, the learning rate was scaled by 0.1 every 10 epochs and the other settings remained the same. After pruning, the network was retrained from scratch. The epochs were modified to 60 and the learning rate was scaled by 0.1 every 20 epochs.
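The step schedule used for the VGG-16 baseline (initial rate 0.01, scaled by 0.1 every 20 epochs, momentum 0.9) can be expressed with a Keras callback as sketched below; attaching the 0.0005 weight decay as an L2 kernel regularizer on each layer is an implementation assumption, not something the paper specifies.

import tensorflow as tf

def step_decay(epoch, lr=None):
    """Initial rate 0.01, multiplied by 0.1 every 20 epochs (the VGG-16 baseline schedule)."""
    return 0.01 * (0.1 ** (epoch // 20))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=60, callbacks=[lr_schedule])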
D. RESULTS OF THE EXPERIMENTS
1) LENET-5 ON MNIST
As per the results in TABLE 1, 88.84% of the parameters were removed without any impact on performance. Based on the proposed method, 95.46% of the parameters could also be discarded with an accuracy loss of only 0.57%.
TABLE 2. Results of VGG-16 on the CIFAR-10 dataset.
TABLE 1 also reveals that there was enormous redundancy in the fully-connected layers, because at least 90% of the parameters of the fully-connected layers could easily be dropped. According to the table, the proposed method may indeed find the important connections. The reasons can be summarized in two points. First, when 83.83% of the parameters are removed, the accuracy does not change. This indicates that the pruned parameters are unimportant for maintaining the accuracy of the network. Second, it is difficult to remove some filters or neurons, especially the neurons of the fully-connected layers, when the pruning rate gradually increases. So the remaining connections are crucial.

In addition, the convolutional layers, especially the first one, are hard to prune in comparison with the later layers. A possible explanation is that the proposed method automatically selects the unimportant filters through the backpropagation algorithm. However, backpropagation causes the earlier layers to suffer from the vanishing gradient problem. That is why the former layers are hard to prune compared to the later ones.

2) VGG-16 ON CIFAR-10
As depicted in TABLE 2, over 94.4% of the parameters could be removed with a negligible accuracy loss of 0.51%. It can also be observed that the loss of accuracy was only 2.04% when 97.76% of the parameters were pruned. The proposed method proved to be effective again in reducing redundancy.

In fact, preserving the remaining architecture without retaining the parameters (training the pruned network from scratch) is also a strategy to fine-tune the network. This strategy was adopted here to retrain the network and the results were promising, as shown in TABLE 2. The results reveal that a better effect can be achieved by directly retraining the pruned network from scratch. Perhaps the significance of the proposed method is that it furnishes the facility to discover excellent architectures, as mentioned by Liu et al. [36] as well. Nevertheless, training a pruned network from scratch is expensive in terms of computation cost, especially in the case of large-scale datasets and networks.

FIGURE 2. Comparison of L1 regularization and L2 regularization. The accuracy loss represents the difference in accuracy between the pruned CNN and the original CNN. A positive value indicates an improvement of network accuracy after pruning, while a negative value indicates a decrease in accuracy.

3) RESNET-32 ON CIFAR-10
Pruning ResNet-32 based on the order of magnitude of the mask may result in different output map dimensions in the residual module, so a 1x1 convolution is needed as the identity mapping to match dimensions. However, this operation brings extra parameters and computation. To avoid this problem, a percentile was defined to remove filters of the same proportion in every convolutional layer. TABLE 3 shows that the proposed method removed 34% of the parameters with an accuracy loss of 0.65%. Moreover, over 62.3% of the parameters could also be discarded with an accuracy loss of 1.76%. Thus, it was confirmed that the proposed method could reduce the redundancy of a complex network, i.e. ResNet.
FIGURE 3. The comparison of pruned and reserved filters. (a) The comparison of the order of magnitude of the parameters of pruned and reserved filters. The x-axis represents the distribution interval and the y-axis represents the percentage of the parameters in the interval. (b) The comparison of non-zero activations. The left bar represents the average non-zero activation percentage, and the right bar represents the average non-zero activation value.
TABLE 3. Results of ResNet-32 on the CIFAR-10 dataset.

V. ANALYSIS
A. L2 REGULARIZATION
L2 regularization was also explored as a regularization strategy in this study. As shown in FIGURE 2, LeNet-5 can also be compressed without degrading accuracy based on L2 regularization. Nevertheless, there is some difference between L1 regularization and L2 regularization. Both L1 and L2 regularization can improve accuracy when the pruning rate is less than 84%, but the effect of L2 regularization is better. The main reason is that regularization techniques can prevent overfitting and improve generalization ability. Moreover, as the pruning rate increases, L1 regularization achieves a greater compression effect at the same accuracy. As per Han et al. [24], L1 regularization pushes more parameters closer to zero, so it can prune more parameters. Having studied the difference between L1 regularization and L2 regularization, the inclination is more towards L1 regularization from the perspective of the compression and accuracy trade-off.

B. THE EFFECT OF PRUNING
To better describe the effect of the proposed method, a comparison was made between the pruned filters and the reserved filters. The CONV3-1 layer of VGG-16, which has 256 filters, was chosen, with ε set to 0.008. Based on this setting, 125 filters of the CONV3-1 layer could be removed. Empirically, a weak filter or neuron always has lower activation outputs, lower activation frequency, and lower weight values. Hence, weight values and activation outputs were chosen here to evaluate the difference between pruned and preserved filters.

As shown in FIGURE 3(a), the bulk of the values of the pruned parameters, 96.9%, are less than 10^-6 in terms of absolute weight value. However, most of the values of the reserved parameters, 94.5%, are greater than 0.001. The results indicate an enormous distribution difference between the values of the pruned and the reserved parameters. Therefore, the present approach can effectively reduce the order of magnitude of the pruned parameters.

In addition, the test set was chosen as a sample to calculate the average non-zero activation values and percentages of CONV3-1. As is obvious from FIGURE 3(b), both the average percentage of non-zero activations and the average value of non-zero activations of the pruned filters were much lower than those of the reserved filters. From the activation perspective, the pruned filters were weak, because the output and weight values of the pruned filters were negligible compared with those of the reserved filters and could be completely ignored. Thus, using the order of magnitude of the mask to determine pruned filters or neurons was reasonable.
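The activation statistics reported here (average non-zero activation percentage and value for CONV3-1) can be computed along the following lines, assuming a functional Keras model and an illustrative layer name; this is a sketch, not the authors' evaluation script.

import numpy as np
import tensorflow as tf

def nonzero_activation_stats(model, layer_name, x):
    """Per-filter average non-zero activation percentage and value for one layer,
    computed over a batch of inputs (a functional Keras model is assumed)."""
    probe = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    acts = probe.predict(x, verbose=0)                     # shape (N, H, W, C)
    nonzero = acts > 0
    pct = nonzero.mean(axis=(0, 1, 2))                     # average non-zero percentage
    mean_val = acts.sum(axis=(0, 1, 2)) / np.maximum(nonzero.sum(axis=(0, 1, 2)), 1)
    return pct, mean_val

# Example (layer name is an assumption): pct, val = nonzero_activation_stats(vgg, "conv3_1", x_test)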
C. COMPARISON WITH OTHER METHODS
In this section, two classical structured pruning methods were compared with the proposed method. First, for LeNet-5 on the MNIST dataset, the proposed method was compared with that of Wen et al. [31]. In this experiment, both the proposed method and that of Wen et al. [31] adopted the same coefficient of sparsity regularization (λ = 0.03). The results (TABLE 5) show that both methods were analogous in terms of accuracy and compression effect. However, the proposed method is simpler and costs less computation in practice. Further, the proposed method was also compared with that of Liu et al. [33] for VGG-16 on CIFAR-10.
TABLE 4. Comparison of VGG-16 on CIFAR-10.

TABLE 5. Comparison of LeNet-5 on MNIST.

Again, the same sparsity regularization coefficient (λ = 0.005) was adopted for both methods. However, Liu et al. [33] adopted a fixed percentage threshold setting, whereas the threshold setting scheme of the proposed method is different. The results (in TABLE 4) reveal that the proposed method was superior in terms of compression efficiency, although there was a slight loss of accuracy. In general, the proposed method can not only generate sparsity but also achieve a better pruning effect with its improved threshold.

Nevertheless, some shortcomings were also observed with this approach. One is that, though this approach does not change the existing CNN architecture, the added mask layer essentially increases the number of layers in the network, which may increase the optimization difficulty. However, this problem can be solved by Batch Normalization (BN [38]). The other is that, as this method introduces a threshold, the pruning effect may not be smooth. The pruning rate may change drastically with small changes in ε, which is not conducive to finding the best ε.

VI. CONCLUSION
In this article, a structured pruning technology is proposed to automatically tailor redundant filters or neurons based on regularization. A mask is introduced to remove unimportant filters or neurons by zeroing the values of some masks during training. In addition, to deal with the problem that the mask cannot be completely zeroed in practice, a threshold is designed to zero the mask. Experimentation with multiple datasets has proved that the proposed method can effectively remove parameters with a negligible loss of accuracy. In the future, establishing a relation between the hyper-parameter and the pruning rate will be considered, to facilitate the adjustment of the hyper-parameter.

ACKNOWLEDGMENT
All the mentioned support is gratefully acknowledged.

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097-1105.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680.
[5] C. Shen, Y. Li, Y. Chen, X. Guan, and R. Maxion, "Performance analysis of multi-motion sensor behavior for active smartphone authentication," IEEE Trans. Inf. Forensics Security, vol. 13, no. 1, pp. 48-62, Jan. 2018.
[6] C. Shen, Y. Chen, X. Guan, and R. Maxion, "Pattern-growth based mining mouse-interaction behavior for an active user authentication system," IEEE Trans. Dependable Secure Comput., to be published.
[7] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," 2017, arXiv:1710.09282. [Online]. Available: https://arxiv.org/abs/1710.09282
[8] C. Tai, T. Xiao, Y. Zhang, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," 2015, arXiv:1511.06067. [Online]. Available: https://arxiv.org/abs/1511.06067
[9] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2285-2294.
[10] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," 2014, arXiv:1412.6115. [Online]. Available: https://arxiv.org/abs/1412.6115
[11] Z. Tian, S. Su, W. Shi, X. Du, M. Guizani, and X. Yu, "A data-driven method for future Internet route decision modeling," Future Gener. Comput. Syst., vol. 95, pp. 212-220, Jun. 2018.
[12] Z. Tian, W. Shi, Y. Wang, C. Zhu, X. Du, S. Su, Y. Sun, and N. Guizani, "Real-time lateral movement detection based on evidence reasoning network for edge computing environment," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4285-4294, Jul. 2019.
[13] R. Liu, N. Fusi, and L. Mackey, "Teacher-student compression with generative adversarial networks," 2018, arXiv:1812.02271. [Online]. Available: https://arxiv.org/abs/1812.02271
[14] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598-605.
[15] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360. [Online]. Available: https://arxiv.org/abs/1602.07360
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861
[17] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848-6856.
[18] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, p. 32, 2017.
[19] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 1389-1397.
[20] J.-H. Luo and J. Wu, "An entropy-based pruning method for CNN compression," 2017, arXiv:1706.05791. [Online]. Available: https://arxiv.org/abs/1706.05791
[21] R. Tibshirani, "Regression selection and shrinkage via the lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[22] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164-171.
[23] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379-1387.
[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135-1143.
[25] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: https://arxiv.org/abs/1510.00149
[26] Z. Liu, J. Xu, X. Peng, and R. Xiong, "Frequency-domain dynamic pruning for convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1043-1053.
[27] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," 2016, arXiv:1607.03250. [Online]. Available: https://arxiv.org/abs/1607.03250
[28] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," 2016, arXiv:1608.08710. [Online]. Available: https://arxiv.org/abs/1608.08710
[29] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 5058-5066.
[30] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The power of sparsity in convolutional neural networks," 2017, arXiv:1702.06257. [Online]. Available: https://arxiv.org/abs/1702.06257
[31] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074-2082.
[32] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in Proc. IJCAI, 2018, pp. 2425-2432.
[33] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 2736-2744.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[36] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," 2018, arXiv:1810.05270. [Online]. Available: https://arxiv.org/abs/1810.05270
[37] X. Ding, G. Ding, J. Han, and S. Tang, "Auto-balanced filter pruning for efficient convolutional neural networks," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 6797-6804.
[38] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: https://arxiv.org/abs/1502.03167
[39] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1586-1595.
[40] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. B (Statist. Methodol.), vol. 68, no. 1, pp. 49-67, 2006.
[41] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770-778.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980

CHEN YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. His research concerns general deep learning and machine learning, and his main research interest is deep model compression.

ZHENGHONG YANG received the master's and Ph.D. degrees from Beijing Normal University, in 1990 and 2001, respectively. He is currently a Professor with the College of Science, China Agricultural University. He has presided over two projects of the National Natural Science Foundation. He has written two teaching and research books and has published more than 40 academic papers in domestic and foreign journals, among them, about 30 are cited by SCI/EI/ISTP. His major research interests include matrix theory, numerical algebra, image processing, and so on. He is a member of the Beijing and Chinese Society of Computational Mathematics.

ABDUL MATEEN KHATTAK received the Ph.D. degree in horticulture and landscape from the University of Reading, U.K., in 1999. He was a Research Scientist in different agriculture research organizations before joining the University of Agriculture, Peshawar, Pakistan, where he is currently a Professor with the Department of Horticulture. He has conducted academic and applied research on different aspects of tropical fruits, vegetables, and ornamental plants. He has also worked for Alberta Agriculture and Forestry, Canada, as a Research Associate, and the Organic Agriculture Centre of Canada as a Research and Extension Coordinator for Alberta province. There he helped in developing organic standards for greenhouse production and energy saving technologies for Alberta greenhouses. He is a Professor with considerable experience in teaching and research. He is currently a Visiting Professor with the College of Information and Electrical Engineering, China Agricultural University, Beijing. He has published 59 research articles in scientific journals of international repute. He has also attended and presented in several international scientific conferences. His research interests include greenhouse production; medicinal, aromatic and ornamental plants; light quality; supplemental lighting; temperature effects on greenhouse crops; aquaponics; and organic production.

LIU YANG is currently pursuing the master's degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interests include the application of image recognition and intelligent robots in the field of agriculture.

WENXIN ZHANG is currently pursuing the master's degree with the School of Information and Electrical Engineering, China Agricultural University, Beijing, China. Her research interest includes pose estimation methods for pigs based on deep learning, for timely access to pig information.

WANLIN GAO received the B.S., M.S., and Ph.D. degrees from China Agricultural University, in 1990, 2000, and 2010, respectively. He is currently the Dean of the College of Information and Electrical Engineering, China Agricultural University. He has been the principal investigator (PI) of over 20 national plans and projects. He has published 90 academic papers in domestic and foreign journals, among them, over 40 are cited by SCI/EI/ISTP. He has written two teaching materials, which are supported by the National Key Technology Research and Development Program of China during the 11th Five-Year Plan Period, and five monographs. He holds 101 software copyrights, 11 patents for inventions, and eight patents for new practical inventions. His major research interests include the informationization of new rural areas, intelligent agriculture, and the service for rural comprehensive information. He is a member of the Science and Technology Committee of the Ministry of Agriculture, a member of the Agriculture and Forestry Committee of Computer Basic Education in colleges and universities, and a Senior Member of the Society of Chinese Agricultural Engineering, etc.

MINJUAN WANG received the Ph.D. degree from the School of Biological Science and Medical Engineering, Beihang University, under the supervision of Prof. Hong Liu, in June 2017. She was a Visiting Scholar with the School of Environmental Science, Ontario Agriculture College, University of Guelph, from October 2015 to May 2017. She is currently a Postdoctoral Fellow with the College of Information and Electrical Engineering, China Agricultural University. Her research interests mainly include bioinformatics and the Internet of Things key technologies.