arXiv:1608.03665v4 [cs.NE] 18 Oct 2016

Learning Structured Sparsity in Deep Neural Networks

Wei Wen (University of Pittsburgh, wew57@pitt.edu)
Chunpeng Wu (University of Pittsburgh, chw127@pitt.edu)
Yandan Wang (University of Pittsburgh, yaw46@pitt.edu)
Yiran Chen (University of Pittsburgh, yic52@pitt.edu)
Hai Li (University of Pittsburgh, hal66@pitt.edu)

Abstract

High demand for computation resources severely hinders the deployment of large-scale Deep Neural Networks (DNN) in resource-constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of the DNN to efficiently accelerate the DNN's evaluation. Experimental results show that SSL achieves on average 5.1× and 3.1× speedups of convolutional layer computation of AlexNet on CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about twice the speedups of non-structured sparsity; (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network (ResNet) to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original ResNet with 32 layers. For AlexNet, structure regularization by SSL also reduces the error by about 1%. Our source code can be found at https://github.com/wenwei202/caffe/tree/scnn

1 Introduction

Deep neural networks (DNN), especially deep convolutional neural networks (CNN), have achieved remarkable success in visual tasks [1][2][3][4][5] by leveraging large-scale networks learning from a huge volume of data. Deployment of such big models, however, is computation-intensive and memory-intensive. To reduce computation cost, many studies have been performed to compress the scale of DNNs, including sparsity regularization [6], connection pruning [7][8] and low rank approximation [9][10][11][12][13]. Sparsity regularization and connection pruning approaches, however, often produce non-structured random connectivity in the DNN and thus irregular memory access that adversely impacts practical acceleration on hardware platforms. Figure 1 depicts the practical speedup of each layer of AlexNet, which is non-structurally sparsified by ℓ1-norm regularization. Compared to the original model, the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality associated with the scattered weight distribution, the achieved speedups are either very limited or negative even when the actual sparsity is high, say, >95%. We define sparsity as the ratio of zeros in this paper.

Figure 1: Evaluation speedups of AlexNet on GPU platforms and the sparsity. conv1 refers to convolutional layer 1, and so forth. The baseline is profiled by the GEMM of cuBLAS. The sparse matrices are stored in the Compressed Sparse Row (CSR) format and accelerated by cuSPARSE.

In recently proposed low rank approximation approaches, the DNN is trained first, and then each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally, fine-tuning is performed to restore the model accuracy. Low rank approximation is able to achieve practical speedups because it coordinates model parameters in dense matrices and avoids the locality problem of non-structured sparsity regularization. However, low rank approximation can only obtain
the compact structure within each layer, and the structures of the layers are fixed during fine-tuning, such that costly reiterations of decomposing and fine-tuning are required to find an optimal weight approximation for performance speedup and accuracy retaining.

Inspired by the facts that (1) there is redundancy across filters and channels [11]; (2) shapes of filters are usually fixed as cuboids, but enabling arbitrary shapes can potentially eliminate unnecessary computation imposed by this fixation; and (3) the depth of the network is critical for classification, but deeper layers cannot always guarantee a lower error because of the exploding gradients and degradation problem [5], we propose the Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during training. SSL is a generic regularization that adaptively adjusts multiple structures in a DNN, including the structures of filters, channels, and filter shapes within each layer, and the structure of depth beyond the layers. SSL combines structure regularization (on the DNN for classification accuracy) with locality optimization (on memory access for computation efficiency), offering not only well-regularized big models with improved accuracy but also greatly accelerated computation (e.g., 5.1× on CPU and 3.1× on GPU for AlexNet).

2 Related works

Connection pruning and weight sparsifying. Han et al. [7][8] reduced the number of parameters of AlexNet by 9× and of VGG-16 by 13× using connection pruning. Since most of the reduction is achieved on fully-connected layers, the authors obtained 3× to 4× layer-wise speedups for fully-connected layers. However, no practical speedups of convolutional layers were observed because of the issue shown in Figure 1. As convolution is the computational bottleneck and many new DNNs use fewer fully-connected layers (e.g., only 3.99% of the parameters of ResNet-152 in [5] are in fully-connected layers), compression and acceleration of convolutional layers become essential. Liu et al. [6] achieved >90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue shown in Figure 1 by hardcoding the sparse weights into the program, achieving a layer-wise 4.59× speedup on a CPU. In this work, we also focus on convolutional layers. Compared to the above techniques, our SSL method can coordinate sparse weights in adjacent memory space and achieve higher speedups with the same accuracy. Note that hardware and program optimizations can further boost the system performance on top of SSL, but they are not covered in this work.

Low rank approximation. Denil et al. [9] predicted 95% of the parameters in a DNN by exploiting the redundancy across filters and channels. Inspired by it, Jaderberg et al. [11] achieved a 4.5× speedup on CPUs for scene text character recognition, and Denton et al. [10] achieved 2× speedups on both CPUs and GPUs for the first two layers. Both works used Low Rank Approximation (LRA) with about 1% accuracy drop. [13][12] improved and extended LRA to larger DNNs. However, the network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning, and cross-validating are still needed to find an optimal structure for the accuracy and speed trade-off. As the number of hyper-parameters in LRA methods increases linearly with layer depth [10][13], the search space increases linearly or even polynomially for very deep DNNs.
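To make the LRA idea above concrete, the per-layer step can be sketched with a truncated SVD of a lowered weight matrix; this is only our illustration of the general principle (the cited works use more elaborate decompositions of the 4-D tensors), and the matrix shape and rank below are hypothetical.

```python
import numpy as np

def low_rank_factors(w2d, r):
    """Approximate a lowered weight matrix by a product of two smaller factors."""
    u, s, vt = np.linalg.svd(w2d, full_matrices=False)
    a = u[:, :r] * s[:r]          # (rows, r)
    b = vt[:r, :]                 # (r, cols)
    return a, b                   # w2d ~= a @ b; cost r*(rows+cols) instead of rows*cols

w = np.random.randn(256, 1152)    # hypothetical lowered conv weight matrix
a, b = low_rank_factors(w, r=64)
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))  # relative approximation error
```

The rank r is a per-layer hyper-parameter, which is why the LRA search space grows with the network depth.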
Compared to LRA, our contributions are: (1) SSL can dynamically optimize the compactness of the DNN structure with only one hyper-parameter and no reiterations; (2) besides the redundancy within the layers, SSL also exploits the necessity of deep layers and reduces them; (3) DNN filters regularized by SSL have lower-rank approximations, so SSL can work together with LRA for more efficient model compression.

Model structure learning. Group Lasso [14] is an efficient regularization for learning sparse structures. Kim et al. [15] used group Lasso to regularize the structure of a correlation tree for multi-task regression and reduced prediction errors. Liu et al. [6] utilized group Lasso to constrain the scale of the structure of LRA. To adapt the DNN structure to different databases, Feng et al. [16] learned the appropriate number of filters in a DNN. Different from these prior arts, we apply group Lasso to regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). Our source code can be found at https://github.com/wenwei202/caffe/tree/scnn.

Figure 2: The proposed structured sparsity learning (SSL) for DNNs. Weights in filters are split into multiple groups. Through group Lasso regularization, a more compact DNN is obtained by removing some groups. The figure illustrates the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity explored in this work.

3 Structured Sparsity Learning Method for DNNs

We focus mainly on Structured Sparsity Learning (SSL) for convolutional layers to regularize the structure of DNNs. We first propose a generic method to regularize structures of a DNN in Section 3.1, and then specify the method to the structures of filters, channels, filter shapes and depth in Section 3.2. Variants of the formulations are also discussed from a computational efficiency viewpoint in Section 3.3.

3.1 Proposed structured sparsity learning for generic structures

Suppose the weights of the convolutional layers in a DNN form a sequence of 4-D tensors $W^{(l)} \in \mathbb{R}^{N_l \times C_l \times M_l \times K_l}$, where $N_l$, $C_l$, $M_l$ and $K_l$ are the dimensions of the $l$-th ($1 \le l \le L$) weight tensor along the axes of filter, channel, spatial height and spatial width, respectively. $L$ denotes the number of convolutional layers. The proposed generic optimization target of a DNN with structured sparsity regularization can then be formulated as:

$$E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g\left(W^{(l)}\right). \qquad (1)$$

Here $W$ represents the collection of all weights in the DNN; $E_D(W)$ is the loss on data; $R(\cdot)$ is a non-structured regularization applied to every weight, e.g., the $\ell_2$-norm; and $R_g(\cdot)$ is the structured sparsity regularization on each layer. Because group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The group Lasso regularization on a set of weights $w$ can be represented as $R_g(w) = \sum_{g=1}^{G} \|w^{(g)}\|_g$, where $w^{(g)}$ is a group of partial weights in $w$ and $G$ is the total number of groups. Different groups may overlap. Here $\|\cdot\|_g$ is the group Lasso norm, i.e., $\|w^{(g)}\|_g = \sqrt{\sum_{i=1}^{|w^{(g)}|} \left(w_i^{(g)}\right)^2}$, where $|w^{(g)}|$ is the number of weights in $w^{(g)}$.
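As a concrete reading of Eq. (1), the sketch below (our PyTorch illustration; the paper's implementation is in Caffe) adds the non-structured ℓ2 term and the group Lasso term to the data loss for an arbitrary list of weight groups. The weight shape and penalty strengths are placeholders.

```python
import torch

def ssl_loss(data_loss, weights, groups, lam=1e-4, lam_g=1e-3):
    """Eq. (1): data loss + lambda * R(W) + lambda_g * sum_g ||w^(g)||_2."""
    l2 = sum(w.pow(2).sum() for w in weights)                  # non-structured R(W)
    group_lasso = sum(g.pow(2).sum().sqrt() for g in groups)   # R_g(W): sum of group norms
    return data_loss + lam * l2 + lam_g * group_lasso

# Example: filter-wise groups of one conv layer with weights of shape (N, C, M, K).
conv_w = torch.randn(16, 8, 3, 3, requires_grad=True)
filter_groups = [conv_w[n] for n in range(conv_w.shape[0])]
loss = ssl_loss(torch.tensor(0.5), [conv_w], filter_groups)
loss.backward()   # group Lasso gradients push whole filters toward zero
```

How the groups are split is exactly what defines the learned structure, as specified next in Section 3.2.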
3.2 Structured sparsity learning for structures of filters, channels, filter shapes and depth

In SSL, the learned "structure" is decided by the way of splitting the groups $w^{(g)}$. We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity illustrated in Figure 2. For simplicity, the $R(\cdot)$ term of Eq. (1) is omitted in the following formulations.

Penalizing unimportant filters and channels. Suppose $W^{(l)}_{n_l,:,:,:}$ is the $n_l$-th filter and $W^{(l)}_{:,c_l,:,:}$ is the $c_l$-th channel of all filters in the $l$-th layer. The optimization target of learning the filter-wise and channel-wise structured sparsity can be defined as

$$E(W) = E_D(W) + \lambda_n \cdot \sum_{l=1}^{L} \left( \sum_{n_l=1}^{N_l} \left\| W^{(l)}_{n_l,:,:,:} \right\|_g \right) + \lambda_c \cdot \sum_{l=1}^{L} \left( \sum_{c_l=1}^{C_l} \left\| W^{(l)}_{:,c_l,:,:} \right\|_g \right). \qquad (2)$$

As indicated in Eq. (2), our approach tends to remove less important filters and channels. Note that zeroing out a filter in the $l$-th layer results in a dummy zero output feature map, which in turn makes the corresponding channel in the $(l+1)$-th layer useless. Hence, we combine the filter-wise and channel-wise structured sparsity in the learning simultaneously.

Learning arbitrary shapes of filters. As illustrated in Figure 2, $W^{(l)}_{:,c_l,m_l,k_l}$ denotes the vector of all corresponding weights located at spatial position $(m_l, k_l)$ in the 2D filters across the $c_l$-th channel. Thus, we define $W^{(l)}_{:,c_l,m_l,k_l}$ as the shape fiber related to learning arbitrary filter shapes, because a homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The optimization target of learning the shapes of filters becomes:

$$E(W) = E_D(W) + \lambda_s \cdot \sum_{l=1}^{L} \sum_{c_l=1}^{C_l} \sum_{m_l=1}^{M_l} \sum_{k_l=1}^{K_l} \left\| W^{(l)}_{:,c_l,m_l,k_l} \right\|_g. \qquad (3)$$

Regularizing layer depth. We also explore depth-wise sparsity to regularize the depth of DNNs in order to improve accuracy and reduce computation cost. The corresponding optimization target is $E(W) = E_D(W) + \lambda_d \cdot \sum_{l=1}^{L} \| W^{(l)} \|_g$. Different from the other sparsification techniques discussed above, zeroing out all the filters in a layer cuts off the message propagation in the DNN, so that the output neurons cannot perform any classification. Inspired by the structure of highway networks [17] and deep residual networks [5], we propose to leverage shortcuts across layers to solve this issue. As illustrated in Figure 2, even when SSL removes an entire unimportant layer, feature maps are still forwarded through the shortcut.

3.3 Structured sparsity learning for computationally efficient structures

All proposed schemes in Section 3.2 can learn a compact DNN for computation cost reduction. Moreover, some variants of the formulations of these schemes can directly learn structures that can be efficiently computed.

2D-filter-wise sparsity for convolution. 3D convolution in DNNs is essentially a composition of 2D convolutions. To perform efficient convolution, we explored a fine-grain variant of filter-wise sparsity, namely 2D-filter-wise sparsity, which spatially enforces group Lasso on each 2D filter $W^{(l)}_{n_l,c_l,:,:}$. The saved convolution is proportional to the percentage of removed 2D filters. This fine-grain version of filter-wise sparsity can more efficiently reduce the computation associated with convolution: because the group sizes are much smaller and the weight-updating gradients are thus sharper, it helps group Lasso to quickly obtain a high ratio of zero groups for a large-scale DNN.

Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in DNNs is commonly converted to a GEneral Matrix Multiplication (GEMM) by lowering weight tensors and feature tensors to matrices [18]. For example, in Caffe [19], a 3D filter $W^{(l)}_{n_l,:,:,:}$ is reshaped to a row of the weight matrix, and each column is the collection of weights $W^{(l)}_{:,c_l,m_l,k_l}$ related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can therefore directly reduce the dimensions of the weight matrix in GEMM by removing its zero rows and columns. In this context, we use row-wise and column-wise sparsity as interchangeable terminology for filter-wise and shape-wise sparsity, respectively.
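To make the group splitting and the GEMM view concrete, here is a minimal NumPy sketch (not the authors' Caffe code; the tensor size and zeroing pattern are hypothetical). It forms the filter-, channel-, and shape-wise group norms of a 4-D weight tensor, and shrinks the lowered weight matrix by dropping all-zero rows (removed filters) and columns (removed shape fibers).

```python
import numpy as np

def structured_group_norms(w):
    """Group Lasso norms for the structures of Section 3.2; w has shape (N, C, M, K)."""
    filter_norms  = np.sqrt((w ** 2).sum(axis=(1, 2, 3)))       # N filter-wise groups
    channel_norms = np.sqrt((w ** 2).sum(axis=(0, 2, 3)))       # C channel-wise groups
    shape_norms   = np.sqrt((w ** 2).sum(axis=0)).reshape(-1)   # C*M*K shape-wise groups
    return filter_norms, channel_norms, shape_norms

def shrink_gemm_matrix(w):
    """Lower w to the (N, C*M*K) GEMM matrix and drop its all-zero rows/columns."""
    w2d = w.reshape(w.shape[0], -1)          # rows: filters, columns: shape fibers
    keep_rows = np.abs(w2d).sum(axis=1) > 0
    keep_cols = np.abs(w2d).sum(axis=0) > 0
    return w2d[np.ix_(keep_rows, keep_cols)]

w = np.random.randn(64, 32, 3, 3)
w[:5] = 0.0          # pretend SSL zeroed out 5 whole filters ...
w[:, :10] = 0.0      # ... and 10 whole channels
print(shrink_gemm_matrix(w).shape)   # (59, 198): a smaller dense GEMM remains
```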
4 Experiments

We evaluated the effectiveness of our SSL using published models on three databases: MNIST, CIFAR-10, and ImageNet. Unless otherwise stated, SSL starts from a network whose weights are initialized by the baseline, and speedups are measured for matrix-matrix multiplication by Caffe on a single-thread Intel Xeon E5-2630 CPU.

Table 1: Results after penalizing unimportant filters and channels in LeNet

LeNet #        Error   Filter # §   Channel # §   FLOP §         Speedup §
1 (baseline)   0.9%    20—50        1—20          100%—100%      1.00×—1.00×
2              0.8%    5—19         1—4           25%—7.6%       1.64×—5.23×
3              1.0%    3—12         1—3           15%—3.6%       1.99×—7.44×
§ In the order of conv1—conv2

Table 2: Results after learning filter shapes in LeNet

LeNet #        Error   Filter size §   Channel #   FLOP           Speedup
1 (baseline)   0.9%    25—500          1—20        100%—100%      1.00×—1.00×
4              0.8%    21—41           1—2         8.4%—8.2%      2.33×—6.93×
5              1.0%    7—14            1—1         1.4%—2.8%      5.19×—10.82×
§ The sizes of filters after removing zero shape fibers, in the order of conv1—conv2

4.1 LeNet and multilayer perceptron on MNIST

In the MNIST experiment, we examined the effectiveness of SSL in two types of networks: LeNet [20] implemented by Caffe and a multilayer perceptron (MLP) network. Both networks were trained without data augmentation.

LeNet: When applying SSL to LeNet, we constrain the network with filter-wise and channel-wise sparsity in the convolutional layers to penalize unimportant filters and channels. Table 1 summarizes the remaining filters and channels, floating-point operations (FLOP), and practical speedups. In the table, LeNet 1 is the baseline and the others are the results after applying SSL with different strengths of structured sparsity regularization. The results show that our method achieves a similar error (within ±0.1%) with far fewer filters and channels, and saves significant FLOP and computation time.

To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters in Figure 3. Most filters in LeNet 2 are entirely zeroed out, except for the five most important detectors of stroke patterns, which are sufficient for feature extraction. The accuracy of LeNet 3 (which further removes the weakest and one redundant stroke detector) drops only 0.2% from that of LeNet 2. Compared to the random and blurry filter patterns in LeNet 1, which result from the high freedom of the parameter space, the filters in LeNet 2 & 3 are regularized and converge to smoother and more natural patterns. This explains why our proposed SSL obtains the same-level accuracy with far fewer filters. The smoothness of the filters is also observed in the deeper layers.

The effectiveness of the shape-wise sparsity on LeNet is summarized in Table 2. The baseline LeNet 1 has conv1 filters with a regular 5×5 square shape (size = 25), while LeNet 5 reduces the dimension so that the filters can be constrained by a 2×4 rectangle (size = 7). The 3D shape of the conv2 filters in the baseline is also regularized to a 2D shape in LeNet 5 with only one channel, indicating that only one filter in conv1 is needed. This significantly saves FLOP and computation time.
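As a back-of-envelope check of the FLOP column in Table 1 (our sketch, not the authors' measurement code): the FLOP of a convolutional layer scales with (number of filters) × (number of channels) × (kernel size) × (output size), and the output size cancels when taking the ratio to the baseline.

```python
def conv_flop(num_filters, num_channels, kernel_h, kernel_w, out_h=1, out_w=1):
    # multiply-accumulate count; out_h/out_w cancel in the ratios below
    return 2 * num_filters * num_channels * kernel_h * kernel_w * out_h * out_w

# Baseline LeNet (Tables 1-2): conv1 = 20 filters x 1 channel, conv2 = 50 filters x 20 channels, 5x5 kernels.
base_conv1, base_conv2 = conv_flop(20, 1, 5, 5), conv_flop(50, 20, 5, 5)
# LeNet 2 after SSL (Table 1): conv1 keeps 5 filters / 1 channel, conv2 keeps 19 filters / 4 channels.
ssl_conv1, ssl_conv2 = conv_flop(5, 1, 5, 5), conv_flop(19, 4, 5, 5)
print(ssl_conv1 / base_conv1, ssl_conv2 / base_conv2)   # 0.25 and 0.076 -> the 25% and 7.6% in Table 1
```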
Figure 3: Learned conv1 filters in LeNet 1 (top), LeNet 2 (middle) and LeNet 3 (bottom).

MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e., the number of neurons) of fully-connected layers. The baseline MLP network, composed of two hidden layers with 500 and 300 neurons respectively, obtains a test error of 1.43%. We enforce the group Lasso regularization on all the input (or output) connections of each neuron, including those of the input layer. A neuron whose input connections are all zeroed out can degenerate to a bias neuron in the next layer; similarly, a neuron can degenerate to a removable dummy neuron if all of its output connections are zeroed out. As such, the computation of the GEneral Matrix-Vector (GEMV) product in fully-connected layers can be significantly reduced. Figure 4(a) summarizes the learned structure and FLOP of different MLP networks. The results show that SSL can not only remove hidden neurons but also discover the sparsity of images. For example, Figure 4(b) depicts the number of connections of each input neuron in MLP 2, where 40.18% of the input neurons have zero connections; they concentrate at the boundary of the image. Such a distribution is consistent with our intuition: handwritten digits are usually written in the center, and pixels close to the boundary contain little discriminative classification information.

Figure 4: (a) Results of learning the number of neurons in MLP: the baseline (error 1.43%) has 784–500–300–10 neurons per layer; MLP 2 (error 1.34%) retains 469–294–166–10 neurons with 35.18%–32.54%–55.33% of the per-layer FLOP; MLP 3 (error 1.53%) retains 434–174–78–10 neurons with 19.26%–9.05%–26.00% of the per-layer FLOP. (b) The connection numbers of input neurons (i.e., pixels) in MLP 2 after SSL.
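A small sketch of the fully-connected case described above (our illustration; biases are ignored, and folding a dead neuron's constant output into the next layer's bias is omitted): hidden neurons whose fan-in or fan-out groups are entirely zero can be dropped, shrinking the GEMV in both adjacent layers. Layer sizes follow the 784–500–300–10 MLP; the zeroing pattern is hypothetical.

```python
import numpy as np

def prune_hidden_neurons(w_in, w_out):
    """w_in: (hidden, inputs), rows are a neuron's input connections;
       w_out: (next, hidden), columns are the same neuron's output connections."""
    alive = (np.abs(w_in).sum(axis=1) > 0) & (np.abs(w_out).sum(axis=0) > 0)
    return w_in[alive, :], w_out[:, alive]

w1 = np.random.randn(500, 784)          # input layer -> hidden layer 1
w2 = np.random.randn(300, 500)          # hidden layer 1 -> hidden layer 2
w1[100:, :] = 0.0                       # pretend SSL zeroed the fan-in of 400 hidden neurons
w1p, w2p = prune_hidden_neurons(w1, w2)
print(w1p.shape, w2p.shape)             # (100, 784) (300, 100): smaller GEMVs in both layers
```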
4.2 ConvNet and ResNet on CIFAR-10

We implemented the ConvNet of [1] and deep residual networks (ResNet) [5] on CIFAR-10. When regularizing filters, channels, and filter shapes, the results and observations of both networks are similar to those of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and shape-wise sparsity to reduce the dimensions of the weight matrix in the GEMM of ConvNet. We also learn the depth-wise sparsity of ResNet to regularize the depth of the DNNs.

ConvNet: We use the network from Alex Krizhevsky et al. [1] as the baseline and implement it using Caffe. All the configurations remain the same as the original implementation, except that we added a dropout layer with a ratio of 0.5 in the fully-connected layer to avoid over-fitting. ConvNet is trained without data augmentation. Table 3 summarizes the results of three ConvNet networks. Here, the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns. Figure 5 shows their learned conv1 filters.

Table 3: Learning row-wise and column-wise sparsity of ConvNet on CIFAR-10

ConvNet #      Error   Row sparsity §        Column sparsity §    Speedup §
1 (baseline)   17.9%   12.5%–0%–0%           0%–0%–0%             1.00×–1.00×–1.00×
2              17.9%   50.0%–28.1%–1.6%      0%–59.3%–35.1%       1.43×–3.05×–1.57×
3              16.9%   31.3%–0%–1.6%         0%–42.8%–9.8%        1.25×–2.01×–1.18×
§ In the order of conv1–conv2–conv3

In Table 3, SSL reduces the size of the weight matrix in ConvNet 2 by 50%, 70.7% and 36.1% for the respective convolutional layers and achieves good speedups without accuracy drop. Surprisingly, even without SSL, four conv1 filters of the baseline are actually all-zeros, as shown in Figure 5, demonstrating the great potential of filter sparsity. When SSL is applied, half of the conv1 filters in ConvNet 2 can be zeroed out without accuracy drop. On the other hand, in ConvNet 3, SSL achieves a 1.0% (±0.16%) lower error with a model even smaller than the baseline. In this scenario, SSL performs as a structure regularization that dynamically learns a better network structure (including the number of filters and filter shapes) to reduce the error.
Figure 5: Learned conv1 filters in ConvNet 1 (top), ConvNet 2 (middle) and ConvNet 3 (bottom).

ResNet: To investigate the necessary depth of DNNs under SSL, we use the 20-layer deep residual network (ResNet-20) proposed in [5] as the baseline. The network has 19 convolutional layers and 1 fully-connected layer. Identity shortcuts are utilized to connect feature maps with the same dimensions, while 1×1 convolutional layers are chosen as shortcuts between feature maps with different dimensions. Batch normalization [21] is adopted after convolution and before activation. We use the same data augmentation and training hyper-parameters as in [5]. The final error of the baseline is 8.82%. In SSL, the depth of ResNet-20 is regularized by depth-wise sparsity. Group Lasso regularization is only enforced on the convolutional layers between each pair of shortcut endpoints, excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers with all-zero weights are removed and the network is finally fine-tuned with a base learning rate of 0.01, which is lower than that (i.e., 0.1) of the baseline.

Figure 6: Error vs. layer number after depth regularization by SSL. ResNet-# is the original ResNet in [5] with # layers. SSL-ResNet-# is the depth-regularized ResNet learned by SSL with # layers, including the last fully-connected layer. 32×32 indicates the convolutional layers with an output map size of 32×32, and so forth.

Figure 6 plots the trend of the error vs. the number of layers under different strengths of depth regularization. Compared with the original ResNet in [5], SSL learns a ResNet with 14 layers (SSL-ResNet-14) that reaches a lower error than the baseline with 20 layers (ResNet-20); SSL-ResNet-18 and ResNet-32 achieve errors of 7.40% and 7.51%, respectively. This result implies that SSL can work as a depth regularization to improve classification accuracy. Note that SSL can efficiently learn shallower DNNs without accuracy loss to reduce computation cost; however, this does not mean that the depth of the network is unimportant. The trend in Figure 6 shows that the test error generally declines as more layers are preserved. A slight error rise of SSL-ResNet-20 from SSL-ResNet-18 indicates a suboptimal selection of the depth in the "32×32" group.
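The role of the shortcut for depth-wise sparsity (Section 3.2) can be seen in a two-line sketch (our illustration): if SSL zeroes every weight between a pair of shortcut endpoints, the residual branch contributes nothing, the block reduces to an identity map, and removing it leaves the network function unchanged.

```python
import numpy as np

def residual_block(x, branch):
    return x + branch(x)                      # identity shortcut as in ResNet

zeroed_branch = lambda x: np.zeros_like(x)    # all filters between the shortcut endpoints removed
x = np.random.randn(8, 16, 16)                # a hypothetical feature map
assert np.allclose(residual_block(x, zeroed_branch), x)
```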
4.3 AlexNet on ImageNet

To show the generalization of our method to large-scale DNNs, we evaluate SSL using AlexNet on ILSVRC 2012. CaffeNet [19], the replication of AlexNet [1] with minor changes, is used in our experiments. All training images are rescaled to a size of 256×256. A 227×227 image is randomly cropped from each scaled image and mirrored for data augmentation, and only the center crop is used for validation. The final top-1 validation error is 42.63%. In SSL, AlexNet is first trained with structure regularization; when it converges, zero groups are removed to obtain a DNN with the new structure; finally, the network is fine-tuned without SSL to regain the accuracy.

We first studied 2D-filter-wise and shape-wise sparsity by exploring the trade-offs between computation complexity and classification accuracy.

Figure 7: (a) 2D-filter-wise sparsity and FLOP reduction vs. top-1 error. The vertical dashed line shows the error of the original AlexNet. (b) The reconstruction error of the weight tensor vs. dimensionality. Principal Component Analysis (PCA) is utilized to perform dimensionality reduction to exploit filter redundancy. The eigenvectors corresponding to the largest eigenvalues are selected as the basis of the lower-dimensional space. Dashed lines denote the results of the baselines and solid lines indicate those of AlexNet 5 in Table 4. (c) Speedups of ℓ1-norm and SSL on various CPU and GPU platforms (in the x-axis labels, T# is the number of maximum physical threads in the Xeon CPU). AlexNet 1 and AlexNet 2 in Table 4 are used as testbenches.

Figure 7(a) shows the 2D-filter sparsity (the ratio between the removed 2D filters and the total 2D filters) and the saved FLOP of 2D convolutions vs. the validation error. In Figure 7(a), deeper layers generally have higher sparsity as the group size shrinks and the number of 2D filters grows. 2D-filter sparsity regularization can reduce the total FLOP by 30%–40% without accuracy loss, or reduce the error of AlexNet by about 1% (down to 41.69%) while retaining the original number of parameters. Shape-wise sparsity obtains similar results: in Table 4, for example, AlexNet 5 achieves on average a 1.4× layer-wise speedup on both CPU and GPU without accuracy loss after shape regularization; the top-1 error can also be reduced to 41.83% if the parameters are retained. In Figure 7(a), the obtained DNN with the lowest error has a very low sparsity, indicating that the number of parameters in a DNN is still important for maintaining learning capacity. In this case, SSL works as a regularization that adds a smoothness restriction to the model in order to avoid over-fitting. Figure 7(b) compares the results of dimensionality reduction of weight tensors in the baseline and our SSL-regularized AlexNet. The results show that the smoothness restriction enforces parameter searching in a lower-dimensional space and enables a lower-rank approximation of the DNNs. Therefore, SSL can work together with low rank approximation to achieve even higher model compression.

Besides the above analyses, the computation efficiencies of structured sparsity and non-structured sparsity are compared in Caffe using standard off-the-shelf libraries, i.e., Intel Math Kernel Library on CPU and CUDA cuBLAS and cuSPARSE on GPU. We use SSL to learn an AlexNet with high column-wise and row-wise sparsity as the representative of the structured sparsity method. ℓ1-norm regularization is selected as the representative of the non-structured sparsity method, instead of connection pruning in [7], because ℓ1-norm obtains a higher sparsity on convolutional layers, as shown by the results of AlexNet 3 and AlexNet 4 in Table 4. Speedups achieved by SSL are measured by GEMM subroutines, where the nonzero rows and columns in each weight matrix are concatenated in consecutive memory space. Note that compared to GEMM, the overhead of concatenation can be ignored. To measure the speedups of ℓ1-norm, sparse weight matrices are stored in the Compressed Sparse Row (CSR) format and computed by sparse-dense matrix multiplication subroutines.
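The two measurement paths just described can be mimicked with off-the-shelf NumPy/SciPy routines; this is a rough sketch only (the matrix sizes, the sparsity pattern, and the BLAS backend are ours, so absolute numbers will differ from Table 4): structured sparsity keeps a smaller dense GEMM, while non-structured sparsity falls back to CSR sparse-dense multiplication.

```python
import time
import numpy as np
import scipy.sparse as sp

def bench(fn, reps=20):
    start = time.time()
    for _ in range(reps):
        fn()
    return (time.time() - start) / reps

w = np.random.randn(384, 3456)            # hypothetical lowered conv weight matrix
x = np.random.randn(3456, 1024)           # lowered feature matrix
dense = bench(lambda: w @ x)              # baseline dense GEMM

# Structured sparsity: ~50% zero rows/columns removed -> a smaller dense GEMM.
w_s, x_s = w[:192, :1728], x[:1728, :]
structured = bench(lambda: w_s @ x_s)

# Non-structured sparsity: ~90% scattered zeros stored in CSR, sparse-dense product.
w_n = sp.csr_matrix(w * (np.random.rand(*w.shape) > 0.9))
nonstructured = bench(lambda: w_n @ x)

print(dense / structured, dense / nonstructured)   # layer-wise speedups vs. the dense baseline
```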
Table 4 compares the obtained sparsity and speedups of ℓ1-norm regularization and SSL on a CPU (Intel Xeon) and a GPU (GeForce GTX TITAN Black) under approximately the same errors, e.g., with acceptable or no accuracy loss. For a fair comparison, after ℓ1-norm regularization the DNN is also fine-tuned by disconnecting all zero-weighted connections, so that 1.39% accuracy is recovered for AlexNet 1. Our experiments show that the DNNs require a very high non-structured sparsity to achieve a reasonable speedup (the speedups are even negative when the sparsity is low). SSL, however, can always achieve positive speedups. With an acceptable accuracy loss, our SSL achieves on average 5.1× and 3.1× layer-wise acceleration on CPU and GPU, respectively, while ℓ1-norm achieves on average only 3.0× and 0.9×. We note that at the same accuracy, our average speedup is indeed higher than that of [6], which adopts heavy hardware customization to overcome the negative impact of non-structured sparsity. Figure 7(c) shows the speedups of ℓ1-norm and SSL on various platforms, including GPUs (Quadro, Tesla and Titan) and CPUs (Intel Xeon E5-2630). SSL achieves on average a ~3× speedup on GPU, while non-structured sparsity obtains no speedup on GPU platforms. On CPU platforms, both methods achieve good speedups and the benefit grows as the processors become weaker. Nonetheless, SSL always achieves on average a ~2× speedup over non-structured sparsity.

Table 4: Sparsity and speedup of AlexNet on ILSVRC 2012

#  Method        Top-1 err.  Statistics        conv1   conv2   conv3   conv4   conv5
1  ℓ1            44.67%      sparsity          67.6%   92.4%   97.2%   96.6%   94.3%
                             CPU speedup       0.80×   2.91×   4.84×   3.83×   2.76×
                             GPU speedup       0.25×   0.52×   1.38×   1.04×   1.36×
2  SSL           44.66%      column sparsity   0.0%    63.2%   76.9%   84.7%   80.7%
                             row sparsity      9.4%    12.9%   40.6%   46.9%   0.0%
                             CPU speedup       1.05×   3.37×   6.27×   9.73×   4.93×
                             GPU speedup       1.00×   2.37×   4.94×   4.03×   3.05×
3  pruning [7]   42.80%      sparsity          16.0%   62.0%   65.0%   63.0%   63.0%
4  ℓ1            42.51%      sparsity          14.7%   76.2%   85.3%   81.5%   76.3%
                             CPU speedup       0.34×   0.99×   1.30×   1.10×   0.93×
                             GPU speedup       0.08×   0.17×   0.42×   0.30×   0.32×
5  SSL           42.53%      column sparsity   0.00%   20.9%   39.7%   39.7%   24.6%
                             CPU speedup       1.00×   1.27×   1.64×   1.68×   1.32×
                             GPU speedup       1.00×   1.25×   1.63×   1.72×   1.36×

5 Conclusion

In this work, we have proposed a Structured Sparsity Learning (SSL) method to regularize the filter, channel, filter shape, and depth structures in deep neural networks (DNN). Our method can enforce the DNN to dynamically learn more compact structures without accuracy loss. The structured compactness of the DNN achieves significant speedups for the DNN evaluation on both CPU and GPU with off-the-shelf libraries. Moreover, a variant of SSL can be performed as a structure regularization to improve the classification accuracy of state-of-the-art DNNs.

Acknowledgments

This work was supported in part by NSF XPS-1337198 and NSF CCF-1615475. The authors thank Drs. Sheng Li and Jongsoo Park for valuable feedback on this work.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2015.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[6] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[9] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
[10] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[11] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[12] Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training CNNs with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
[13] Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
[14] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 68(1):49–67, 2006.
[15] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[16] Jiashi Feng and Trevor Darrell. Learning the structure of deep convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), 2015.
[17] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[18] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[19] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.