Network Pruning

As one of the earliest works in network pruning, Yann LeCun's Optimal Brain Damage (OBD) paper is cited in many of these papers. Some research focuses instead on efficient network module designs: "These models, such as SqueezeNet, MobileNet and Shufflenet, are basically made up of low resolutions convolution with lesser parameters and better performance."

Many of the recent papers I've read emphasize structured pruning (or sparsifying) as a compression and regularization method, as opposed to other techniques such as non-structured pruning (weight sparsifying and connection pruning), low-rank approximation and vector quantization (references to these approaches can be found in the related-work sections of the papers below).

The difference between structured and non-structured pruning: "Non-structured pruning aims to remove single parameters that have little influence on the accuracy of networks". For example, L1-norm regularization on individual weights counts as non-structured pruning, since it is a weight-sparsifying method, i.e. it removes single parameters. The term 'structure' refers to a structured unit in the network: instead of pruning individual weights or connections, structured pruning targets neurons, filters, channels, layers etc. The general implementation idea, however, is the same as penalizing individual weights: a regularization term (mostly in the form of an L1 norm) is added to the loss function to penalize (sparsify) structures.

I focused on structured pruning and read through the following papers:

1. Structured Pruning of Convolutional Neural Networks via L1 Regularization (August 2019)

"(...) network pruning is useful to remove redundant parameters, filters, channels or neurons, and address the over-fitting issue." The paper provides a good review of previous work on non-structured and structured pruning. "This study presents a scheme to prune filters or neurons of fully-connected layers based on L1 regularization to zero out the weights of some filters or neurons."

I didn't quite understand the method and implementation at first. There are two key elements: a mask and a threshold. "(...) the problem of zeroing out the values of some filters can be transformed to zero some mask." || "Though the proposed method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be completely zeroed in practical application, because the objective function (7) is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of the mask value is small enough, it can be considered almost as zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...) The average value of the product of the mask and the weight is used to determine whether the mask is exactly zero or not."

From what I understand, they add an L1 term on the masks to the loss function to penalize useless filters, and a threshold is introduced to decide when a mask is small enough to be considered zero.

They test on MNIST (model: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet-32).

2. Learning Efficient Convolutional Networks through Network Slimming (August 2017) + Git repo

"Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer)."

They provide good insight into the advantages and disadvantages of other computation-reduction methods such as low-rank approximation, vector quantization etc. I believe they use the word 'channel' here to refer to filters (?).

"Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network." --> So instead of a 'mask' they use a 'scaling factor' and impose regularization on that, but the idea is very similar.

"The way BN normalizes the activations motivates us to design a simple and efficient method to incorporates the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics." || "(...) we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. It has the great advantage of introducing no overhead to the network."

They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40), ImageNet (model: VGG-A) and MNIST (model: Lenet).
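To make the scaling-factor idea concrete, here is a minimal PyTorch-style sketch of how an L1 penalty on the BN γ parameters is typically implemented (my own illustration, not the authors' code; `sparsity_lambda` and `prune_threshold` are hypothetical names): the L1 subgradient is added to the γ gradients after the backward pass, and after training the channels whose |γ| falls below a threshold are marked for pruning.

```python
import torch
import torch.nn as nn

def add_bn_l1_subgradient(model: nn.Module, sparsity_lambda: float = 1e-4):
    """Add the subgradient of sparsity_lambda * |gamma| to every BN scaling factor.

    Meant to be called between loss.backward() and optimizer.step(): instead of
    adding the L1 term to the loss itself, its subgradient (lambda * sign(gamma))
    is added directly to the gradients of the BN weights (the gamma factors).
    """
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            m.weight.grad.add_(sparsity_lambda * torch.sign(m.weight.detach()))

def channels_below_threshold(model: nn.Module, prune_threshold: float = 1e-2):
    """Return, per BN layer, a boolean mask of channels whose |gamma| is 'almost zero'."""
    return {name: m.weight.detach().abs() < prune_threshold
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```

The actual slimming step would then remove the corresponding convolutional filters (and the matching input channels of the following layer) and fine-tune the pruned network, as described in the paper.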
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo

"(...) we propose Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during the training. SSL is a generic regularization to adaptively adjust multiple structures in DNN, including structures of filters, channels, and filter shapes within each layer, and structure of depth beyond the layers." || "(...) offering not only well-regularized big models with improved accuracy but greatly accelerated computation."

"Here W represents the collection of all weights in the DNN; ED(W) is the loss on data; R(·) is non-structured regularization applying on every weight, e.g., L2-norm; and Rg(·) is the structured sparsity regularization on each layer. Because Group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights w can be represented as Rg(w) = Σ_{g=1}^{G} ||w(g)||_2, where w(g) is a group of partial weights in w and G is the total number of groups." || "In SSL, the learned “structure” is decided by the way of splitting groups of w(g). We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity (...)"

They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-20) and ImageNet (model: AlexNet). The authors also provide a visualization of filters after pruning, showing that only important detectors of patterns remain. In the conclusions: "Moreover, a variant of SSL can be performed as structure regularization to improve classification accuracy of state-of-the-art DNNs."
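To make the group-Lasso term above concrete, here is a small PyTorch-style sketch of a filter-wise grouping (my own illustration of the formula, not the SSL implementation; `lambda_g` is a hypothetical name): each group w(g) is one output filter of a convolutional layer, and the penalty is the sum of the groups' L2 norms, which pushes whole filters towards zero.

```python
import torch
import torch.nn as nn

def filter_wise_group_lasso(model: nn.Module):
    """Compute sum over conv layers of sum_g ||w(g)||_2, one group per output filter.

    Because the L2 norm of a group only reaches zero when every weight in the
    group is zero, minimizing this term removes entire filters rather than
    scattered individual weights, i.e. it induces structured sparsity.
    """
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW);
            # flatten each filter into a vector and take its L2 norm.
            penalty = penalty + m.weight.flatten(start_dim=1).norm(p=2, dim=1).sum()
    return penalty

# Hypothetical training objective, mirroring the paper's ED(W) + lambda_g * Rg(W):
# loss = criterion(model(x), y) + lambda_g * filter_wise_group_lasso(model)
```

Other groupings (channel-wise, shape-wise, depth-wise) follow the same pattern; only the way the weight tensor is split into groups changes.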
4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)

"After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer." || "We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights (...)"

Although the description above implies that pruning is only done for FC layers, they also prune convolutional layers, although the methods section doesn't provide much detail on this. There is, however, this statement where they explain retraining: "(...) we fix the parameters for CONV layers and only retrain the FC layers after pruning the FC layers, and vice versa." The results section also shows that convolutional-layer connections were pruned on the tested models. (A minimal code sketch of this prune-and-retrain loop is at the end of these notes.)

They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and ImageNet (models: AlexNet, VGG-16). The authors provide a visualization of the sparsity patterns of neurons after pruning (for an FC layer), which shows that pruning can detect visual-attention regions.

The method in this paper targets individual parameters (weights), so technically it should be considered a non-structured pruning method. However, the reason I think it is referenced as a structured pruning method is that if all connections of a neuron are pruned (i.e. all its input and output weights fall below the threshold), the neuron itself is removed from the network: "After pruning connections, neurons with zero input connections or zero output connections may be safely pruned."

SIDENOTE: They touch on the use of global average pooling instead of fully connected layers in CNNs: "There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling."

5. Many more papers can be picked from the references of the ones above. There is, for example, a paper on Bayesian compression for deep learning from 2017. Its hypothesis is: "By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights." However, the method is mathematically heavy and the related-work references are quite old (1990s, 2000s).
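Going back to paper 4, here is a minimal sketch of the magnitude-threshold prune-and-retrain idea (my own illustration under assumed names such as `threshold`; how the paper actually chooses its per-layer thresholds is described there): weights below the threshold are zeroed via a mask, and the mask is re-applied during retraining so pruned connections stay at zero while the surviving weights compensate.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Module, threshold: float) -> torch.Tensor:
    """Zero out all weights of `layer` (e.g. nn.Linear or nn.Conv2d) below `threshold`.

    Returns a float mask of the surviving connections; the mask has to be
    re-applied after every optimizer step during retraining so that pruned
    connections stay at zero while the remaining weights adapt.
    """
    mask = (layer.weight.detach().abs() >= threshold).float()
    with torch.no_grad():
        layer.weight.mul_(mask)
    return mask

# Hypothetical retraining step for one layer:
# loss.backward()
# optimizer.step()
# with torch.no_grad():
#     layer.weight.mul_(mask)   # keep pruned connections at zero
```

After such connection pruning, neurons whose input or output connections have all been removed can be dropped entirely, which is what gives the end result its structured flavor despite the per-weight criterion.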