Network Pruning

As one of the earliest works in network pruning, Yann LeCun's Optimal Brain Damage (OBD) paper is cited in many of these papers. Some research focuses instead on efficient network module designs: "These models, such as SqueezeNet, MobileNet and Shufflenet, are basically made up of low resolutions convolution with lesser parameters and better performance."

Many of the recent papers I've read emphasize structured pruning (or sparsifying) as a compression and regularization method, as opposed to other techniques such as non-structured pruning (weight sparsifying and connection pruning), low-rank approximation and vector quantization (references to these approaches can be found in the related-work sections of the papers below).

The difference between structured and non-structured pruning: "Non-structured pruning aims to remove single parameters that have little influence on the accuracy of networks". For example, L1-norm regularization on individual weights counts as non-structured pruning, since it is a weight-sparsifying method, i.e. it removes single parameters. The term 'structure' refers to a structured unit in the network: instead of pruning individual weights or connections, structured pruning targets neurons, filters, channels, layers etc. The general implementation idea, however, is the same as penalizing individual weights: a regularization term (mostly in the form of an L1 norm) is added to the loss function to penalize (sparsify) structures.

I focused on structured pruning and read through the following papers:

1. Structured Pruning of Convolutional Neural Networks via L1 Regularization (August 2019)

"(...) network pruning is useful to remove redundant parameters, filters, channels or neurons, and address the over-fitting issue." The paper provides a good review of previous work on non-structured and structured pruning. "This study presents a scheme to prune filters or neurons of fully-connected layers based on L1 regularization to zero out the weights of some filters or neurons."

I didn't quite understand the method and implementation at first. There are two key elements: a mask and a threshold. "(...) the problem of zeroing out the values of some filters can be transformed to zero some mask." || "Though the proposed method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be completely zeroed in practical application, because the objective function (7) is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of the mask value is small enough, it can be considered almost as zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...) The average value of the product of the mask and the weight is used to determine whether the mask is exactly zero or not."

From what I understand, they add an L1 term on the masks to the loss function to penalize useless filters, and a threshold is introduced to decide when a mask is small enough to be considered zero.

They test on MNIST (model: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet-32).

2. Learning Efficient Convolutional Networks through Network Slimming (August 2017) + Git repo

"Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer)."

They provide good insight into the advantages and disadvantages of other computation-reduction methods such as low-rank approximation, vector quantization etc. I believe they use the word 'channel' here to refer to filters (?).

"Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network." --> So instead of a 'mask' they use a 'scaling factor' and impose regularization on that, but the idea is very similar.

"The way BN normalizes the activations motivates us to design a simple and efficient method to incorporates the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics." || "(...) we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. It has the great advantage of introducing no overhead to the network."

They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40), ImageNet (model: VGG-A) and MNIST (model: Lenet).
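To make the scaling-factor idea concrete, here is a minimal PyTorch-style sketch of how an L1 penalty on the BN γ parameters is typically implemented (my own illustration, not the authors' code; `sparsity_lambda` and `prune_threshold` are hypothetical names): the L1 subgradient is added to the γ gradients after the backward pass, and after training the channels whose |γ| falls below a threshold are marked for pruning.

```python
import torch
import torch.nn as nn

def add_bn_l1_subgradient(model: nn.Module, sparsity_lambda: float = 1e-4):
    """Add the subgradient of sparsity_lambda * |gamma| to every BN scaling factor.

    Meant to be called between loss.backward() and optimizer.step(): instead of
    adding the L1 term to the loss itself, its subgradient (lambda * sign(gamma))
    is added directly to the gradients of the BN weights (the gamma factors).
    """
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            m.weight.grad.add_(sparsity_lambda * torch.sign(m.weight.detach()))

def channels_below_threshold(model: nn.Module, prune_threshold: float = 1e-2):
    """Return, per BN layer, a boolean mask of channels whose |gamma| is 'almost zero'."""
    return {name: m.weight.detach().abs() < prune_threshold
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```

The actual slimming step would then remove the corresponding convolutional filters (and the matching input channels of the following layer) and fine-tune the pruned network, as described in the paper.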
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo

"(...) we propose Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during the training. SSL is a generic regularization to adaptively adjust multiple structures in DNN, including structures of filters, channels, and filter shapes within each layer, and structure of depth beyond the layers." || "(...) offering not only well-regularized big models with improved accuracy but greatly accelerated computation."

"Here W represents the collection of all weights in the DNN; ED(W) is the loss on data; R(·) is non-structured regularization applying on every weight, e.g., L2-norm; and Rg(·) is the structured sparsity regularization on each layer. Because Group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights w can be represented as Rg(w) = Σ_{g=1}^{G} ||w(g)||_2, where w(g) is a group of partial weights in w and G is the total number of groups." || "In SSL, the learned “structure” is decided by the way of splitting groups of w(g). We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity (...)"

They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-20) and ImageNet (model: AlexNet). The authors also provide a visualization of filters after pruning, showing that only important detectors of patterns remain. In the conclusions: "Moreover, a variant of SSL can be performed as structure regularization to improve classification accuracy of state-of-the-art DNNs."
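To make the group-Lasso term above concrete, here is a small PyTorch-style sketch of a filter-wise grouping (my own illustration of the formula, not the SSL implementation; `lambda_g` is a hypothetical name): each group w(g) is one output filter of a convolutional layer, and the penalty is the sum of the groups' L2 norms, which pushes whole filters towards zero.

```python
import torch
import torch.nn as nn

def filter_wise_group_lasso(model: nn.Module):
    """Compute sum over conv layers of sum_g ||w(g)||_2, one group per output filter.

    Because the L2 norm of a group only reaches zero when every weight in the
    group is zero, minimizing this term removes entire filters rather than
    scattered individual weights, i.e. it induces structured sparsity.
    """
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW);
            # flatten each filter into a vector and take its L2 norm.
            penalty = penalty + m.weight.flatten(start_dim=1).norm(p=2, dim=1).sum()
    return penalty

# Hypothetical training objective, mirroring the paper's ED(W) + lambda_g * Rg(W):
# loss = criterion(model(x), y) + lambda_g * filter_wise_group_lasso(model)
```

Other groupings (channel-wise, shape-wise, depth-wise) follow the same pattern; only the way the weight tensor is split into groups changes.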
4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)

"After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer." || "We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights (...)"

Although the description above implies that pruning is only done for FC layers, they also prune convolutional layers, although the methods section doesn't provide much detail on this. There is, however, this statement where they explain retraining: "(...) we fix the parameters for CONV layers and only retrain the FC layers after pruning the FC layers, and vice versa." The results section also shows that convolutional-layer connections were pruned on the tested models. (A minimal code sketch of this prune-and-retrain loop is at the end of these notes.)

They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and ImageNet (models: AlexNet, VGG-16). The authors provide a visualization of the sparsity patterns of neurons after pruning (for an FC layer), which shows that pruning can detect visual-attention regions.

The method in this paper targets individual parameters (weights), so technically it should be considered a non-structured pruning method. However, the reason I think it is referenced as a structured pruning method is that if all connections of a neuron are pruned (i.e. all its input and output weights fall below the threshold), the neuron itself is removed from the network: "After pruning connections, neurons with zero input connections or zero output connections may be safely pruned."

SIDENOTE: They touch on the use of global average pooling instead of fully connected layers in CNNs: "There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling."

5. Many more papers can be picked from the references of the ones above. There is, for example, a paper on Bayesian compression for deep learning from 2017. Its hypothesis is: "By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights." However, the method is mathematically heavy and the related-work references are quite old (1990s, 2000s).
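Going back to paper 4, here is a minimal sketch of the magnitude-threshold prune-and-retrain idea (my own illustration under assumed names such as `threshold`; how the paper actually chooses its per-layer thresholds is described there): weights below the threshold are zeroed via a mask, and the mask is re-applied during retraining so pruned connections stay at zero while the surviving weights compensate.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Module, threshold: float) -> torch.Tensor:
    """Zero out all weights of `layer` (e.g. nn.Linear or nn.Conv2d) below `threshold`.

    Returns a float mask of the surviving connections; the mask has to be
    re-applied after every optimizer step during retraining so that pruned
    connections stay at zero while the remaining weights adapt.
    """
    mask = (layer.weight.detach().abs() >= threshold).float()
    with torch.no_grad():
        layer.weight.mul_(mask)
    return mask

# Hypothetical retraining step for one layer:
# loss.backward()
# optimizer.step()
# with torch.no_grad():
#     layer.weight.mul_(mask)   # keep pruned connections at zero
```

After such connection pruning, neurons whose input or output connections have all been removed can be dropped entirely, which is what gives the end result its structured flavor despite the per-weight criterion.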