Network Pruning
Yann LeCun's Optimal Brain Damage (OBD) paper, one of the earliest works in
network pruning, has been cited in many of the papers.
Some research focuses on modular network designs. "These models, such as
SqueezeNet, MobileNet and ShuffleNet, are basically made up of low resolutions
convolution with lesser parameters and better performance."
Many recent papers I've read emphasize structured pruning (or sparsifying) as a
compression and regularization method, as opposed to other techniques such as
non-structured pruning (weight sparsifying and connection pruning), low-rank
approximation and vector quantization (references to these approaches can be
found in the related work sections of the following papers).
Difference between structured and non-structured pruning:
"Non-structured pruning aims to remove single parameters that have little
influence on the accuracy of networks". For example, L1-norm regularization on
individual weights is noted as non-structured pruning, since it's basically a
weight-sparsifying method, i.e., it removes single parameters.
The term 'structure' refers to a structured unit in the network. So instead of
pruning individual weights or connections, structured pruning targets neurons,
filters, channels, layers etc. But the general implementation idea is the same as
penalizing individual weights: introducing a regularization term (mostly in the
form of L1-norm) to the loss function to penalize (sparsify) structures.
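To make the contrast concrete, here is a minimal PyTorch-style sketch (my own
illustration, not taken from any specific paper below) of the two penalty styles
added to the training loss: plain L1 on individual weights vs. a group-wise
penalty on whole conv filters.
```python
import torch.nn as nn

def weight_l1_penalty(model, lam=1e-5):
    # non-structured: penalize every individual weight
    return lam * sum(p.abs().sum() for p in model.parameters())

def filter_group_penalty(model, lam=1e-4):
    # structured: one group per conv filter, so entire filters are pushed to zero
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW) -> one L2 norm per filter
            penalty = penalty + m.weight.flatten(1).norm(p=2, dim=1).sum()
    return lam * penalty

# usage in a training step (model, criterion, inputs, targets are assumed names):
# loss = criterion(model(inputs), targets) + filter_group_penalty(model)
```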
I focused on structured pruning and read through the following papers:
1. Structured Pruning of Convolutional Neural Networks via L1
Regularization (August 2019)
"(...) network pruning is useful to remove redundant parameters, filters,
channels or neurons, and address the over-fitting issue."
Provides a good review of previous work on non-structured and structured
pruning.
"This study presents a scheme to prune filters or neurons of fully-connected
layers based on L1 regularization to zero out the weights of some filters or
neurons."
Didn't quite understand the method and implementation. There are two key
elements: mask and threshold. "(...) the problem of zeroing out the values of
some filters can be transformed to zero some mask." || "Though the proposed
method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be
completely zeroed in practical application, because the objective function (7) is
non-convex and the global optimal solution may not be obtained. A strategy is
adopted in the proposed method to solve this problem. If the order of
magnitude of the mask value is small enough, it can be considered almost as
zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...)
The average value of the product of the mask and the weight is used to
determine whether the mask is exactly zero or not."
From what I understand, they use an L1 norm in the loss function to penalize
useless filters by penalizing their masks, and a threshold value is introduced
to determine when a mask is small enough to be considered zero.
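A rough sketch of how I picture the mask + threshold idea (my own
reconstruction, not the authors' code; the layer class, threshold value and
regularization coefficient are assumptions):
```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    """Conv layer with one scalar mask per output filter; the mask can later be
    absorbed into the conv weights, so the network topology is preserved."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k)
        self.mask = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        return self.conv(x) * self.mask.view(1, -1, 1, 1)

def mask_l1_penalty(model, lam=1e-4):
    # L1 term on the masks, added to the training loss
    return lam * sum(m.mask.abs().sum() for m in model.modules()
                     if isinstance(m, MaskedConv2d))

def filters_considered_zero(layer, threshold=1e-4):
    # average |mask * weight| per filter, compared against the threshold
    score = (layer.mask.abs().view(-1, 1)
             * layer.conv.weight.flatten(1).abs()).mean(dim=1)
    return (score < threshold).nonzero().flatten()
```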
They test on MNIST (models: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet-
32)
2. Learning Efficient Convolutional Networks through Network Slimming (August
2017) + Git repo
"Our approach imposes L1 regular- ization on the scaling factors in batch
normalization (BN) layers, thus it is easy to implement without introducing any
change to existing CNN architectures. Pushing the values of BN scaling factors
towards zero with L1 regularization enables us to identify insignificant channels
(or neurons), as each scaling factor corresponds to a specific con- volutional
channel (or a neuron in a fully-connected layer)."
They provide good insight into the advantages and disadvantages of other
computation reduction methods such as low-rank approximation, vector
quantization, etc.
I believe here they use the word 'channel' to refer to filters (?).
"Our idea is introducing a scaling factor γ for each channel, which is multiplied
to the output of that channel. Then we jointly train the network weights and
these scaling factors, with sparsity regularization imposed on the latter. Finally
we prune those channels with small factors, and fine-tune the pruned network.
" --> so instead of 'mask' they use the 'scaling factor' and impose regularization
on that, but the idea is very similar.
"The way BN normalizes the activations motivates us to design a simple and
efficient method to incorporates the channel-wise scaling factors. Particularly,
BN layer normalizes the internal activa- tions using mini-batch statistics." || "
(...) we can directly leverage the γ parameters in BN layers as the scaling factors
we need for network slim- ming. It has the great advantage of introducing no
overhead to the network." They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40),
ImageNet (model: VGG-A) and MNIST (model: Lenet)
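A rough PyTorch sketch of the slimming idea as I understand it (in PyTorch the
BN γ is bn.weight; the sparsity coefficient and the global pruning ratio are my
assumptions, not the paper's values):
```python
import torch
import torch.nn as nn

def bn_l1_penalty(model, lam=1e-5):
    # L1 term on all BN scaling factors (gamma = bn.weight in PyTorch)
    return lam * sum(m.weight.abs().sum() for m in model.modules()
                     if isinstance(m, nn.BatchNorm2d))

def channels_below_threshold(model, prune_ratio=0.5):
    # rank all gammas globally and mark the smallest prune_ratio fraction
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    thr = torch.quantile(gammas, prune_ratio)
    return {name: (m.weight.detach().abs() < thr).nonzero().flatten()
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}

# training loss: loss = criterion(model(inputs), targets) + bn_l1_penalty(model)
```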
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo
" (...) we propose Structured Sparsity Learning (SSL) method to directly learn a
compressed structure of deep CNNs by group Lasso regularization during the
training. SSL is a generic regularization to adaptively adjust multiple structures
in DNN, including structures of filters, channels, and filter shapes within each
layer, and structure of depth beyond the layers." || " (...) offering not only well-
regularized big models with improved accuracy but greatly accelerated
computation."
 "Here W represents the collection of all weights in the DNN; ED(W) is the loss
on data; R(·) is non-structured regularization applying on every weight, e.g., L2-
norm; and Rg(·) is the structured sparsity regularization on each layer. Because
Group Lasso can effectively zero out all weights in some groups [14][15], we
adopt it in our SSL. The regularization of group Lasso on a set of weights w can
be represented as $R_g(w) = \sum_{g=1}^{G} \lVert w^{(g)} \rVert_2$, where w(g) is a
group of partial weights in w and G is the total number of groups. " || "In SSL,
the learned “structure” is decided by the way of splitting
groups of w(g). We investigate and formulate the filter-wise, channel-wise,
shape-wise, and depth-wise structured sparsity (...)"
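A sketch of the group-Lasso term with filter-wise and channel-wise groups, as I
read the SSL formulation (my own PyTorch rendering, not the authors' code; the
λ values are placeholders):
```python
import torch.nn as nn

def ssl_group_lasso(model, lam_filter=1e-4, lam_channel=1e-4):
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            W = m.weight  # (out_channels, in_channels, kH, kW)
            # filter-wise groups: one group per output filter
            penalty = penalty + lam_filter * W.flatten(1).norm(p=2, dim=1).sum()
            # channel-wise groups: one group per input channel
            penalty = penalty + lam_channel * W.transpose(0, 1).flatten(1).norm(p=2, dim=1).sum()
    return penalty

# added to the data loss, alongside the usual non-structured regularization R(·)
```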
They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-
20) and ImageNet (model: AlexNet)
The authors also provide a visualization of filters after pruning, showing that
only important detectors of patterns remain after pruning.
In conclusions: "Moreover, a variant of SSL can be performed as structure
regularization to improve classification accuracy of state-of-the-art DNNs."
4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)
"After an initial training phase, we remove all connections whose weight is
lower than a threshold. This pruning converts a dense, fully-connected layer to
a sparse layer." || "We then retrain the sparse network so the remaining
connections can compensate for the connections that have been removed. The
phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network
connectivity in addition to the weights (...)"
Although the description above implies that pruning was done only for FC
layers, they also prune convolutional layers, although they don't provide much
detail on this in the methods. But there is this statement when they explain
retraining: "(...) we fix the parameters for CONV layers and only retrain the FC
layers after pruning the FC layers, and vice versa." The results section also
shows that convolutional layer connections were pruned on the tested
models.
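A sketch of the prune-then-retrain loop described above (the threshold, the
loop structure and the helper names are my assumptions, not taken from the
paper):
```python
import torch.nn as nn

def magnitude_prune_(module, threshold):
    """Zero out connections whose |weight| is below the threshold; returns the binary mask."""
    mask = (module.weight.detach().abs() >= threshold).float()
    module.weight.data.mul_(mask)
    return mask

# iterative prune / retrain loop (train_one_epoch and the hyperparameters are assumed):
# for _ in range(prune_iterations):
#     masks = {m: magnitude_prune_(m, threshold) for m in model.modules()
#              if isinstance(m, (nn.Linear, nn.Conv2d))}
#     for _ in range(retrain_epochs):
#         train_one_epoch(model)              # normal training step
#         for m, mask in masks.items():       # keep pruned connections at zero
#             m.weight.data.mul_(mask)
```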
They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and
ImageNet (models: AlexNet, VGG-16)
The authors provide a visualization of the sparsity patterns of neurons after
pruning (for an FC layer) which shows that pruning can detect visual attention
regions.
The method used in this paper targets individual parameters (weights) for
pruning, so technically it should be considered a non-structured pruning
method. However, the reason I think it is referenced as a structured pruning
method is that if all connections of a neuron are pruned (i.e., all its input and
output weights were below the threshold), the neuron itself is removed from the
network: "After pruning connections, neurons with zero input connections or
zero output connections may be safely pruned."
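A small sketch of that "dead neuron" observation: after connection pruning, a
hidden unit whose input row or output column is entirely zero can be dropped
(the layer shapes here are illustrative):
```python
import torch

def dead_hidden_units(fc1_weight, fc2_weight):
    """fc1_weight: (hidden, in), fc2_weight: (out, hidden). Returns indices of
    hidden units left with no input or no output connections after pruning."""
    no_input = fc1_weight.abs().sum(dim=1) == 0    # entire incoming row pruned
    no_output = fc2_weight.abs().sum(dim=0) == 0   # entire outgoing column pruned
    return torch.nonzero(no_input | no_output).flatten()
```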
SIDENOTE: They touch on the use of global average pooling instead of fully
connected layers in CNNs: "There have been other attempts to reduce the
number of parameters of neural networks by replacing the fully connected
layer with global average pooling."
5. Many more can be picked from the references of these papers.
There's a paper on Bayesian compression for Deep Learning from 2017. Their
hypothesis is: "By employing sparsity inducing priors for hidden units (and not
individual weights) we can prune neurons including all their ingoing and outgoing
weights." However, the method is mathematically heavy and the related work
references are quite old (1990s, 2000s).