Network Pruning
As one of the earliest works in network pruning, Yann LeCun's Optimal Brain Damage (OBD) paper has been cited in many of the papers.

Some research focuses on module network designs. "These models, such as SqueezeNet, MobileNet and Shufflenet, are basically made up of low resolutions convolution with lesser parameters and better performance."

Many recent papers I've read emphasize structured pruning (or sparsifying) as a compression and regularization method, as opposed to other techniques such as non-structured pruning (weight sparsifying and connection pruning), low-rank approximation and vector quantization (references to these approaches can be found in the related work sections of the following papers).
Difference between structured and non-structured pruning:

"Non-structured pruning aims to remove single parameters that have little influence on the accuracy of networks". For example, L1-norm regularization on weights is noted as non-structured pruning, since it is basically a weight sparsifying method, i.e. it removes single parameters.
The term 'structure' refers to a structured unit in the network. So instead of pruning individual weights or connections, structured pruning targets neurons, filters, channels, layers, etc. But the general implementation idea is the same as penalizing individual weights: introducing a regularization term (mostly in the form of an L1-norm) to the loss function to penalize (sparsify) structures.
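To make the contrast concrete for myself, here is a rough PyTorch sketch of the two flavours (my own illustration, not taken from any of the papers; `model` is any torch.nn.Module and the lambda_* weights are hypothetical):

import torch

def l1_on_weights(model):
    # Non-structured: penalize every individual weight (weight sparsifying).
    return sum(p.abs().sum() for p in model.parameters())

def l1_on_filters(model):
    # Structured: penalize whole Conv2d filters, so each filter is pushed
    # towards all-zero as a unit (an L1 sum over per-filter norms).
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW)
            penalty = penalty + m.weight.flatten(1).norm(p=2, dim=1).sum()
    return penalty

# loss = task_loss + lambda_w * l1_on_weights(model)   # non-structured
# loss = task_loss + lambda_s * l1_on_filters(model)   # structured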
I focused on structured pruning and read through the following papers:
1. Structured Pruning of Convolutional Neural Networks via L1 Regularization (August 2019)

"(...) network pruning is useful to remove redundant parameters, filters, channels or neurons, and address the over-fitting issue."

Provides a good review of previous work on non-structured and structured pruning.

"This study presents a scheme to prune filters or neurons of fully-connected layers based on L1 regularization to zero out the weights of some filters or neurons."
Didn't quite understand the method and implementation. There are two key elements: mask and threshold. "(...) the problem of zeroing out the values of some filters can be transformed to zero some mask." || "Though the proposed method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be completely zeroed in practical application, because the objective function (7) is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of the mask value is small enough, it can be considered almost as zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...) The average value of the product of the mask and the weight is used to determine whether the mask is exactly zero or not."

From what I understand, they use an L1 norm in the loss function to penalize useless filters through penalizing masks, and a threshold value is introduced to determine when a mask is small enough to be considered zero.
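My rough reading of the idea as a sketch (my own illustration, not the authors' implementation; the per-filter scalar mask and the 1e-4 threshold value are assumptions just to make it runnable):

import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    # One learnable scalar mask per output filter, multiplied into that
    # filter's output; the mask can later be absorbed into the weights,
    # so the network topology is preserved.
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k)
        self.mask = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        return self.conv(x) * self.mask.view(1, -1, 1, 1)

def mask_l1_penalty(model):
    # Added to the loss to push masks (hence whole filters) towards zero.
    return sum(m.mask.abs().sum() for m in model.modules()
               if isinstance(m, MaskedConv2d))

def filters_considered_zero(layer, threshold=1e-4):
    # Per filter, the average of |mask * weight| is compared against a small
    # threshold to decide whether the mask counts as exactly zero.
    score = (layer.mask.view(-1, 1, 1, 1) * layer.conv.weight).abs().mean(dim=(1, 2, 3))
    return (score < threshold).nonzero().flatten()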
They test on MNIST (models: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet-32).
2. Learning Efficient Convolutional Networks through Network Slimming (August 2017) + Git repo
"Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer)."

They provide good insight into the advantages and disadvantages of other computation reduction methods such as low rank approximation, vector quantization, etc.

I believe here they use the word 'channel' to refer to filters (?).

"Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network." --> so instead of a 'mask' they use a 'scaling factor' and impose regularization on that, but the idea is very similar.
"The way BN normalizes the activations motivates us to design a simple and efficient method to incorporates the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics." || "(...) we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. It has the great advantage of introducing no overhead to the network."

They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40), ImageNet (model: VGG-A) and MNIST (model: Lenet).
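The slimming recipe, as I understand it, boils down to something like this sketch (my own rough illustration, not the authors' repo; the global 50% prune ratio is just an assumed example):

import torch
import torch.nn as nn

def bn_l1_penalty(model):
    # Sparsity term on the BN scaling factors (gamma), added to the training loss.
    return sum(m.weight.abs().sum() for m in model.modules()
               if isinstance(m, nn.BatchNorm2d))  # m.weight is gamma

def channels_to_prune(model, prune_ratio=0.5):
    # After training with the penalty, rank all gammas globally and mark the
    # smallest fraction of channels for removal (followed by fine-tuning).
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    return {name: (m.weight.detach().abs() < threshold)  # True = prune this channel
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}

# loss = task_loss + lam * bn_l1_penalty(model)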
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo

"(...) we propose Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during the training. SSL is a generic regularization to adaptively adjust multiple structures in DNN, including structures of filters, channels, and filter shapes within each layer, and structure of depth beyond the layers." || "(...) offering not only well-regularized big models with improved accuracy but greatly accelerated computation."
The generic SSL objective they minimize has the form

E(W) = E_D(W) + λ·R(W) + λ_g·Σ_{l=1}^{L} R_g(W^(l))

"Here W represents the collection of all weights in the DNN; ED(W) is the loss on data; R(·) is non-structured regularization applying on every weight, e.g., L2-norm; and Rg(·) is the structured sparsity regularization on each layer. Because Group Lasso can effectively zero out all weights in some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights w can be represented as

R_g(w) = Σ_{g=1}^{G} ||w^(g)||_g

, where w(g) is a group of partial weights in w and G is the total number of groups." || "In SSL, the learned "structure" is decided by the way of splitting groups of w(g). We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity (...)"
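For my own notes, the filter-wise and channel-wise variants of the group Lasso term could be sketched roughly like this in PyTorch (my illustration under the definitions above, not the authors' code; ||·||_g is taken as the L2 norm of each group, and the lambda values are hypothetical):

import torch
import torch.nn as nn

def ssl_group_lasso(model, lambda_filter=1e-4, lambda_channel=1e-4):
    # Sum of L2 norms over groups of weights (group Lasso), with filter-wise
    # and channel-wise groups on every Conv2d layer.
    reg = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight  # shape: (out_channels, in_channels, kH, kW)
            # filter-wise: one group per output filter
            reg = reg + lambda_filter * w.flatten(1).norm(p=2, dim=1).sum()
            # channel-wise: one group per input channel
            reg = reg + lambda_channel * w.transpose(0, 1).flatten(1).norm(p=2, dim=1).sum()
    return reg

# E(W) ≈ task_loss + weight_decay_term + ssl_group_lasso(model)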
They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-20) and ImageNet (model: AlexNet).

The authors also provide a visualization of filters after pruning, showing that only important detectors of patterns remain after pruning.

In the conclusions: "Moreover, a variant of SSL can be performed as structure regularization to improve classification accuracy of state-of-the-art DNNs."
4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)

"After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer." || "We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights (...)"
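The prune/retrain loop is simple enough to sketch (my own rough illustration; the paper derives the threshold from each layer's weight distribution, whereas here it is just a user-supplied value, and the mask bookkeeping is my assumption about how to keep pruned connections at zero):

import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_prune(model, threshold):
    # Zero out every connection whose weight magnitude is below the threshold
    # and keep masks so retraining only updates the surviving connections.
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            mask = (m.weight.abs() >= threshold).float()
            m.weight.mul_(mask)
            masks[name] = mask
    return masks

@torch.no_grad()
def apply_masks(model, masks):
    # Call after each optimizer step during retraining so pruned
    # connections stay at zero.
    for name, m in model.named_modules():
        if name in masks:
            m.weight.mul_(masks[name])

# for _ in range(num_rounds):          # iterative pruning + retraining
#     masks = magnitude_prune(model, threshold)
#     retrain(model, ..., after_step=lambda: apply_masks(model, masks))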
Although the quoted description implies the pruning was done only for FC layers, they also prune convolutional layers, although they don't provide much detail on this in the methods. But there's this statement when they explain retraining: "(...) we fix the parameters for CONV layers and only retrain the FC layers after pruning the FC layers, and vice versa." The results section also shows that convolutional layer connections were pruned on the tested models.
They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and ImageNet (models: AlexNet, VGG-16).

The authors provide a visualization of the sparsity patterns of neurons after pruning (for an FC layer) which shows that pruning can detect visual attention regions.
The method used in this paper targets individual parameters (weights) to prune, so technically this should be considered a non-structured pruning method. However, the reason I think it is referenced as a structured pruning method is that if all connections of a neuron are pruned (i.e. all of its input and output weights were below the threshold), the neuron itself is removed from the network: "After pruning connections, neurons with zero input connections or zero output connections may be safely pruned."
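That "safely pruned" check is easy to sketch for a hidden fully-connected layer (again my own illustration; w_in and w_out are hypothetical names for the already-pruned weight matrices around the layer):

import torch

def dead_neurons(w_in, w_out):
    # w_in: weights feeding the layer, shape (n_hidden, n_inputs)
    # w_out: weights the layer feeds into, shape (n_next, n_hidden)
    # A neuron whose incoming row or outgoing column is all zero after
    # pruning can be removed without changing the network's function.
    no_inputs = (w_in.abs().sum(dim=1) == 0)
    no_outputs = (w_out.abs().sum(dim=0) == 0)
    return (no_inputs | no_outputs).nonzero().flatten()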
SIDENOTE: They touch on the use of global average pooling instead of fully connected layers in CNNs: "There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling."
5. Many more can be picked from the references of these papers.
There's a paper on Bayesian Compression for Deep Learning from 2017. Their hypothesis is: "By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights." However, the method is mathematically heavy and the related work references are quite old (1990s, 2000s).