Network Pruning

As one of the earliest works in network pruning, Yann LeCun's Optimal Brain Damage (OBD) paper has been cited in many of these papers.

Some research focuses on modular network designs. "These models, such as SqueezeNet, MobileNet and ShuffleNet, are basically made up of low resolutions convolution with lesser parameters and better performance."

Many of the recent papers I've read emphasize structured pruning (or sparsifying) as a compression and regularization method, as opposed to other techniques such as non-structured pruning (weight sparsifying and connection pruning), low-rank approximation and vector quantization (references to these approaches can be found in the related work sections of the following papers).

Difference between structured and non-structured pruning:

"Non-structured pruning aims to remove single parameters that have little influence on the accuracy of networks". For example, L1-norm regularization on weights counts as non-structured pruning, since it is basically a weight-sparsifying method, i.e. it removes single parameters.

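As a toy illustration of that non-structured case, the penalty can be added directly to the training loss; the model, coefficient and names below are made up for the sketch, not taken from any of the papers.

import torch.nn as nn

# Toy model; any network would do for this illustration.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
criterion = nn.CrossEntropyLoss()
l1_lambda = 1e-4  # assumed strength of the sparsity penalty

def loss_with_l1(outputs, targets):
    # Non-structured sparsity: penalize every individual weight, so single
    # parameters are pushed towards zero independently of each other.
    l1_term = sum(p.abs().sum() for p in model.parameters())
    return criterion(outputs, targets) + l1_lambda * l1_term
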
The term 'structure' refers to a structured unit in the network. So instead of pruning individual weights or connections, structured pruning targets neurons, filters, channels, layers, etc. The general implementation idea is the same as for penalizing individual weights, though: introduce a regularization term (mostly in the form of an L1 norm) into the loss function to penalize (sparsify) whole structures, as in the sketch below.

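A sketch of the structured counterpart, again with made-up names and coefficient: the penalty is placed on the norm of each convolutional filter (one group per output channel), so whole filters are driven towards zero together.

import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
group_lambda = 1e-3  # assumed strength of the structured penalty

def structured_penalty(model):
    # One group per filter: take the L2 norm of each filter's weights and sum
    # the norms (an L1 norm over filter norms). Filters whose norm is driven
    # to zero can later be removed as whole structures.
    total = 0.0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW)
            filter_norms = m.weight.flatten(1).norm(p=2, dim=1)
            total = total + filter_norms.sum()
    return group_lambda * total
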
I focused on structured pruning and read through the following papers:

1. Structured Pruning of Convolutional Neural Networks via L1 Regularization (August 2019)

"(...) network pruning is useful to remove redundant parameters, filters, channels or neurons, and address the over-fitting issue."

Provides a good review of previous work on non-structured and structured pruning.

"This study presents a scheme to prune filters or neurons of fully-connected layers based on L1 regularization to zero out the weights of some filters or neurons."

I didn't quite understand the method and implementation. There are two key elements: a mask and a threshold. "(...) the problem of zeroing out the values of some filters can be transformed to zero some mask." || "Though the proposed method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be completely zeroed in practical application, because the objective function (7) is non-convex and the global optimal solution may not be obtained. A strategy is adopted in the proposed method to solve this problem. If the order of magnitude of the mask value is small enough, it can be considered almost as zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...) The average value of the product of the mask and the weight is used to determine whether the mask is exactly zero or not."

From what I understand, they use an L1 norm in the loss function to penalize useless filters by penalizing their masks, and a threshold value is introduced to determine when a mask is small enough to be considered zero.

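My rough reading of the idea as a sketch; the mask parameter, its shape and the threshold value are my own guesses for illustration, not the paper's exact formulation.

import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, 3)              # layer whose filters we want to prune
mask = nn.Parameter(torch.ones(32))      # one learnable mask entry per filter
l1_lambda, threshold = 1e-3, 1e-3        # assumed hyperparameters

def masked_forward(x):
    # The mask scales each filter's output; it can later be absorbed into the weights.
    return conv(x) * mask.view(1, -1, 1, 1)

def mask_penalty():
    # L1 regularization on the mask pushes the masks of useless filters towards zero.
    return l1_lambda * mask.abs().sum()

def filters_to_prune():
    # Treat a mask as zero when the average magnitude of mask * weight falls
    # below the threshold (the paper's criterion, roughly paraphrased).
    score = (mask.view(-1, 1, 1, 1) * conv.weight).abs().mean(dim=(1, 2, 3))
    return (score < threshold).nonzero().flatten()
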
They test on MNIST (model: LeNet-5) and CIFAR-10 (models: VGG-16, ResNet-32).

2. Learning Efficient Convolutional Networks through Network Slimming (August 2017) + Git repo

"Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer)."

They provide good insight into the advantages and disadvantages of other computation reduction methods such as low-rank approximation, vector quantization, etc.

I believe they use the word 'channel' here to refer to filters (?).

"Our idea is introducing a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network." --> so instead of a 'mask' they use a 'scaling factor' and impose the regularization on that, but the idea is very similar.

"The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. Particularly, BN layer normalizes the internal activations using mini-batch statistics." || "(...) we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. It has the great advantage of introducing no overhead to the network."

They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40), ImageNet (model: VGG-A) and MNIST (model: LeNet).

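A minimal sketch of the slimming recipe as I understand it, assuming a model with BatchNorm2d layers; the penalty coefficient and pruning threshold are invented for the example.

import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
slim_lambda, threshold = 1e-4, 1e-2  # assumed values

def bn_l1_penalty(model):
    # L1 penalty on the existing BN scaling factors (gamma); nothing new is
    # added to the architecture, the gammas double as channel importances.
    return slim_lambda * sum(m.weight.abs().sum()
                             for m in model.modules()
                             if isinstance(m, nn.BatchNorm2d))

def channels_to_keep(bn):
    # After training, channels whose gamma stayed below the threshold are
    # pruned away and the remaining network is fine-tuned.
    return (bn.weight.abs() >= threshold).nonzero().flatten()
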
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo

"(...) we propose Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during the training. SSL is a generic regularization to adaptively adjust multiple structures in DNN, including structures of filters, channels, and filter shapes within each layer, and structure of depth beyond the layers." || "(...) offering not only well-regularized big models with improved accuracy but greatly accelerated computation."

Their objective has the form

E(W) = ED(W) + λ·R(W) + λg·Σ_{l=1..L} Rg(W(l))

"Here W represents the collection of all weights in the DNN; ED(W) is the loss
|
|||
|
on data; R(·) is non-structured regularization applying on every weight, e.g., L2-
|
|||
|
norm; and Rg(·) is the structured sparsity regularization on each layer. Because
|
|||
|
Group Lasso can effectively zero out all weights in some groups [14][15], we
|
|||
|
adopt it in our SSL. The regularization of group Lasso on a set of weights w can
|
|||
|
be represented as

Rg(w) = Σ_{g=1..G} ||w(g)||_2

, where w(g) is a group of partial weights in w and G is the total number of groups." || "In SSL, the learned “structure” is decided by the way of splitting groups of w(g). We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity (...)"

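A sketch of the two group definitions that were easiest for me to picture, filter-wise and channel-wise, with an assumed λg; the shape-wise and depth-wise variants follow the same pattern with different groupings.

import torch.nn as nn

lambda_g = 1e-3  # assumed group-sparsity strength

def group_lasso(conv: nn.Conv2d):
    # Filter-wise groups: one group per output filter (rows of the flattened weight).
    # Channel-wise groups: one group per input channel (dim 1 of the weight).
    w = conv.weight                         # shape: (out_ch, in_ch, kH, kW)
    filter_wise = w.flatten(1).norm(p=2, dim=1).sum()
    channel_wise = w.transpose(0, 1).flatten(1).norm(p=2, dim=1).sum()
    return lambda_g * (filter_wise + channel_wise)
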
They test on MNIST (models: LeNet, MLP), CIFAR-10 (models: ConvNet, ResNet-20) and ImageNet (model: AlexNet).

The authors also provide a visualization of filters after pruning, showing that only important detectors of patterns remain after pruning.

In the conclusions: "Moreover, a variant of SSL can be performed as structure regularization to improve classification accuracy of state-of-the-art DNNs."

4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)

"After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer." || "We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights (...)"

Although the description above implies the pruning was done only for FC layers, they also prune convolutional layers, although they don't provide much detail on this in the methods. But there is this statement where they explain retraining: "(...) we fix the parameters for CONV layers and only retrain the FC layers after pruning the FC layers, and vice versa." The results section also shows that convolutional layer connections were pruned on the tested models.

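A minimal sketch of the prune-then-retrain loop, assuming PyTorch layers and an invented threshold; masks are kept so that retraining cannot revive pruned connections.

import torch
import torch.nn as nn

threshold = 1e-2  # assumed pruning threshold

def prune_below_threshold(model):
    # Zero out every connection whose weight magnitude is below the threshold
    # and remember the masks, turning dense layers into (logically) sparse ones.
    masks = {}
    with torch.no_grad():
        for name, m in model.named_modules():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                mask = (m.weight.abs() >= threshold).float()
                m.weight.mul_(mask)
                masks[name] = mask
    return masks

def reapply_masks(model, masks):
    # Called after each retraining step so pruned connections stay at zero;
    # pruning and retraining can then be repeated iteratively.
    with torch.no_grad():
        for name, m in model.named_modules():
            if name in masks:
                m.weight.mul_(masks[name])
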
They test on MNIST (models: LeNet-300-100 (MLP), LeNet-5 (CNN)) and ImageNet (models: AlexNet, VGG-16).

The authors provide a visualization of the sparsity patterns of neurons after pruning (for an FC layer), which shows that pruning can detect visual attention regions.

The method used in this paper targets individual parameters (weights) to prune, so technically it should be considered a non-structured pruning method. However, the reason I think it is referenced as a structured pruning method is that if all connections of a neuron are pruned (i.e. all its input and output weights were below the threshold), the neuron itself is removed from the network: "After pruning connections, neurons with zero input connections or zero output connections may be safely pruned."

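A small sketch of that last point, assuming a hidden layer sitting between two nn.Linear layers (names are mine, for illustration only).

import torch.nn as nn

def dead_neurons(fc_in: nn.Linear, fc_out: nn.Linear):
    # A hidden neuron's incoming weights are a row of fc_in.weight and its
    # outgoing weights are a column of fc_out.weight. If either set is entirely
    # zero after pruning, the neuron can be removed from the network.
    no_inputs = (fc_in.weight.abs().sum(dim=1) == 0)
    no_outputs = (fc_out.weight.abs().sum(dim=0) == 0)
    return (no_inputs | no_outputs).nonzero().flatten()
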
SIDENOTE: They touch on the use of global average pooling instead of fully connected layers in CNNs: "There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling."

5. Many more can be picked from the references of these papers.

There's a paper on Bayesian compression for Deep Learning from 2017. Their hypothesis is: "By employing sparsity inducing priors for hidden units (and not individual weights) we can prune neurons including all their ingoing and outgoing weights." However, the method is mathematically heavy and the related work references are quite old (1990s, 2000s).