Processed various texts for the NN
This commit is contained in:
parent
93b3da7a7d
commit
e78ae20e92
5546
Corpus/CORPUS.txt
@@ -1,399 +0,0 @@
Learning Efficient Convolutional Networks through Network Slimming

Zhuang Liu 1∗  Jianguo Li 2  Zhiqiang Shen 3  Gao Huang 4  Shoumeng Yan 2  Changshui Zhang 1
1 CSAI, TNList, Tsinghua University  2 Intel Labs China  3 Fudan University  4 Cornell University
{liuzhuangthu, zhiqiangshen0214}@gmail.com, {jianguo.li, shoumeng.yan}@intel.com,
gh349@cornell.edu, zcs@mail.tsinghua.edu.cn

∗ This work was done when Zhuang Liu and Zhiqiang Shen were interns at Intel Labs China. Jianguo Li is the corresponding author.
Abstract

The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming: it takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.

1. Introduction

In recent years, convolutional neural networks (CNNs) have become the dominant approach for a variety of computer vision tasks, e.g., image classification [22], object detection [8], semantic segmentation [26]. Large-scale datasets, high-end modern GPUs and new network architectures allow the development of unprecedentedly large CNN models. For instance, from AlexNet [22], VGGNet [31] and GoogleNet [34] to ResNets [14], the ImageNet Classification Challenge winner models have evolved from 8 layers to more than 100 layers.

However, larger CNNs, although with stronger representation power, are more resource-hungry. For instance, a 152-layer ResNet [14] has more than 60 million parameters and requires more than 20 Giga floating-point operations (FLOPs) when inferencing an image with resolution 224×224. This is unlikely to be affordable on resource-constrained platforms such as mobile devices, wearables or Internet of Things (IoT) devices.

The deployment of CNNs in real world applications is mostly constrained by 1) Model size: CNNs' strong representation power comes from their millions of trainable parameters. Those parameters, along with the network structure information, need to be stored on disk and loaded into memory during inference time. As an example, storing a typical CNN trained on ImageNet consumes more than 300MB of space, which is a big resource burden for embedded devices. 2) Run-time memory: During inference time, the intermediate activations/responses of CNNs can take even more memory space than storing the model parameters, even with batch size 1. This is not a problem for high-end GPUs, but unaffordable for many applications with low computational power. 3) Number of computing operations: The convolution operations are computationally intensive on high resolution images. A large CNN may take several minutes to process a single image on a mobile device, making it unrealistic to adopt for real applications.

Many works have been proposed to compress large CNNs or directly learn more efficient CNN models for fast inference. These include low-rank approximation [7], network quantization [3, 12] and binarization [28, 6], weight pruning [12], dynamic inference [16], etc. However, most of these methods can only address one or two of the challenges mentioned above. Moreover, some of the techniques require specially designed software/hardware accelerators for execution speedup [28, 6, 12].

Another direction to reduce the resource consumption of large CNNs is to sparsify the network. Sparsity can be imposed on different levels of structures [2, 37, 35, 29, 25], which yields considerable model-size compression and inference speedup. However, these approaches generally require special software/hardware accelerators to harvest the gains in memory or time, though this is easier than for the non-structured sparse weight matrices of [12].
[Figure 1 (schematic): the i-th conv-layer's channels C_i1 ... C_in each carry a channel scaling factor (e.g. 1.170, 0.001, 0.290, 0.003, ..., 0.820); channels with small factors are pruned, turning the initial network into a compact network whose (i+1)-th = j-th conv-layer only receives the surviving channels C_j1, C_j2, ...]

Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then fine-tuned to achieve comparable (or even higher) accuracy as normally trained full networks.
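As a rough sketch of the mechanism in Figure 1 (not the authors' released code; the function name, the use of PyTorch and the 70% prune ratio are illustrative assumptions), the channels to keep can be selected by thresholding the absolute BN scaling factors globally:

    import torch
    import torch.nn as nn

    def channel_keep_masks(model: nn.Module, prune_ratio: float = 0.7):
        # Gather every per-channel scaling factor (bn.weight) in the network,
        # choose a global threshold so that `prune_ratio` of them fall below it,
        # and mark the surviving channels of each BN layer.
        gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
        threshold = torch.quantile(gammas, prune_ratio)
        return {name: (m.weight.data.abs() > threshold)   # True = keep channel
                for name, m in model.named_modules()
                if isinstance(m, nn.BatchNorm2d)}

The surviving channels would then be copied into a narrower network (the "compact network" on the right of Figure 1), which is fine-tuned as described in the caption.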
In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, so it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates channel-level pruning in the subsequent step. The additional regularization term rarely hurts the performance; in fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated for by the fine-tuning of the pruned network that follows. After pruning, the resulting narrower network is much more compact in terms of model size, run-time memory, and computing operations compared to the initial wide network. The above process can be repeated several times, yielding a multi-pass network slimming scheme which leads to an even more compact network.
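As a minimal illustration of this idea (a sketch in PyTorch, not the authors' implementation; the helper name and the λ value are placeholders), the sparsity term can simply be added to the ordinary training loss, with the BatchNorm weight tensor playing the role of the per-channel scaling factor:

    import torch.nn as nn

    def bn_l1_penalty(model: nn.Module, lam: float = 1e-4):
        # L1 norm of all BN scaling factors (gamma is stored as bn.weight);
        # lam balances the task loss and the sparsity-induced penalty.
        l1 = sum(m.weight.abs().sum()
                 for m in model.modules()
                 if isinstance(m, nn.BatchNorm2d))
        return lam * l1

    # inside a training step:
    #   loss = criterion(model(x), y) + bn_l1_penalty(model)
    #   loss.backward(); optimizer.step()

Because the penalty touches only the BN scaling factors, no architectural change and no extra trainable parameters are needed, which is the point made above.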
Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20x model-size compression and a 5x reduction in computing operations relative to the original models, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storing format or computing operations.

2. Related Work

In this section, we discuss related work from five aspects.

Low-rank Decomposition approximates the weight matrix in neural networks with a low-rank matrix using techniques like Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yielding ∼3x model-size compression, but without notable speed acceleration, since the computing operations in a CNN mainly come from the convolutional layers.

Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups and within each group the weight value is shared. In this way only the shared weights and hash indices need to be stored, so a large amount of storage space can be saved. [12] uses an improved quantization technique in a deep compression pipeline and achieves 35x to 49x compression rates on AlexNet and VGGNet. However, these techniques can save neither run-time memory nor inference time, since during inference the shared weights need to be restored to their original positions.

[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {−1, 1} or {−1, 0, 1}). This yields a large amount of model-size saving, and significant speedup can also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation usually comes with a moderate accuracy loss.

Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neural networks. The resulting network's weights are mostly zeros, so the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) rather than the weights.

In [12], there is no guidance for sparsity during training. [32] overcomes this limitation by explicitly imposing a sparsity constraint on each weight with additional gate variables, and achieves high compression rates by pruning connections with zero gate values. This method achieves a better compression rate than [12], but suffers from the same drawback.
Structured Pruning / Sparsifying. Recently, [23] proposes to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by randomly deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to a smoother channel pruning process and little accuracy loss.

[37] imposes neuron-level sparsity during training, so that some neurons can be pruned to obtain compact networks. [35] proposes a Structured Sparsity Learning (SSL) method to sparsify different levels of structures (e.g. filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, so the optimization objective is much simpler.

Since these methods prune or sparsify parts of the network structure (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries (e.g. for sparse computing operations) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits.

Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces sub-modular/super-modular optimization for network architecture search under a given resource budget. Some recent works [38, 1] propose to learn neural architectures automatically with reinforcement learning. The search space of these methods is extremely large, so one needs to train hundreds of models to distinguish good ones from bad ones. Network slimming can also be treated as an approach for architecture learning, although the choices are limited to the width of each layer. However, in contrast to the aforementioned methods, network slimming learns the network architecture through a single training process, which is in line with our goal of efficiency.

Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at different levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained (e.g., weight-level) sparsity gives the highest flexibility and generality, and leads to a higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. On the contrary, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, but it is less flexible, as some whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNN or fully-connected network (treating each neuron as a channel), and the resulting network is essentially a “thinned” version of the unpruned network, which can be efficiently inferenced on conventional CNN platforms.

Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near-zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of ∼10% in the number of parameters without suffering accuracy loss. [35] addresses this problem by enforcing sparsity regularization in the training objective. Specifically, it adopts group LASSO to push all the filter weights corresponding to the same channel towards zero simultaneously during training. However, this approach requires computing the gradients of the additional regularization term with respect to all the filter weights, which is nontrivial. We introduce a simple idea to address the above challenges, and the details are presented below.

Scaling Factors and Sparsity-induced Penalty. Our idea is to introduce a scaling factor γ for each channel, which is multiplied to the output of that channel. We then jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by
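    L = \sum_{(x, y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma),

where (x, y) denote the training input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(·) is a sparsity-induced penalty on the scaling factors, and λ balances the two terms. Network slimming's stated choice is g(s) = |s|, i.e. the L1 norm, which is widely used to achieve sparsity (in contrast to the group-LASSO penalty on filter weights discussed under Challenges).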