Revised documents for corpus
This commit is contained in:
parent 514f272a6d
commit 8b5f469305
@@ -1,555 +0,0 @@
IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION)
arXiv:1710.09282v7 [cs.LG] 7 Feb 2019

A Survey of Model Compression and Acceleration for Deep Neural Networks

Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
Abstract—Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. A natural thought, therefore, is to perform model compression and acceleration in deep networks without significantly decreasing model performance. Tremendous progress has been made in this area during the past few years. In this paper, we survey recently developed techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, and the other techniques are introduced afterwards. For each scheme, we provide insightful analysis of the performance, related applications, advantages, and drawbacks. We then go through a few very recent successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions on this topic.

Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration

I. INTRODUCTION

In recent years, deep neural networks have received a great deal of attention, been applied to many different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers; it usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. As another example, the top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model to get reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for real-time applications such as online learning and incremental learning. In addition, recent years have witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle the fundamental challenges of deploying deep learning systems to portable devices with limited resources (e.g., memory, CPU, energy, bandwidth). Efficient deep learning methods can have a significant impact on distributed systems, embedded devices, and FPGAs for artificial intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95MB of memory for storage and over 3.8 billion floating-point multiplications to process an image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computation time. For devices like cell phones and FPGAs with only a few megabytes of resources, how to compact the models used on them is also important.

Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent work on compressing and accelerating deep neural networks, which has attracted a lot of attention from the deep learning community and has already made considerable progress in the past years.

We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of deep CNNs. The approaches based on transferred/compact convolutional filters design special structural convolutional filters to reduce the parameter space and save storage/computation. The knowledge distillation methods learn a distilled model, training a more compact neural network to reproduce the output of a larger network.

Yu Cheng is a Researcher at Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA. Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China. Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

In Table I, we briefly summarize these four types of methods.
TABLE I
SUMMARIZATION OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION.

Theme: Parameter pruning and sharing
Description: Reducing redundant parameters which are not sensitive to the performance
Applications: Convolutional layer and fully connected layer
More details: Robust to various settings; can achieve good performance; can support both training from scratch and pre-trained models

Theme: Low-rank factorization
Description: Using matrix/tensor decomposition to estimate the informative parameters
Applications: Convolutional layer and fully connected layer
More details: Standardized pipeline, easily implemented; can support both training from scratch and pre-trained models

Theme: Transferred/compact convolutional filters
Description: Designing special structural convolutional filters to save parameters
Applications: Convolutional layer only
More details: Algorithms are dependent on applications; usually achieve good performance; only support training from scratch

Theme: Knowledge distillation
Description: Training a compact neural network with distilled knowledge of a large model
Applications: Convolutional layer and fully connected layer
More details: Model performance is sensitive to applications and network structure; only supports training from scratch
Generally, the parameter pruning & sharing, low-rank factorization, and knowledge distillation approaches can be used in DNN models with fully connected layers and convolutional layers, achieving comparable performance. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filter based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment, whereas parameter pruning & sharing methods use different techniques, such as vector quantization, binary coding, and sparse constraints, to perform the task, and generally take several steps to achieve the goal.

Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained models or trained from scratch, while the transferred/compact filter and knowledge distillation models only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We describe the details of each theme, along with their properties, strengths, and drawbacks, in the following sections.

Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compressed model.
II. P to learn the final weights for the remaining sparse connections. ARAMETER PRUNING AND SHARING This work achieved the state-of-art performance among allEarly works showed that network pruning is effective in parameter quantization based methods. It was shown in [11] reducing the network complexity and addressing the over- that Hessian weight could be used to measure the importancefitting problem [6]. After that researcher found pruning orig- of network parameters, and proposed to minimize Hessian-inally introduced to reduce the structure in neural networks weighted quantization errors in average for clustering networkand hence improve generalization, it has been widely studied parameters.to compress DNN models, trying to remove parameters which In the extreme case of the 1-bit representation of eachare not crucial to the model performance. These techniques can weight, that is binary weight neural networks. There arebe further classified into three sub-categories: quantization and many works that directly train CNNs with binary weights, forbinarization, parameter sharing, and structural matrix. instance, BinaryConnect [12], BinaryNet [13] and XNORNet-
|
||||
works [14]. The main idea is to directly learn binary weights orA. Quantization and Binarization activation during the model training. The systematic study in
|
||||
Network quantization compresses the original network by [15] showed that networks trained with back propagation could
|
||||
reducing the number of bits required to represent each weight. be resilient to specific weight distortions, including binary
|
||||
Gonget al.[6] and Wu et al. [7] appliedk-means scalar weights.
|
||||
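To make scalar quantization concrete, here is a minimal NumPy sketch of k-means weight sharing (illustrative only, not the exact procedure of [6], [7]); the function names kmeans_quantize and dequantize, the 4-bit codebook size, and the random example layer are assumptions made for this example. Each weight is stored as a small centroid index plus a shared codebook, which is where the savings come from.

```python
import numpy as np

def kmeans_quantize(weights, n_bits=4, n_iter=20, seed=0):
    """Cluster a weight tensor into 2**n_bits shared values (scalar k-means)."""
    rng = np.random.default_rng(seed)
    flat = weights.ravel()
    k = 2 ** n_bits
    # Initialize the codebook by sampling existing weight values.
    codebook = rng.choice(flat, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every weight to its nearest shared value.
        indices = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        # Move each shared value to the mean of the weights assigned to it.
        for j in range(k):
            members = flat[indices == j]
            if members.size:
                codebook[j] = members.mean()
    return indices.reshape(weights.shape), codebook

def dequantize(indices, codebook):
    """Rebuild the approximate weight tensor from indices and codebook."""
    return codebook[indices]

# Example: quantize a random 256x128 "layer" to 4 bits per weight.
W = np.random.default_rng(1).normal(size=(256, 128)).astype(np.float32)
idx, cb = kmeans_quantize(W, n_bits=4)
W_hat = dequantize(idx, cb)
print("mean absolute quantization error:", float(np.abs(W - W_hat).mean()))
# Storage drops from 32 bits per weight to roughly n_bits per weight,
# plus the tiny 2**n_bits-entry codebook.
```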
The method proposed in [10] quantized the link weights using weight sharing and then applied Huffman coding to the quantized weights, as well as to the codebook, to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved state-of-the-art performance among all parameter-quantization-based methods. It was shown in [11] that Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize the average Hessian-weighted quantization error when clustering network parameters.
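As a rough, self-contained illustration of this three-stage pipeline (a simplified sketch, not the implementation of [10]; the layer shape, 75% sparsity, 16 shared values, and helper names are assumed for the example), the code below prunes small-magnitude weights, maps the survivors onto a small set of shared values, and estimates how many bits a Huffman code would need for the resulting index stream. The retraining steps between stages are omitted.

```python
import heapq
from collections import Counter

import numpy as np

def prune_by_magnitude(weights, sparsity=0.75):
    """Stage 1: zero out the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} of a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate single-symbol case
        return {next(iter(counts)): 1}
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)   # a toy "layer"

# Stage 1: prune 75% of the connections (retraining between stages omitted).
W_pruned, mask = prune_by_magnitude(W, sparsity=0.75)

# Stage 2: weight sharing - map each surviving weight to one of 16 shared
# values on a uniform grid (a simple stand-in for k-means clustering).
survivors = W_pruned[mask]
grid = np.linspace(survivors.min(), survivors.max(), 16)
idx = np.argmin(np.abs(survivors[:, None] - grid[None, :]), axis=1)

# Stage 3: Huffman-code the shared-value indices (codebook and sparse-index
# overhead are ignored here) and compare against dense float32 storage.
lengths = huffman_code_lengths(idx.tolist())
index_bits = sum(lengths[s] for s in idx.tolist())
print("dense float32 bits:", W.size * 32)
print("pruned + shared + Huffman-coded index bits:", index_bits)
```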
In the extreme case of a 1-bit representation of each weight, we obtain binary weight neural networks. There are many works that directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during model training. The systematic study in [15] showed that networks trained with back-propagation could be resilient to specific weight distortions, including binary weights.
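A minimal sketch of the BinaryConnect-style idea on a toy logistic-regression problem (the data, hyper-parameters, and variable names are assumptions for this example, not the reference implementation): real-valued weights are kept for the update, but the forward pass uses only their sign, and the gradient computed with the binary weights is applied back to the real-valued weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, linearly separable binary classification problem.
X = rng.normal(size=(200, 20))
true_w = rng.normal(size=20)
y = (X @ true_w > 0).astype(np.float64)

w = rng.normal(scale=0.1, size=20)   # real-valued "master" weights
lr = 0.1

for step in range(200):
    wb = np.sign(w)                        # binarize to {-1, +1} for the forward pass
    wb[wb == 0] = 1.0
    p = 1.0 / (1.0 + np.exp(-(X @ wb)))    # logistic forward pass with binary weights
    grad = X.T @ (p - y) / len(y)          # gradient w.r.t. the binary weights
    w -= lr * grad                         # straight-through: update the real weights
    w = np.clip(w, -1.0, 1.0)              # keep master weights bounded, as in BinaryConnect

wb = np.sign(w)
wb[wb == 0] = 1.0
accuracy = float((((X @ wb) > 0).astype(np.float64) == y).mean())
print("training accuracy with binary weights:", accuracy)
```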
Drawbacks: the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogLeNet. Another drawback of existing binary nets is that the binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss. To address this issue, the work in [16] proposed a proximal Newton algorithm with a diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights.

connected layers, which is often the bottleneck in terms of memory consumption. These network layers use the nonlinear transforms f(x; M) =