More corpus documents

This commit is contained in:
Eduardo Cueto Mendoza 2020-08-06 14:53:44 -06:00
parent f30a0b2be3
commit 514f272a6d
47 changed files with 12133 additions and 0 deletions


@ -0,0 +1,555 @@
IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION) 1
A Survey of Model Compression and Acceleration
for Deep Neural Networks
Yu Cheng, Duo Wang, Pan Zhou, Member, IEEE, and Tao Zhang, Senior Member, IEEE
(Yu Cheng is a Researcher at Microsoft AI & Research, One Microsoft Way, Redmond, WA 98052, USA. Duo Wang and Tao Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China. Pan Zhou is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.)

arXiv:1710.09282v7 [cs.LG] 7 Feb 2019

Abstract—Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey the recently developed techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, after which the other techniques are introduced. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then discuss a few recent additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude the paper and discuss remaining challenges and possible directions on this topic.

Index Terms—Deep Learning, Convolutional Neural Networks, Model Compression and Acceleration

I. INTRODUCTION

In recent years, deep neural networks have received lots of attention, been applied to many different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of GPUs with very high computation capability plays a key role in their success. For example, the work by Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. It usually takes two to three days to train the whole model on the ImageNet dataset with an NVIDIA K40 machine. Another example is that the top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers [2], [3]. It is also very time-consuming to train such a model to reach reasonable performance. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions [4].

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for real-time applications such as online learning and incremental learning. In addition, recent years have witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle the fundamental challenges in deploying deep learning systems to portable devices with limited resources (e.g., memory, CPU, energy, bandwidth). Efficient deep learning methods can have a significant impact on distributed systems, embedded devices, and FPGAs for artificial intelligence. For example, ResNet-50 [5], with 50 convolutional layers, needs over 95 MB of memory for storage and over 3.8 billion floating-point multiplications to process a single image. After discarding some redundant weights, the network still works as usual but saves more than 75% of the parameters and 50% of the computation time. For devices like cell phones and FPGAs with only a few megabytes of resources, how to compact the models used on them is also important.

Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design. In this paper, we review recent work on compressing and accelerating deep neural networks, which has attracted a lot of attention from the deep learning community and has already made substantial progress in the past years.

We classify these approaches into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The parameter pruning and sharing based methods explore the redundancy in the model parameters and try to remove the redundant and uncritical ones. Low-rank factorization based techniques use matrix/tensor decomposition to estimate the informative parameters of deep CNNs. The approaches based on transferred/compact convolutional filters design special structural convolutional filters to reduce the parameter space and save storage/computation. The knowledge distillation methods learn a distilled model, training a more compact neural network to reproduce the output of a larger network.
TABLE I
SUMMARIZATION OF DIFFERENT APPROACHES FOR MODEL COMPRESSION AND ACCELERATION.

Parameter pruning and sharing
  Description: Reducing redundant parameters which are not sensitive to the performance
  Applications: Convolutional layer and fully connected layer
  More details: Robust to various settings, can achieve good performance, can support both train from scratch and pre-trained model

Low-rank factorization
  Description: Using matrix/tensor decomposition to estimate the informative parameters
  Applications: Convolutional layer and fully connected layer
  More details: Standardized pipeline, easily implemented, can support both train from scratch and pre-trained model

Transferred/compact convolutional filters
  Description: Designing special structural convolutional filters to save parameters
  Applications: Convolutional layer only
  More details: Algorithms are dependent on applications, usually achieve good performance, only support train from scratch

Knowledge distillation
  Description: Training a compact neural network with distilled knowledge of a large model
  Applications: Convolutional layer and fully connected layer
  More details: Model performances are sensitive to applications and network structure, only support train from scratch
In Table I, we briefly summarize these four types of methods. Generally, the parameter pruning & sharing, low-rank factorization, and knowledge distillation approaches can be used in DNN models with both fully connected and convolutional layers, achieving comparable performance. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filter based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment. Parameter pruning & sharing, in contrast, uses different methods such as vector quantization, binary coding, and sparse constraints to perform the task, and it generally takes several steps to achieve the goal.

Regarding the training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pre-trained models or trained from scratch, while the transferred/compact filter and knowledge distillation models can only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning & sharing can be used together, and model quantization & binarization can be used together with low-rank approximations to achieve further speedup. We describe the details of each theme, along with their properties, strengths, and drawbacks, in the following sections.

Fig. 1. The three-stage compression method proposed in [10]: pruning, quantization and encoding. The input is the original model and the output is the compressed model.

II. PARAMETER PRUNING AND SHARING

Early works showed that network pruning is effective in reducing network complexity and addressing the over-fitting problem [6]. Although pruning was originally introduced to reduce the structure of neural networks and hence improve generalization, it has since been widely studied as a way to compress DNN models by removing parameters that are not crucial to the model performance. These techniques can be further classified into three sub-categories: quantization and binarization, parameter sharing, and structural matrices.

A. Quantization and Binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speed-up with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.

The method proposed in [10] quantized the link weights using weight sharing and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it started by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network was retrained to learn the final weights for the remaining sparse connections. This work achieved state-of-the-art performance among all parameter quantization based methods. It was shown in [11] that the Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize the average Hessian-weighted quantization error when clustering network parameters.
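To make the weight-sharing idea above concrete, the following is a minimal sketch (not the exact procedure of [6] or [10]) of k-means scalar quantization of a weight matrix: the weights are clustered into 2^b centroids, and each weight is then stored as a b-bit index into the shared codebook. The function names and the NumPy-only implementation are illustrative assumptions.

```python
import numpy as np

def kmeans_quantize(weights, bits=4, iters=20):
    """Cluster weights into 2**bits shared values (a codebook) and
    return (codebook, indices); a toy stand-in for weight sharing."""
    flat = weights.ravel()
    k = 2 ** bits
    # Initialize centroids spread over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Recompute each centroid as the mean of its assigned weights.
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids, idx.reshape(weights.shape)

# Example: quantize a random 256x512 "fully connected" weight matrix.
W = np.random.randn(256, 512).astype(np.float32)
codebook, idx = kmeans_quantize(W, bits=4)
W_quantized = codebook[idx]                        # reconstructed shared weights
print("unique values:", np.unique(W_quantized).size)  # at most 16
```

Storage drops from 32 bits per weight to b bits per index plus a tiny codebook; Huffman-coding the index stream, as in [10], compresses it further.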
In the extreme case of a 1-bit representation of each weight, i.e., binary weight neural networks, there are many works that directly train CNNs with binary weights, for instance, BinaryConnect [12], BinaryNet [13] and XNOR-Networks [14]. The main idea is to directly learn binary weights or activations during model training. The systematic study in [15] showed that networks trained with back-propagation could be resilient to specific weight distortions, including binary weights.
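As a rough illustration of the binary-weight idea shared by these methods (a simplified sketch, not the published training procedures of [12]-[14]), one keeps full-precision weights for the gradient update but uses only their sign, scaled by the mean absolute value, in the forward pass. The helper names below are hypothetical.

```python
import numpy as np

def binarize(weights):
    """Map weights to {-alpha, +alpha}, where alpha is the mean absolute
    value (an XNOR-Net-style scaling); a simplified sketch."""
    alpha = np.abs(weights).mean()
    return alpha * np.sign(weights)

def forward(x, real_weights):
    """Forward pass with binarized weights; in training, the full-precision
    copy would still be the one updated by the gradient step."""
    w_b = binarize(real_weights)
    return x @ w_b

x = np.random.randn(8, 128)
W = np.random.randn(128, 64) * 0.05
y = forward(x, W)
print(y.shape)   # (8, 64), computed with 1-bit (plus scale) weights
```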
Drawbacks: the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss.
To address this issue, the work in [16] proposed a proximal Newton algorithm with a diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [17] reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting the multiplications in the hidden state computation to significant changes.

B. Pruning and Sharing

Network pruning and sharing has been used both to reduce network complexity and to address the over-fitting issue. An early approach to pruning was Biased Weight Decay [18]. The Optimal Brain Damage [19] and Optimal Brain Surgeon [20] methods reduced the number of connections based on the Hessian of the loss function, and their work suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. The training procedure of those methods followed the train-from-scratch manner.

A recent trend in this direction is to prune redundant, non-informative weights in a pre-trained CNN model. For example, Srinivas and Babu [21] explored the redundancy among neurons and proposed a data-free pruning method to remove redundant neurons. Han et al. [22] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [23] proposed the HashedNets model, which used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [24], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re-)training procedure. The above pruning schemes typically produce connection pruning in CNNs.

There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are typically introduced into the optimization problem as l0- or l1-norm regularizers. The work in [25] imposed group sparsity constraints on the convolutional filters to achieve structured Brain Damage, i.e., pruning entries of the convolution kernels in a group-wise fashion. In [26], a group-sparse regularizer on neurons was introduced during the training stage to learn compact CNNs with reduced filters. Wen et al. [27] added a structured sparsity regularizer on each layer to reduce trivial filters, channels, or even layers. For filter-level pruning, all the above works used l2,1-norm regularizers. The work in [28] used the l1-norm to select and prune unimportant filters.

Drawbacks: there are some potential issues with pruning and sharing. First, pruning with l1 or l2 regularization requires more iterations to converge than general training. In addition, all pruning criteria require manual setup of sensitivity for each layer, which demands fine-tuning of the parameters and could be cumbersome for some applications.
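The connection-pruning schemes above share a simple core: zero out weights whose magnitude falls below a threshold chosen from a target sparsity, then fine-tune the surviving weights under the resulting mask. The sketch below illustrates that core idea only (it is not the specific criterion of [22] or of the regularized variants); the function names are ours.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Return a 0/1 mask that keeps the largest-magnitude weights so
    that roughly `sparsity` of the entries are removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) > threshold).astype(weights.dtype)

def apply_mask(weights, mask):
    """Pruned layer: masked weights stay exactly zero; during fine-tuning
    the same mask would be re-applied after each update."""
    return weights * mask

W = np.random.randn(512, 512).astype(np.float32)
mask = magnitude_prune(W, sparsity=0.9)
W_pruned = apply_mask(W, mask)
print("kept fraction:", mask.mean())   # ~0.1
```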
C. Designing Structural Matrices

In architectures that contain fully-connected layers, it is critical to explore the redundancy of parameters in those layers, which are often the bottleneck in terms of memory consumption. These network layers use the nonlinear transform f(x, M) = σ(Mx), where σ(·) is an element-wise nonlinear operator, x is the input vector, and M is the m×n matrix of parameters [29]. When M is a large general dense matrix, the cost is that of storing mn parameters and computing matrix-vector products in O(mn) time. Thus, an intuitive way to prune parameters is to impose M as a parameterized structural matrix. An m×n matrix that can be described using far fewer than mn parameters is called a structured matrix. Typically, the structure should not only reduce the memory cost, but also dramatically accelerate the inference and training stages via fast matrix-vector multiplication and gradient computations.

Following this direction, the work in [30], [31] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector r = (r_0, r_1, ..., r_{d-1}), a circulant matrix R in R^{d×d} is defined as:

R = \mathrm{circ}(r) := \begin{bmatrix}
r_0 & r_{d-1} & \cdots & r_2 & r_1 \\
r_1 & r_0 & r_{d-1} & & r_2 \\
\vdots & r_1 & r_0 & \ddots & \vdots \\
r_{d-2} & & \ddots & \ddots & r_{d-1} \\
r_{d-1} & r_{d-2} & \cdots & r_1 & r_0
\end{bmatrix},    (1)

so the memory cost becomes O(d) instead of O(d^2). This circulant structure also enables the use of the Fast Fourier Transform (FFT) to speed up the computation. Given a d-dimensional vector r, the 1-layer circulant neural network in Eq. (1) has time complexity O(d log d).

In [32], a novel Adaptive Fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The Adaptive Fastfood transform matrix R in R^{n×d} was defined as:

R = S H G Π H B,    (2)

where S, G and B are random diagonal matrices, Π ∈ {0,1}^{d×d} is a random permutation matrix, and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the Adaptive Fastfood transform reduces the storage and the computational costs from O(nd) to O(n) and from O(nd) to O(n log d), respectively.

The work in [29] showed the effectiveness of this new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multi-level Toeplitz-like matrices [33] related to multi-dimensional convolution [34]. Following this idea, [35] proposed a general structured efficient linear layer for CNNs.

Drawbacks: one problem of this kind of approach is that the structural constraint can hurt the performance, since the constraint may bring bias into the model. On the other hand, finding a proper structural matrix is difficult: there is no theoretical way to derive one.
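To see why the circulant structure in Eq. (1) gives O(d log d) products, note that multiplying circ(r) by a vector is a circular convolution, which the FFT diagonalizes. Below is a small NumPy check of this identity (our own illustration, not code from [30], [31]).

```python
import numpy as np

def circulant(r):
    """Dense circulant matrix with first column r, matching Eq. (1):
    R[i, j] = r[(i - j) mod d]. Built only to verify the fast version."""
    d = len(r)
    i = np.arange(d)
    return r[(i[:, None] - i[None, :]) % d]

def circulant_matvec_fft(r, x):
    """Compute circ(r) @ x in O(d log d): a circulant matvec is the
    circular convolution of r and x, done in the Fourier domain."""
    return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

d = 1024
r = np.random.randn(d)
x = np.random.randn(d)
dense = circulant(r) @ x            # O(d^2) reference
fast = circulant_matvec_fft(r, x)   # O(d log d)
print(np.allclose(dense, fast))     # True
```

In a circulant fully connected layer, r would be the trainable parameter vector, so storage drops from O(d^2) to O(d) and the matvec from O(d^2) to O(d log d).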
Fig. 2. A typical framework of the low-rank regularization method. The left is the original convolutional layer and the right is the low-rank constrained convolutional layer with rank K.

TABLE II
COMPARISONS BETWEEN THE LOW-RANK MODELS AND THEIR BASELINES ON ILSVRC-2012.

Model          TOP-5 Accuracy   Speed-up   Compression Rate
AlexNet        80.03%           1.         1.
  BN Low-rank  80.56%           1.09       4.94
  CP Low-rank  79.66%           1.82       5.
VGG-16         90.60%           1.         1.
  BN Low-rank  90.47%           1.53       2.72
  CP Low-rank  90.31%           2.05       2.75
GoogleNet      92.21%           1.         1.
  BN Low-rank  91.88%           1.08       2.79
  CP Low-rank  91.79%           1.20       2.84
III. LOW-RANK FACTORIZATION AND SPARSITY

Convolution operations contribute the bulk of most computations in deep CNNs, so reducing the convolutional layers would improve the compression rate as well as the overall speedup. A convolution kernel can be viewed as a 4D tensor. Ideas based on tensor decomposition derive from the intuition that there is a significant amount of redundancy in the 4D tensor, and removing it is a particularly promising direction. Regarding the fully-connected layers, they can be viewed as 2D matrices, and low-rankness can also help there.

Low-rank filters have long been used to accelerate convolution; for example, high-dimensional DCT (discrete cosine transform) and wavelet systems were constructed from 1D DCT transforms and 1D wavelets, respectively, using tensor products. Learning separable 1D filters was introduced by Rigamonti et al. [36], following the dictionary learning idea. For some simple DNN models, a few low-rank approximation and clustering schemes for the convolutional kernels were proposed in [37]; they achieved a 2x speedup for a single convolutional layer with a 1% drop in classification accuracy. The work in [38] proposed using different tensor decomposition schemes, reporting a 4.5x speedup with a 1% drop in accuracy in text recognition.

The low-rank approximation was done layer by layer: the parameters of one layer were fixed after it was decomposed, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2D convolutional layers, as described in Figure 2. Following this direction, Canonical Polyadic (CP) decomposition of the kernel tensors was proposed in [39]; their work used nonlinear least squares to compute the CP decomposition. In [40], a new algorithm for computing the low-rank tensor decomposition, used to train low-rank constrained CNNs from scratch, was proposed. It used Batch Normalization (BN) to transform the activations of the internal hidden units. In general, both the CP and the BN decomposition schemes in [40] (BN Low-rank) can be used to train CNNs from scratch. However, there are a few differences between them. For example, finding the best low-rank approximation in CP decomposition is an ill-posed problem, and the best rank-K approximation (K is the rank) may not exist, while for the BN scheme the decomposition always exists. We perform a simple comparison of both methods in Table II, using the actual speedup and compression rates to measure their performance.

As we mentioned before, the fully connected layers can be viewed as 2D matrices, so the above-mentioned methods can also be applied there. There are several classical works on exploiting low-rankness in fully connected layers. For instance, Denil et al. [41] reduced the number of dynamic parameters in deep models using the low-rank method. The work in [42] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling. In [3], Lu et al. adopted truncated SVD (singular value decomposition) to decompose the fully connected layer for designing compact multi-task deep learning architectures.

Drawbacks: low-rank approaches are straightforward for model compression and acceleration, and the idea complements recent advances in deep learning such as dropout, rectified units and maxout. However, the implementation is not that easy, since it involves decomposition operations that are computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when compared to the original model.
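As an illustration of the truncated-SVD idea used for fully connected layers (a generic sketch, not the exact procedure of [3] or [42]), a weight matrix W of size m x n is replaced by two factors of rank k, so storage and multiply cost drop from mn to k(m + n) when k is small. The names and numbers below are illustrative only.

```python
import numpy as np

def truncated_svd_factorize(W, k):
    """Factor W (m x n) into A (m x k) and B (k x n) keeping the top-k
    singular values; the layer y = W x becomes y = A (B x)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]      # absorb singular values into A
    B = Vt[:k, :]
    return A, B

m, n, k = 1024, 4096, 64
# Pre-trained FC weights are often approximately low-rank; simulate that here.
W = (np.random.randn(m, k) @ np.random.randn(k, n)
     + 0.01 * np.random.randn(m, n)).astype(np.float32)
A, B = truncated_svd_factorize(W, k)

x = np.random.randn(n).astype(np.float32)
y_full = W @ x
y_lowrank = A @ (B @ x)       # two thin matvecs instead of one big one

print("compression:", (m * n) / (k * (m + n)))               # 12.8x here
print("relative error:",
      np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
```

In practice the two factors are fine-tuned after the decomposition, layer by layer, as described above.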
IV. TRANSFERRED/COMPACT CONVOLUTIONAL FILTERS

CNNs are parameter-efficient because they exploit the translation-invariant property of the representations with respect to the input image, which is key to training very deep models without severe over-fitting. Although a strong theory is currently missing, a large amount of empirical evidence supports the notion that both the translation-invariant property and convolutional weight sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent work in [43], which introduced the equivariant group theory. Let x be an input, Φ(·) be a network or layer, and T(·) be the transform matrix. The concept of equivariance is defined as:

T'Φ(x) = Φ(Tx),    (3)

indicating that transforming the input x by the transform T(·) and then passing it through the network or layer Φ(·) should give the same result as first mapping x through the network and then transforming the representation. Note that in Eq. (3) the transforms T(·) and T'(·) are not necessarily the same, as they operate on different objects. According to this theory, it is reasonable to apply transforms to the layers or filters Φ(·) to compress the whole network model. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters obtained by applying a certain transform T(·) to a small set of base filters, since the transform acts as a regularizer for the model.

Following this direction, many recent works propose to build a convolutional layer from a set of base filters [43]-[46]. What they have in common is that the transform T(·) lies in the family of functions that only operate in the spatial domain of the convolutional filters. For example, the work in [45] found that the lower convolutional layers of CNNs learn redundant filters to extract both positive and negative phase information of an input signal, and defined T(·) to be the simple negation function:

T(W_x) = W_x^-,    (4)

where W_x is the basis convolutional filter and W_x^- is the filter consisting of the shifts whose activation is opposite to that of W_x, selected after the max-pooling operation. In this way, the work in [45] can easily achieve a 2x compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer that improves the classification accuracy. The intuition is that a learning algorithm with a pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones.

In [46], it was observed that the magnitudes of the responses from convolutional kernels have a wide diversity of pattern representations in the network, and that it is not proper to discard weaker signals with a single threshold. Thus a multi-bias non-linearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform T(·) was defined as:

T_δ(x) = W_x + δ,    (5)

where δ are the multi-bias factors. The work in [47] considered a combination of rotation by a multiple of 90° and horizontal/vertical flipping, with:

T_θ(x) = W^{T_θ},    (6)

where W^{T_θ} is the transformation matrix that rotates the original filters by an angle θ ∈ {90°, 180°, 270°}. In [43], the transform was generalized to an arbitrary angle learned from data, and θ was directly obtained from the data. Both works [47] and [43] achieve good classification performance.

The work in [44] defined T(·) as the set of translation functions applied to 2D filters:

T'(x) = {T(·, x, y)}, x, y ∈ {-k, ..., k}, (x, y) ≠ (0, 0),    (7)

where T(·, x, y) denotes translating the first operand by (x, y) along its spatial dimensions, with proper zero padding at the borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy, acting as a regularized version of maxout networks, and 2) achieve parameter efficiency by flexibly varying the architecture to compress networks.

Table III briefly compares the performance of different methods with transferred convolutional filters, using VGGNet (16 layers) as the baseline model. Results are reported on the CIFAR-10 and CIFAR-100 datasets with Top-5 error. It is observed that these methods can achieve a reduction in parameters with little or no drop in classification accuracy.

TABLE III
A SIMPLE COMPARISON OF DIFFERENT APPROACHES ON CIFAR-10 AND CIFAR-100.

Model        CIFAR-100   CIFAR-10   Compression Rate
VGG-16       34.26%      9.85%      1.
MBA [46]     33.66%      9.76%      2.
CRELU [45]   34.57%      9.92%      2.
CIRC [43]    35.15%      10.23%     4.
DCNN [44]    33.57%      9.65%      1.62

Drawbacks: a few issues remain to be addressed for approaches that apply transform constraints to convolutional filters. First, these methods can achieve competitive performance for wide/flat architectures (like VGGNet) but not for thin/deep ones (like GoogleNet or ResNet). Secondly, the transfer assumptions are sometimes too strong to guide the learning, making the results unstable in some cases.

Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace loose and over-parametric filters with compact blocks to improve the speed, which significantly accelerates CNNs on several benchmarks. Decomposing a 3x3 convolution into two 1x1 convolutions was used in [48], which achieved significant acceleration on object recognition. SqueezeNet [49] was proposed to replace 3x3 convolutions with 1x1 convolutions, creating a compact neural network with about 50x fewer parameters and comparable accuracy when compared to AlexNet.
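A minimal sketch of the transferred-filter idea (our own illustration, not the specific constructions of [43]-[47]): a small set of base filters is expanded into a larger bank by applying cheap spatial transforms such as negation, flips, and 90° rotations, so only the base filters need to be stored and learned.

```python
import numpy as np

def expand_filter_bank(base_filters):
    """base_filters: array of shape (num_base, k, k).
    Returns a larger bank built by applying parameter-free transforms
    (negation as in Eq. (4), 90-degree rotations, horizontal flip)."""
    transforms = [
        lambda w: w,               # identity
        lambda w: -w,              # negation transform
        lambda w: np.rot90(w, 1),  # rotate 90 degrees
        lambda w: np.rot90(w, 2),  # rotate 180 degrees
        lambda w: np.rot90(w, 3),  # rotate 270 degrees
        lambda w: w[:, ::-1],      # horizontal flip
    ]
    bank = [t(w) for w in base_filters for t in transforms]
    return np.stack(bank)

base = np.random.randn(8, 3, 3)        # 8 learned base filters
bank = expand_filter_bank(base)        # 48 filters, still only 8*9 parameters
print(base.shape, "->", bank.shape)    # (8, 3, 3) -> (48, 3, 3)
```

Only the eight base filters are trainable; the other forty are tied to them, which is where the compression comes from.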
V. KNOWLEDGE DISTILLATION

To the best of our knowledge, exploiting knowledge transfer (KT) to compress models was first proposed by Caruana et al. [50]. They trained a compressed/ensemble model of strong classifiers with pseudo-labeled data and reproduced the output of the original larger network; however, the work is limited to shallow models. The idea was recently adopted in [51] as knowledge distillation (KD), which compresses deep and wide networks into shallower ones, where the compressed model mimics the function learned by the complex model. The main idea of KD based approaches is to shift knowledge from a large teacher model into a small one by learning the class distributions output via softmax.

The work in [52] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm in which the student is penalized according to a softened version of the teacher's output. The framework compresses an ensemble of teacher networks into a student network of similar depth, and the student is trained to predict both the teacher's output and the true classification labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [53] aimed to address the network compression problem by taking advantage of deep neural networks. It proposed an approach to train thin but deep networks, called FitNets, to compress wide and shallower (but still deep) networks, extending the idea to allow for thinner and deeper student models. In order to learn from the intermediate representations of the teacher network, FitNet makes the student mimic the full feature maps of the teacher.
However, such assumptions are too strict, since the capacities of the teacher and the student may differ greatly.

All the above approaches are validated on the MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets, and experimental results show that these methods match or outperform the teacher's performance while requiring notably fewer parameters and multiplications.

There are several extensions along this direction of distilling knowledge. The work in [54] trained a parametric student model to approximate a Monte Carlo teacher; the proposed framework used online training and deep neural networks for the student model. Different from previous works, which represented the knowledge using softened label probabilities, [55] represented the knowledge by using the neurons in the higher hidden layers, which preserve as much information as the label probabilities but are more compact. The work in [56] accelerated the experimentation process by instantaneously transferring knowledge from a previous network to each new, deeper or wider network; the techniques are based on the concept of function-preserving transformations between neural network specifications. Zagoruyko et al. [57] proposed Attention Transfer (AT) to relax the assumptions of FitNet: they transferred attention maps, which are summaries of the full activations.

Drawbacks: KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One is that KD can only be applied to classification tasks with a softmax loss function, which hinders its usage. Another drawback is that the model assumptions are sometimes too strict to make the performance competitive with other types of approaches.
VI. OTHER TYPES OF APPROACHES

We first summarize works utilizing attention-based methods. Attention-based mechanisms [58] can reduce computation significantly by learning to selectively focus or "attend" to a few task-relevant input regions. The work in [59] introduced the dynamic capacity network (DCN), which combined two types of modules: small sub-networks with low capacity and large ones with high capacity. The low-capacity sub-networks were active on the whole input to first find the task-relevant areas, and the attention mechanism was then used to direct the high-capacity sub-networks to focus on those task-relevant regions. In this way, the size of the CNN model is significantly reduced.

Following this direction, the work in [60] introduced the conditional computation idea, which only computes the gradient for some important neurons. It proposed a sparsely-gated mixture-of-experts layer (MoE). The MoE module consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of the experts to process each input. In [61], dynamic deep neural networks (D2NN) were introduced, a type of feed-forward deep neural network that selects and executes a subset of D2NN neurons based on the input.

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling [44], [62]. Network architectures such as GoogleNet or Network in Network can achieve state-of-the-art results on several benchmarks by adopting this idea. However, these architectures have not fully optimized the utilization of the computing resources inside the network. This problem was noted by Szegedy et al. [62] and motivated them to increase the depth and width of the network while keeping the computational budget constant.

The work in [63] targeted Residual Network based models with a spatially varying computation time, called stochastic depth, which enables the seemingly contradictory setup of training short networks while using deep networks at test time. It starts with very deep networks and, during training, randomly drops a subset of layers for each mini-batch and bypasses them with the identity function. Following this direction, the work in [64] proposed pyramidal residual networks with stochastic depth. In [65], Wu et al. proposed an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation. Veit et al. exploited convolutional networks with adaptive inference graphs to adaptively define their network topology conditioned on the input image [66].

Other approaches to reduce the convolutional overhead include using FFT based convolutions [67] and fast convolution using the Winograd algorithm [68]. Zhai et al. [69] proposed a strategy called stochastic spatial sampling pooling, which speeds up the pooling operations with a more general stochastic version. Saeedan et al. presented a novel pooling layer for convolutional neural networks, termed detail-preserving pooling (DPP), based on the idea of inverse bilateral filters [70]. These works only aim to speed up the computation, not to reduce the memory storage.
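A minimal sketch of the stochastic depth idea [63] described above (our own simplified illustration, not the published training recipe): during training, each residual block is kept with some survival probability and otherwise bypassed by the identity; at test time every block runs, with its residual branch scaled by that probability.

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual branch: one linear map followed by a ReLU."""
    return np.maximum(weight @ x, 0.0)

def stochastic_depth_forward(x, weights, survival_prob=0.8,
                             training=True, seed=0):
    """Stack of residual blocks with stochastic depth.
    Training: each block is skipped (identity only) with prob 1 - p.
    Inference: all blocks run, residuals scaled by p."""
    rng = np.random.default_rng(seed)
    for w in weights:
        if training:
            if rng.random() < survival_prob:
                x = x + residual_block(x, w)
            # else: block bypassed, x passes through unchanged
        else:
            x = x + survival_prob * residual_block(x, w)
    return x

dim, depth = 64, 10
weights = [np.random.randn(dim, dim) * 0.01 for _ in range(depth)]
x = np.random.randn(dim)
print(stochastic_depth_forward(x, weights, training=True).shape)   # (64,)
```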
VII. BENCHMARKS, EVALUATION AND DATABASES

In the past five years the deep learning community has made great efforts on benchmark models. One of the most well-known models used in compression and acceleration for CNNs is AlexNet [1], which has been occasionally used for assessing the performance of compression. Other popular standard models include LeNets [71], All-CNN-nets [72] and many others. LeNet-300-100 is a fully connected network with two hidden layers of 300 and 100 neurons, respectively. LeNet-5 is a convolutional network with two convolutional layers and two fully connected layers. Recently, more and more state-of-the-art architectures are used as baseline models, including network in networks (NIN) [73], VGG nets [74] and residual networks (ResNet) [75]. Table IV summarizes the baseline models commonly used in several typical compression methods.

TABLE IV
SUMMARIZATION OF BASELINE MODELS USED IN DIFFERENT REPRESENTATIVE WORKS OF NETWORK COMPRESSION.

Baseline Model            Representative Works
AlexNet [1]               structural matrix [29], [30], [32]; low-rank factorization [40]
Network in network [73]   low-rank factorization [40]
VGG nets [74]             transferred filters [44]; low-rank factorization [40]
Residual networks [75]    compact filters [49]; stochastic depth [63]; parameter sharing [24]
All-CNN-nets [72]         transferred filters [45]
LeNets [71]               parameter sharing [24]; parameter pruning [20], [22]

The standard criteria to measure the quality of model compression and acceleration are the compression rate and the speedup rate. Assume that a is the number of parameters in the original model M and a* is that of the compressed model M*; then the compression rate α(M, M*) of M* over M is

α(M, M*) = a / a*.    (8)

Another widely used measurement is the index space saving, defined in several papers [30], [35] as

β(M, M*) = (a - a*) / a*,    (9)

where a and a* are the number of dimensions of the index space in the original model and that of the compressed model, respectively.

Similarly, given the running time s of M and s* of M*, the speedup rate δ(M, M*) is defined as:

δ(M, M*) = s / s*.    (10)

Most work used the average training time per epoch to measure the running time, while in [30], [35] the average testing time was used. Generally, the compression rate and speedup rate are highly correlated, as smaller models often result in faster computation at both the training and the testing stages.

Good compression methods are expected to achieve almost the same performance as the original model with far fewer parameters and less computation time. However, for different applications with different CNN designs, the relation between parameter size and computation time may differ. For example, it is observed that for deep CNNs with fully connected layers, most of the parameters lie in the fully connected layers, while for image classification tasks, floating-point operations are mainly in the first few convolutional layers, since each filter is convolved with the whole image, which is usually very large at the beginning. Thus compression and acceleration of the network should focus on different types of layers for different applications.
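The three criteria in Eqs. (8)-(10) are simple ratios; the helper below just restates them in code for concreteness (function names and example numbers are ours and purely illustrative).

```python
def compression_rate(num_params_original, num_params_compressed):
    """Eq. (8): alpha = a / a*."""
    return num_params_original / num_params_compressed

def index_space_saving(num_params_original, num_params_compressed):
    """Eq. (9): beta = (a - a*) / a*."""
    return (num_params_original - num_params_compressed) / num_params_compressed

def speedup_rate(runtime_original, runtime_compressed):
    """Eq. (10): delta = s / s*."""
    return runtime_original / runtime_compressed

# Illustrative numbers only: a model pruned from 61M to 6.7M parameters,
# with inference time reduced from 20 ms to 8 ms.
print(compression_rate(61_000_000, 6_700_000))    # ~9.1
print(index_space_saving(61_000_000, 6_700_000))  # ~8.1
print(speedup_rate(0.020, 0.008))                 # 2.5
```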
VIII. DISCUSSION AND CHALLENGES

In this paper, we summarized recent efforts on compressing and accelerating deep neural networks (DNNs). Here we discuss in more detail how to choose among the different compression approaches, and possible challenges and solutions in this area.

A. General Suggestions

There is no golden rule to decide which approach is best. How to choose the proper method really depends on the application and its requirements. Here is some general guidance we can provide:

- If the application needs compact models derived from pre-trained models, one can choose either pruning & sharing or low-rank factorization based methods. If end-to-end solutions are needed, the low-rank and transferred convolutional filter approaches could be considered.

- For applications in some specific domains, methods with human prior (like the transferred convolutional filters and structural matrices) sometimes have benefits. For example, when doing medical image classification, transferred convolutional filters could work well, as medical images (e.g., of organs) do have the rotation transformation property.

- Usually the pruning & sharing approaches can give a reasonable compression rate without hurting the accuracy. Thus, for applications which require stable model accuracy, it is better to utilize pruning & sharing.

- If the problem involves small or medium-sized datasets, the knowledge distillation approaches can be tried. The compressed student model benefits from the knowledge transferred from the teacher model, making it robust on datasets which are not large.

- As we mentioned before, the techniques of the four groups are orthogonal, so it is reasonable to combine two or three of them to maximize the performance. For some specific applications, like object detection, which require both convolutional and fully connected layers, one can compress the convolutional layers with a low-rank based method and the fully connected layers with a pruning technique.

B. Technique Challenges

Techniques for deep model compression and acceleration are still at an early stage, and the following challenges still need to be addressed.

- Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structure, hyper-parameters). To handle more complicated tasks, more plausible ways to configure the compressed models are needed.

- Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, pruning channels can directly reduce the feature map width and shrink the model into a thinner one. This is efficient but also challenging, because removing channels might dramatically change the input of the following layer.

- As we mentioned before, methods based on structural matrices and transferred convolutional filters impose prior human knowledge on the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of such prior knowledge.

- The methods of knowledge distillation provide many benefits, such as directly accelerating the model without special hardware or implementations. It is still worthwhile to develop KD-based approaches further and explore how to improve their performance.
- Hardware constraints on various small platforms (e.g., mobile devices, robots, self-driving cars) remain a major obstacle hindering the extension of deep CNNs. How to make full use of the limited computational resources and how to design special compression methods for such platforms are still challenges that need to be addressed.

- Despite the great achievements of these compression approaches, the black-box mechanism is still a key barrier to their adoption. Exploring the interpretability of the compressed knowledge is still an important problem.

C. Possible Solutions

To solve the hyper-parameter configuration problem, we can rely on recent learning-to-learn strategies [76], [77]. This framework provides a mechanism allowing the algorithm to automatically learn how to exploit structure in the problem of interest. Very recently, leveraging reinforcement learning to efficiently sample the design space and improve model compression has also been tried [78].

Channel pruning provides efficiency benefits on both CPU and GPU because no special implementation is required, but handling the resulting input configuration is challenging. One possible solution is to use training-based channel pruning methods [79], which focus on imposing sparse constraints on weights during training. However, training from scratch for such methods is costly for very deep CNNs. In [80], the authors provided an iterative two-step algorithm to effectively prune channels in each layer.

Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the knowledge distillation (KD) approaches. Instead of directly reducing and transferring parameters, passing selectivity knowledge of neurons could be helpful: one can derive a way to select the essential neurons related to the task [81], [82]. The intuition is that if a neuron is activated in certain regions or samples, those regions or samples share some common properties that may relate to the task.

For methods based on convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operate on the spatial dimensions. Hence, to address the imposed-prior issue, one solution is to generalize the aforementioned approaches in two aspects: 1) instead of limiting the transformation to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2D filters or matrices, and 2) learn the transformation jointly with all the model parameters.

Regarding the use of CNNs on small platforms, proposing general/unified approaches is one direction. Wang et al. [83] presented a feature map dimensionality reduction method that excavates and removes redundancy in feature maps generated by different filters, while preserving the intrinsic information of the original network. The idea can be applied to make CNNs more applicable to different platforms. The work in [84] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning, to make deep CNNs work on mobile devices.

Beyond the classification task, people are also adapting the compacted models to other tasks [85]-[87]. We would like to see more work for applications with larger deep nets (e.g., video and image frames [88], [89]).

IX. ACKNOWLEDGMENTS

The authors would like to thank the reviewers and the broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help in revising the paper. This research is supported by the National Science Foundation of China under Grant number 61401169.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[3] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris, "Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification," CoRR, vol. abs/1611.05377, 2016.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in NIPS, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," CoRR, vol. abs/1412.6115, 2014.
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on cpus," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 1737-1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," International Conference on Learning Representations (ICLR), 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," CoRR, vol. abs/1612.01543, 2016.
[12] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems 28 (NIPS), 2015, pp. 3123-3131.
[13] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in ECCV, 2016.
[15] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," CoRR, vol. abs/1606.01981, 2016.
[16] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," CoRR, vol. abs/1611.01600, 2016.
[17] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," CoRR, vol. abs/1510.03009, 2015.
[18] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 177-185.
[19] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed., 1990, pp. 598-605.
[20] B. Hassibi, D. G. Stork, and S. C. R. Com, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993, pp. 164-171.
[21]S. Srinivas and R. V. Babu, “Data-free parameter pruning for deep neural [43]T. S. Cohen and M. Welling, “Group equivariant convolutional net-
networks,” inProceedings of the British Machine Vision Conference works,”arXiv preprint arXiv:1602.07576, 2016.
2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. [44]S. Zhai, Y. Cheng, and Z. M. Zhang, “Doubly convolutional neural
31.131.12. networks,” inAdvances In Neural Information Processing Systems, 2016,
[22]S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and pp. 10821090.
connections for efficient neural networks,” inProceedings of the 28th [45]W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and
International Conference on Neural Information Processing Systems, ser. improving convolutional neural networks via concatenated rectified
NIPS15, 2015. linear units,”arXiv preprint arXiv:1603.05201, 2016.
[23]W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Com- [46]H. Li, W. Ouyang, and X. Wang, “Multi-bias non-linear activation in
pressing neural networks with the hashing trick.” JMLR Workshop and deep neural networks,”arXiv preprint arXiv:1604.00676, 2016.
Conference Proceedings, 2015. [47]S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic
[24]K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural symmetry in convolutional neural networks,” inProceedings of the
network compression,”CoRR, vol. abs/1702.04008, 2017. 33rd International Conference on International Conference on Machine
[25]V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain Learning - Volume 48, ser. ICML16, 2016.
damage,” in2016 IEEE Conference on Computer Vision and Pattern [48]C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, resnet and the impact of residual connections on learning.”CoRR, vol.
pp. 25542564. abs/1602.07261, 2016.
[26]H. Zhou, J. M. Alvarez, and F. Porikli, “Less is more: Towards compact [49]B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified,
cnns,” inEuropean Conference on Computer Vision, Amsterdam, the small, low power fully convolutional neural networks for real-time object
Netherlands, October 2016, pp. 662677. detection for autonomous driving,”CoRR, vol. abs/1612.01051, 2016.
[27]W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured [50]C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,”ˇ
sparsity in deep neural networks,” inAdvances in Neural Information inProceedings of the 12th ACM SIGKDD International Conference on
Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, Knowledge Discovery and Data Mining, ser. KDD 06, 2006, pp. 535
I. Guyon, and R. Garnett, Eds., 2016, pp. 20742082. 541.
[28]H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning [51]J. Ba and R. Caruana, “Do deep nets really need to be deep?” in
filters for efficient convnets,”CoRR, vol. abs/1608.08710, 2016. Advances in Neural Information Processing Systems 27: Annual Confer-
[29]V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for ence on Neural Information Processing Systems 2014, December 8-13
small-footprint deep learning,” inAdvances in Neural Information Pro- 2014, Montreal, Quebec, Canada, 2014, pp. 26542662.
cessing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, [52]G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
and R. Garnett, Eds., 2015, pp. 30883096. neural network,”CoRR, vol. abs/1503.02531, 2015.
[30]Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. [53]A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
Chang, “An exploration of parameter redundancy in deep networks with Y. Bengio, “Fitnets: Hints for thin deep nets,”CoRR, vol. abs/1412.6550,
circulant projections,” inInternational Conference on Computer Vision 2014.
(ICCV), 2015. [54]A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling,
[31]Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and “Bayesian dark knowledge,” inAdvances in Neural Information Process-
S. Chang, “Fast neural networks with circulant projections,”CoRR, vol. ing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,
abs/1502.03436, 2015. and R. Garnett, Eds., 2015, pp. 34203428.
[32]Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, [55]P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, “Face model compression
and Z. Wang, “Deep fried convnets,” inInternational Conference on by distilling knowledge from neurons,” inProceedings of the Thirtieth
Computer Vision (ICCV), 2015. AAAI Conference on Artificial Intelligence, February 12-17, 2016,
[33]J. Chun and T. Kailath,Generalized Displacement Structure for Block- Phoenix, Arizona, USA., 2016, pp. 35603566.
Yu Cheng (yu.cheng@microsoft.com) is currently a Researcher at Microsoft. Before that, he was a Research Staff Member at the IBM T.J. Watson Research Center. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor's degree from Tsinghua University in 2010. His research is about deep learning in general, with specific interests in deep generative models, model compression, and transfer learning. He regularly serves on the program committees of top-tier AI conferences such as NIPS, ICML, ICLR, CVPR and ACL.

Duo Wang (d-wang15@mail.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree at the Department of Automation, Tsinghua University, Beijing, P.R. China. His research interests are about deep learning, particularly few-shot learning and deep generative models. He also works on applications in computer vision and robotic vision.

Pan Zhou (panzhou@hust.edu.cn) is currently an associate professor with the School of Electronic Information and Communications, Wuhan, China. He received his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2011. Before that, he received his B.S. degree in the Advanced Class of HUST, and an M.S. degree in the Department of Electronics and Information Engineering from HUST, Wuhan, China, in 2006 and 2008, respectively. His current research interests include big data analytics and machine learning, security and privacy, and information networks.

Tao Zhang (taozhang@mail.tsinghua.edu.cn) obtained his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and another Ph.D. degree from Saga University, Saga, Japan, in 2002, all in control engineering. He is currently a Professor with the Department of Automation, Tsinghua University, where he serves as Associate Dean of the School of Information Science and Technology and Head of the Department of Automation. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.

File diff suppressed because it is too large

View File

@@ -0,0 +1,391 @@
Channel Pruning for Accelerating Very Deep Neural Networks
Yihui He * Xiangyu Zhang Jian Sun
Xi'an Jiaotong University Megvii Inc. Megvii Inc.
Xi'an, 710049, China Beijing, 100190, China Beijing, 100190, China
heyihui@stu.xjtu.edu.cn zhangxiangyu@megvii.com sunjian@megvii.com
Abstract

In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5× speed-up along with only a 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet and Xception and suffers only 1.4% and 1.0% accuracy loss under 2× speed-up respectively, which is significant.

Figure 1. Structured simplification methods that accelerate CNNs: (a) a network with 3 conv layers. (b) sparse connection deactivates some connections between channels. (c) tensor factorization factorizes a convolutional layer into several pieces. (d) channel pruning reduces the number of channels in each layer (focus of this paper).
1. Introduction

Recent CNN acceleration works fall into three categories: optimized implementation (e.g., FFT [47]), quantization (e.g., BinaryNet [8]), and structured simplification that converts a CNN into a compact one [22]. This work focuses on the last one.

Structured simplification mainly involves: tensor factorization [22], sparse connection [17], and channel pruning [48]. Tensor factorization factorizes a convolutional layer into several efficient ones (Fig. 1(c)). However, the feature map width (number of channels) cannot be reduced, which makes it difficult to decompose the 1×1 convolutional layers favored by modern networks (e.g., GoogleNet [45], ResNet [18], Xception [7]). This type of method also introduces extra computation overhead. Sparse connection deactivates connections between neurons or channels (Fig. 1(b)). Though it is able to achieve a high theoretical speed-up ratio, the sparse convolutional layers have an "irregular" shape which is not implementation friendly. In contrast, channel pruning directly reduces feature map width, which shrinks a network into a thinner one, as shown in Fig. 1(d). It is efficient on both CPU and GPU because no special implementation is required.

Pruning channels is simple but challenging because removing channels in one layer might dramatically change the input of the following layer. Recently, training-based channel pruning works [1, 48] have focused on imposing sparse constraints on weights during training, which could adaptively determine hyper-parameters. However, training from scratch is very costly and results for very deep CNNs on ImageNet have rarely been reported. Inference-time attempts [31, 3] have focused on analysis of the importance of individual weights. The reported speed-up ratio is very limited.

In this paper, we propose a new inference-time approach for channel pruning, utilizing inter-channel redundancy. Inspired by the tensor factorization improvement based on feature map reconstruction [52], instead of analyzing filter weights [22, 31], we fully exploit redundancy within feature maps. Specifically, given a trained CNN model, pruning each layer is achieved by minimizing the reconstruction error on its output feature maps, as shown in Fig. 2. We solve this minimization problem by two alternating steps: channel selection and feature map reconstruction. In one step, we figure out the most representative channels, and prune redundant ones, based on LASSO regression. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps. Further, we approximate the network layer-by-layer, with the accumulated error accounted for. We also discuss methodologies to prune multi-branch networks (e.g., ResNet [18], Xception [7]).

For VGG-16, we achieve 4× acceleration, with only a 1.0% increase of top-5 error. Combined with tensor factorization, we reach 5× acceleration but merely suffer a 0.3% increase of error, which outperforms previous state-of-the-arts. We further speed up ResNet-50 and Xception-50 by 2× with only 1.4% and 1.0% accuracy loss respectively.

* This work was done when Yihui He was an intern at Megvii Inc.
Figure 2. Channel pruning for accelerating a convolutional layer. We aim to reduce the width of feature map B, while minimizing the reconstruction error on feature map C. Our optimization algorithm (Sec. 3.1) performs within the dotted box, which does not involve nonlinearity. This figure illustrates the situation where two channels are pruned for feature map B. Thus the corresponding channels of filters W can be removed. Furthermore, even though not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels for feature maps B and C; kh × kw: kernel size.

2. Related Work

There has been a significant amount of work on accelerating CNNs. Many of them fall into three categories: optimized implementation [4], quantization [40], and structured simplification [22].

Optimized implementation based methods [35, 47, 27, 4] accelerate convolution with special convolution algorithms like FFT [47]. Quantization [8, 40] reduces floating point computational complexity.

Sparse connection eliminates connections between neurons [17, 32, 29, 15, 14]. [51] prunes connections based on weight magnitude. [16] could accelerate fully connected layers by up to 50×. However, in practice, the actual speed-up may be very related to implementation.

Tensor factorization [22, 28, 13, 24] decomposes weights into several pieces. [50, 10, 12] accelerate fully connected layers with truncated SVD. [52] factorizes a layer into 3×3 and 1×1 combinations, driven by feature map redundancy.

Channel pruning removes redundant channels on feature maps. There are several training-based approaches. [1, 48] regularize networks to improve accuracy. Channel-wise SSL [48] reaches a high compression ratio for the first few conv layers of LeNet [30] and AlexNet [26]. However, training-based approaches are more costly, and their effectiveness for very deep networks on large datasets is rarely exploited.

Inference-time channel pruning is challenging, as reported by previous works [2, 39]. Some works [44, 34, 19] focus on model size compression, which mainly operates on the fully connected layers. For data-free approaches [31, 3], results at large speed-up ratios (e.g., 5×) have not been reported, and they require a long retraining procedure. [3] selects channels via over 100 random trials; however, it needs a long time to evaluate each trial on a deep network, which makes it infeasible to work on very deep models and large datasets. From our observation, [31] is sometimes even worse than the naive solution (Sec. 4.1.1).
3. Approach

In this section, we first propose a channel pruning algorithm for a single layer, then generalize this approach to multiple layers or the whole model. Furthermore, we discuss variants of our approach for multi-branch networks.

3.1. Formulation

Fig. 2 illustrates our channel pruning algorithm for a single convolutional layer. We aim to reduce the width of feature map B, while maintaining the outputs in feature map C. Once channels are pruned, we can remove the corresponding channels of the filters that take these channels as input. Also, the filters that produce these channels can be removed. It is clear that channel pruning involves two key points. The first is channel selection, since we need to select the most representative channels to maintain as much information as possible. The second is reconstruction: we need to reconstruct the following feature maps using the selected channels.

Motivated by this, we propose an iterative two-step algorithm. In one step, we aim to select the most representative channels. Since an exhaustive search is infeasible even for tiny networks, we come up with a LASSO regression based method to figure out representative channels and prune redundant ones. In the other step, we reconstruct the outputs with the remaining channels with linear least squares. We alternatively take the two steps.

Formally, to prune a feature map with c channels, we consider applying n × c × kh × kw convolutional filters W on N × c × kh × kw input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and kh, kw are the kernel size. For simple representation, the bias term is not included in our formulation. To prune the input channels from c to the desired c' (0 ≤ c' ≤ c), while minimizing the reconstruction error, we formulate our problem as follows:
$$\arg\min_{\beta,\,W}\;\frac{1}{2N}\Big\|Y-\sum_{i=1}^{c}\beta_i X_i W_i^\top\Big\|_F^2 \quad\text{subject to}\quad \|\beta\|_0\le c' \tag{1}$$

Here ||·||_F is the Frobenius norm. X_i is the N × (kh kw) matrix sliced from the ith channel of the input volumes X, i = 1, ..., c. W_i is the n × (kh kw) filter weights sliced from the ith channel of W. β is a coefficient vector of length c for channel selection, and β_i is the ith entry of β. Notice that, if β_i = 0, X_i will no longer be useful and could be safely pruned from the feature map; W_i could also be removed.

Optimization

Solving this l0 minimization problem in Eqn. 1 is NP-hard. Therefore, we relax the l0 to l1 regularization:

$$\arg\min_{\beta,\,W}\;\frac{1}{2N}\Big\|Y-\sum_{i=1}^{c}\beta_i X_i W_i^\top\Big\|_F^2+\lambda\|\beta\|_1 \quad\text{subject to}\quad \|\beta\|_0\le c',\;\;\forall i\;\|W_i\|_F=1 \tag{2}$$

λ is a penalty coefficient. By increasing λ, there will be more zero terms in β and one can get a higher speed-up ratio. We also add a constraint ∀i ||W_i||_F = 1 to this formulation, which avoids the trivial solution.

Now we solve this problem in two folds. First, we fix W and solve β for channel selection. Second, we fix β and solve W to minimize the reconstruction error.

(i) The subproblem of β. In this case, W is fixed. We solve β for channel selection. This problem can be solved by LASSO regression [46, 5], which is widely used for model selection:

$$\hat{\beta}^{\mathrm{LASSO}}(\lambda)=\arg\min_{\beta}\;\frac{1}{2N}\Big\|Y-\sum_{i=1}^{c}\beta_i Z_i\Big\|_F^2+\lambda\|\beta\|_1 \quad\text{subject to}\quad \|\beta\|_0\le c' \tag{3}$$

Here Z_i = X_i W_i^T (of size N × n). We will ignore the ith channel if β_i = 0.

(ii) The subproblem of W. In this case, β is fixed. We utilize the selected channels to minimize the reconstruction error. We can find the optimized solution by least squares:

$$\arg\min_{W'}\;\big\|Y-X'(W')^\top\big\|_F^2 \tag{4}$$

Here X' = [β_1 X_1, β_2 X_2, ..., β_i X_i, ..., β_c X_c] (of size N × c kh kw). W' is the n × c kh kw reshaped W, W' = [W_1, W_2, ..., W_i, ..., W_c]. After obtaining the result W', it is reshaped back to W. Then we assign β_i ← β_i ||W_i||_F and W_i ← W_i / ||W_i||_F, so the constraint ∀i ||W_i||_F = 1 is satisfied.

We alternatively optimize (i) and (ii). In the beginning, W is initialized from the trained model, λ = 0, namely no penalty, and ||β||_0 = c. We gradually increase λ. For each change of λ, we iterate these two steps until ||β||_0 is stable. After ||β||_0 ≤ c' is satisfied, we obtain the final solution W from {β_i W_i}. In practice, we found that the two-step iteration is time consuming, so we apply (i) multiple times, until ||β||_0 ≤ c' is satisfied, then apply (ii) just once to obtain the final result. From our observation, this result is comparable with the two-step iterations. Therefore, in the following experiments, we adopt this approach for efficiency.

Discussion: Some recent works [48, 1, 17] (though training based) also introduce the l1-norm or LASSO. However, we must emphasize that we use different formulations. Many of them introduced sparsity regularization into the training loss, instead of explicitly solving LASSO. Other work [1] solved LASSO, while feature maps or data were not considered during optimization. Because of these differences, our approach can be applied at inference time.

3.2. Whole Model Pruning

Inspired by [52], we apply our approach layer by layer sequentially. For each layer, we obtain input volumes from the current input feature map, and output volumes from the output feature map of the un-pruned model. This can be formalized as:

$$\arg\min_{\beta,\,W}\;\frac{1}{2N}\Big\|Y'-\sum_{i=1}^{c}\beta_i X_i W_i^\top\Big\|_F^2 \quad\text{subject to}\quad \|\beta\|_0\le c' \tag{5}$$

Different from Eqn. 1, Y is replaced by Y', which is from the feature map of the original model. Therefore, the accumulated error can be accounted for during sequential pruning.
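To make the two subproblems concrete, here is a minimal sketch of the single-layer step in Python, assuming the sampled input volumes have already been unrolled into per-channel slices X_i (N × kh·kw) and filter slices W_i (n × kh·kw), with Y the N × n outputs sampled from the unpruned model. It uses scikit-learn's Lasso for step (i) and a plain least-squares fit for step (ii); the gradual λ schedule, the ||W_i||_F = 1 renormalization and the multi-branch handling are omitted, and the function names are illustrative, not the authors' released code.

import numpy as np
from sklearn.linear_model import Lasso

def select_channels(X_slices, W_slices, Y, c_prime, lam=1e-4):
    """Step (i): LASSO channel selection. Returns a boolean mask over the c input channels."""
    c = len(X_slices)
    # Z_i = X_i W_i^T has shape N x n; stack the c responses as features of a linear model in beta.
    Z = np.stack([X_slices[i] @ W_slices[i].T for i in range(c)], axis=-1)  # N x n x c
    Z = Z.reshape(-1, c)            # (N*n) x c design matrix
    y = Y.reshape(-1)               # flattened targets
    # Coarsely increase the penalty until at most c' coefficients remain non-zero.
    while True:
        beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(Z, y).coef_
        if np.count_nonzero(beta) <= c_prime:
            return beta != 0
        lam *= 2.0

def reconstruct_weights(X_slices, Y, keep):
    """Step (ii): least-squares refit of the filters on the kept channels only."""
    X_kept = np.concatenate([X_slices[i] for i in np.where(keep)[0]], axis=1)  # N x (c'*kh*kw)
    W_prime, *_ = np.linalg.lstsq(X_kept, Y, rcond=None)                       # (c'*kh*kw) x n
    return W_prime.T                                                           # new filters, n x (c'*kh*kw)

For whole model pruning (Sec. 3.2), the same two calls would be applied layer by layer, with Y always taken from the un-pruned model's feature maps so that the accumulated error is accounted for.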
3.3. Pruning Multi-Branch Networks

The whole model pruning discussed above is enough for single-branch networks like LeNet [30], AlexNet [26] and VGG Nets [43]. However, it is insufficient for multi-branch networks like GoogLeNet [45] and ResNet [18]. We mainly focus on pruning the widely used residual structure (e.g., ResNet [18], Xception [7]). Given a residual block shown in Fig. 3 (left), the input bifurcates into a shortcut and a residual branch. On the residual branch, there are several convolutional layers (e.g., 3 convolutional layers with spatial sizes of 1×1, 3×3, 1×1, Fig. 3, left). Layers other than the first and last can be pruned as described previously. For the first layer, the challenge is that the large input feature map width (for ResNet, 4 times its output) can't be easily pruned, since it is shared with the shortcut. For the last layer, accumulated error from the shortcut is hard to recover, since there is no parameter on the shortcut. To address these challenges, we propose several variants of our approach as follows.

Figure 3. Illustration of multi-branch enhancement for a residual block. Left: original residual block. Right: pruned residual block with enhancement; c_x denotes the feature map width. Input channels of the first convolutional layer are sampled, so that the large input feature map width can be reduced. As for the last layer, rather than approximating Y2, we try to approximate Y1 + Y2 directly (Sec. 3.3, Last layer of residual branch).

Last layer of residual branch: Shown in Fig. 3, the output layer of a residual block consists of two inputs: feature maps Y1 and Y2 from the shortcut and the residual branch. We aim to recover Y1 + Y2 for this block. Here, Y1 and Y2 are the original feature maps before pruning. Y2 can be approximated as in Eqn. 1. However, the shortcut branch is parameter-free, so Y1 cannot be recovered directly. To compensate this error, the optimization goal of the last layer is changed from Y2 to Y1 - Y1' + Y2, which does not change our optimization. Here, Y1' is the current feature map after the previous layers are pruned. When pruning, volumes should be sampled correspondingly from these two branches.

First layer of residual branch: Illustrated in Fig. 3 (left), the input feature map of the residual block cannot be pruned, since it is also shared with the shortcut branch. In this condition, we can perform feature map sampling before the first convolution to save computation. We still apply our algorithm as in Eqn. 1. Differently, we sample the selected channels on the shared feature maps to construct a new input for the later convolution, shown in Fig. 3 (right). The computational cost of this operation can be ignored. More importantly, after introducing feature map sampling, the convolution is still "regular".

Filter-wise pruning is another option for the first convolution on the residual branch. Since the input channels of the parameter-free shortcut branch cannot be pruned, we apply our Eqn. 1 to each filter independently (each filter chooses its own representative input channels). Under single layer acceleration, filter-wise pruning is more accurate than our original one. From our experiments, it improves top-5 accuracy by 0.5% for 2× ResNet-50 (applied on the first layer of each residual branch) without fine-tuning. However, after fine-tuning, there is no noticeable improvement. In addition, it outputs "irregular" convolutional layers, which need special library implementation support. We do not adopt it in the following experiments.
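As a rough illustration of the two residual-branch variants above, a short sketch follows; the tensor layout and names are our assumptions, not the authors' code.

import numpy as np

def sample_first_layer_input(x, kept_channels):
    # First layer of residual branch: the shared block input cannot be pruned,
    # so we only gather the selected channels before the branch's first convolution.
    return x[:, kept_channels, :, :]          # x assumed to be N x c x H x W

def last_layer_target(Y1, Y2, Y1_current):
    # Last layer of residual branch: fit the pruned layer to Y1 + Y2 - Y1',
    # compensating the parameter-free shortcut for error accumulated in earlier layers.
    return Y1 + Y2 - Y1_current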
4. Experiment

We evaluate our approach on the popular VGG Nets [43], ResNet [18] and Xception [7], on ImageNet [9], CIFAR-10 [25] and PASCAL VOC 2007 [11].

For Batch Normalization [21], we first merge it into the convolutional weights, which does not affect the outputs of the networks, so that each convolutional layer is followed by ReLU [36]. We use Caffe [23] for deep network evaluation, and scikit-learn [38] for the solver implementation. For channel pruning, we found that it is enough to extract 5000 images, with 10 samples per image. On ImageNet, we evaluate the top-5 accuracy with a single view. Images are resized such that the shorter side is 256. The testing is on a center crop of 224×224 pixels. We can gain more performance with fine-tuning. We use a batch size of 128 and a learning rate of 1e-5. We fine-tune our pruned models for 10 epochs. The augmentation for fine-tuning is random crop of 224×224 and mirror.
We aim to recoverY1 + Y 2 for this block. Here,Y1 ,Y2 4.1. Experiments with VGG­16 are the original feature maps before pruning.Y2 could be
approximated as in Eqn.1. However, shortcut branch is VGG-16 [43] is a 16 layers single path convolutional
parameter-free, thenY neural network, with 13 convolutional layers. It is widely 1 could not be recovered directly. To
compensate this error, the optimization goal of the last layer used in recognition, detection and segmentation,etc. Single
is changed fromY view top-5 accuracy for VGG-16 is 89.9% 1 .2 toY1 Y +Y, which does not change 1 2
our optimization. Here,Y is the current feature map after 1 previous layers pruned. When pruning, volumes should be 4.1.1 Single Layer Pruning
sampled correspondingly from these two branches. In this subsection, we evaluate single layer acceleration per-First layer of residual branch: Illustrated in formance using our algorithm in Sec.3.1. For better under-Fig.3(left), the input feature map of the residual block standing, we compare our algorithm with two naive chan-could not be pruned, since it is also shared with the short- nel selection strategies.first kselects the firstkchannels.cut branch. In this condition, we could performfeature max responseselects channels based on corresponding fil-map samplingbefore the first convolution to save compu- ters that have high absolute weights sum [31]. For fair com-tation. We still apply our algorithm as Eqn.1. Differently, parison, we obtain the feature map indexes selected by eachwe sample the selected channels on the shared feature maps of them, then perform reconstruction (Sec. 3.1(ii)). We to construct a new input for the later convolution, shown hope that this could demonstrate the importance of channelin Fig.3(right). Computational cost for this operation could selection. Performance is measured by increase of error af-be ignored. More importantly, after introducingfeature map ter a certain layer is pruned without fine-tuning, shown insampling, the convolution is still ”regular”. Fig.4.Filter-wise pruningis another option for the first con- As expected, error increases as speed-up ratio increases.volution on the residual branch. Since the input channels Our approach is consistently better than other approaches inof parameter-free shortcut branch could not be pruned, we different convolutional layers under different speed-up ra-apply our Eqn.1to each filter independently (each fil- tio. Unexpectedly, sometimesmax responseis even worseter chooses its own representative input channels). Under thanfirst k. We argue thatmax responseignores correla-single layer acceleration,filter-wise pruningis more accu- tions between different filters. Filters with large absoluterate than our original one. From our experiments, it im- weight may have strong correlation. Thus selection based proves 0.5% top-5 accuracy for2×ResNet-50 (applied on on filter weights is less meaningful. Correlation on featurethe first layer of each residual branch) without fine-tuning. maps is worth exploiting. We can find that channel selectionHowever, after fine-tuning, theres no noticeable improve-
ment. In addition, it outputs ”irregular” convolutional lay- 1 http://www.vlfeat.org/matconvnet/pretrained/
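For reference, the two naive baselines can be written as simple index selectors; this is a sketch assuming filters W of shape (n, c, kh, kw), with names that are ours rather than the paper's.

import numpy as np

def first_k(W, k):
    # Keep the first k input channels regardless of the weights.
    return np.arange(k)

def max_response(W, k):
    # Rank input channels by the absolute sum of the corresponding filter weights [31].
    importance = np.abs(W).sum(axis=(0, 2, 3))
    return np.argsort(importance)[::-1][:k]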
Figure 4. Single layer performance analysis under different speed-up ratios (without fine-tuning), measured by increase of error. To verify the importance of channel selection referred to in Sec. 3.1, we considered two naive baselines: first k selects the first k feature maps; max response selects channels based on the absolute sum of the corresponding filter weights [31]. Our approach is consistently better (smaller is better). Panels: conv1_1, conv2_1, conv3_1, conv3_2, conv4_1, conv4_2; x-axis: speed-up ratio (1.0 to 4.0); y-axis: increase of error (%).
Also notice that channel pruning gradually becomes harder from shallower to deeper layers. This indicates that shallower layers have much more redundancy, which is consistent with [52]. We can prune more aggressively on shallower layers in whole model acceleration.

Increase of top-5 error (1-view, baseline 89.9%)
Solution                                      2×     4×     5×
Jaderberg et al. [22] ([52]'s impl.)          -      9.7    29.7
Asym. [52]                                    0.28   3.84   -
Filter pruning [31] (fine-tuned, our impl.)   0.8    8.6    14.6
Ours (without fine-tune)                      2.7    7.9    22.0
Ours (fine-tuned)                             0      1.0    1.7
Table 1. Accelerating the VGG-16 model [43] using a speed-up ratio of 2×, 4×, or 5× (smaller is better).

4.1.2 Whole Model Pruning

Whole model acceleration results under 2×, 4×, and 5× are shown in Table 1. We adopt the whole model pruning proposed in Sec. 3.2. Guided by the single layer experiments above, we prune more aggressively for shallower layers. The ratio of remaining channels for shallow layers (conv1_x to conv3_x) and deep layers (conv4_x) is 1 : 1.5. conv5_x are not pruned, since they only contribute 9% of the total computation and are not redundant.

After fine-tuning, we can reach 2× speed-up without losing accuracy. Under 4×, we only suffer a 1.0% drop. Consistent with the single layer analysis, our approach outperforms the previous channel pruning approach (Li et al. [31]) by a large margin. This is because we fully exploit channel redundancy within feature maps. Compared with tensor factorization algorithms, our approach is better than Jaderberg et al. [22], without fine-tuning. Though worse than Asym. [52], our combined model outperforms its combined Asym. 3D (Table 2). This may indicate that channel pruning is more challenging than tensor factorization, since removing channels in one layer might dramatically change the input of the following layer. However, channel pruning keeps the original model architecture, does not introduce additional layers, and the absolute speed-up ratio on GPU is much higher (Table 3).
Since our approach exploits a new cardinality, we further combine our channel pruning with spatial factorization [22] and channel factorization [52]. Demonstrated in Table 2, our 3-cardinality acceleration (spatial factorization, channel factorization, and channel pruning, denoted by 3C) outperforms previous state-of-the-art results. Asym. 3D [52] (spatial and channel factorization) factorizes a convolutional layer into three parts: 1×3, 3×1, 1×1. We apply spatial factorization, channel factorization, and our channel pruning together sequentially layer-by-layer. We fine-tune the accelerated models for 20 epochs, since they are 3 times deeper than the original ones. After fine-tuning, our 4× model suffers no degradation. Clearly, a combination of different acceleration techniques is better than any single one. This indicates that a model is redundant in each cardinality.

Increase of top-5 error (1-view, 89.9%)
Solution                        4×     5×
Asym. 3D [52]                   0.9    2.0
Asym. 3D (fine-tuned) [52]      0.3    1.0
Our 3C                          0.7    1.3
Our 3C (fine-tuned)             0.0    0.3
Table 2. Performance of combined methods on the VGG-16 model [43] using a speed-up ratio of 4× or 5×. Our 3C solution outperforms previous approaches (smaller is better).

4.1.3 Comparisons of Absolute Performance

We further evaluate the absolute performance of acceleration on GPU. Results in Table 3 are obtained under Caffe [23], CUDA 8 [37] and cuDNN 5 [6], with a mini-batch of 32 on a GPU (GeForce GTX TITAN X). Results are averaged over 50 runs. Tensor factorization approaches decompose weights into too many pieces, which heavily increases overhead; they could not gain much absolute speed-up. Though our approach also encounters some performance degradation, it generalizes better on GPU than other approaches. Our results for tensor factorization differ from previous research [52, 22], maybe because the current library and hardware prefer a single large convolution instead of several small ones.

Model          Solution                               Increased err.   GPU time/ms
VGG-16         -                                      0                8.144
VGG-16 (4×)    Jaderberg et al. [22] ([52]'s impl.)   9.7              8.051 (1.01×)
               Asym. [52]                             3.8              5.244 (1.55×)
               Asym. 3D [52]                          0.9              8.503 (0.96×)
               Asym. 3D (fine-tuned) [52]             0.3              8.503 (0.96×)
               Ours (fine-tuned)                      1.0              3.264 (2.50×)
Table 3. GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better).

4.1.4 Comparisons with Training from Scratch

Though training a compact model from scratch is time-consuming (usually 120 epochs), it is worth comparing our approach with from-scratch counterparts. To be fair, we evaluated both the from-scratch counterpart and a normal-setting network that has the same computational complexity and the same architecture.

Shown in Table 4, we observed that it is difficult for from-scratch counterparts to reach competitive accuracy; our model outperforms the from-scratch one. Our approach successfully picks out informative channels and constructs highly compact models. We can safely draw the conclusion that the same model is difficult to obtain from scratch. This coincides with architecture design research [20, 1] showing that a model could be easier to train if there are more channels in shallower layers. However, channel pruning favors shallower layers.

For from scratch (uniformed), the filters in each layer are reduced by half (e.g., conv1_1 is reduced from 64 to 32). We can observe that normal-setting networks of the same complexity couldn't reach the same accuracy either. This consolidates our idea that there is much redundancy in networks while training. However, redundancy can be opted out of at inference time. This may be an advantage of inference-time acceleration approaches over training-based approaches.

Notice that there is a 0.6% gap between the from-scratch model and the uniformed one, which indicates that there is room for model exploration. Adopting our approach is much faster than training a model from scratch, even for a thinner one. Further research could extend our approach to thin model exploration.

Original (acc. 89.9%)       Top-5 err.   Increased err.
From scratch                11.9         1.8
From scratch (uniformed)    12.5         2.4
Ours                        18.0         7.9
Ours (fine-tuned)           11.1         1.0
Table 4. Comparisons with training from scratch, under 4× acceleration. Our fine-tuned model outperforms scratch-trained counterparts (smaller is better).
4.1.5 Acceleration for Detection

VGG-16 is popular among object detection tasks [42, 41, 33]. We evaluate the transfer learning ability of our 2×/4× pruned VGG-16, for Faster R-CNN [42] object detection. The PASCAL VOC 2007 object detection benchmark [11] contains 5k trainval images and 5k test images. The performance is evaluated by mean Average Precision (mAP). In our experiments, we first perform channel pruning for VGG-16 on ImageNet. Then we use the pruned model as the pre-trained model for Faster R-CNN.

The actual running time of Faster R-CNN is 220ms/image. The convolutional layers contribute about 64%. We get an actual time of 94ms for 4× acceleration. From Table 5, we observe a 0.4% mAP drop for our 2× model, which is not harmful for practical consideration.

Speedup     mAP    ∆mAP
Baseline    68.7   -
2×          68.3   0.4
4×          66.9   1.8
Table 5. 2×, 4× acceleration for Faster R-CNN detection.

4.2. Experiments with Residual Architecture Nets

For multi-path networks [45, 18, 7], we further explore the popular ResNet [18] and the latest Xception [7], on ImageNet and CIFAR-10. Pruning residual architecture nets is more challenging. These networks are designed for both efficiency and high accuracy. Tensor factorization algorithms [52, 22] have difficulty accelerating these models. Spatially, the 1×1 convolution is favored, which can hardly be factorized.

4.2.1 ResNet Pruning

ResNet complexity uniformly drops on each residual block. Guided by the single layer experiments (Sec. 4.1.1), we still prefer reducing shallower layers more heavily than deeper ones.

Following a similar setting as Filter pruning [31], we keep 70% of the channels for sensitive residual blocks (res5 and blocks close to the position where the spatial size changes, e.g. res3a, res3d). As for other blocks, we keep 30% of the channels. With multi-branch enhancement, we prune branch2a more aggressively within each residual block. The remaining channel ratios for branch2a, branch2b, branch2c are 2 : 4 : 3 (e.g., given 30%, we keep 40%, 80%, 60% respectively).

We evaluate the performance of the multi-branch variants of our approach (Sec. 3.3). From Table 6, we improve by 4.0% with our multi-branch enhancement. This is because we account for the accumulated error from the shortcut connection, which can broadcast to every layer after it, and the large input feature map width at the entry of each residual block is well reduced by our feature map sampling.

Solution                      Increased err.
Ours                          8.0
Ours (enhanced)               4.0
Ours (enhanced, fine-tuned)   1.4
Table 6. 2× acceleration for ResNet-50 on ImageNet; the baseline network's top-5 accuracy is 92.2% (one view). We improve performance with multi-branch enhancement (Sec. 3.3, smaller is better).

4.2.2 Xception Pruning

Since computational complexity has become important in model design, separable convolution has received much attention [49, 7]. Xception [7] is already spatially optimized, and tensor factorization on its 1×1 convolutional layers is destructive. Thanks to our approach, it can still be accelerated with graceful degradation. For the ease of comparison, we adopt Xception convolution on ResNet-50, denoted by Xception-50. Based on ResNet-50, we swap all convolutional layers with spatial conv blocks. To keep the same computational complexity, we increase the input channels of all branch2b layers by 2×. The baseline Xception-50 has a top-5 accuracy of 92.8% and a complexity of 4450 MFLOPs.

We apply the multi-branch variants of our approach as described in Sec. 3.3, and adopt the same pruning ratio setting as for ResNet in the previous section. Maybe because the Xception block is unstable, Batch Normalization layers must be maintained during pruning; otherwise it becomes nontrivial to fine-tune the pruned model.

Shown in Table 7, after fine-tuning, we only suffer a 1.0% increase of error under 2×. Filter pruning [31] can also be applied to Xception, though it is designed for small speed-up ratios. Without fine-tuning, its top-5 error is 100%. After training 20 epochs, which is like training from scratch, the increased error reaches 4.3%. Our results for Xception-50 are not as graceful as the results for VGG-16, since modern networks tend to have less redundancy by design.

Solution                                      Increased err.
Filter pruning [31] (our impl.)               92.8
Filter pruning [31] (fine-tuned, our impl.)   4.3
Ours                                          2.9
Ours (fine-tuned)                             1.0
Table 7. Comparisons for Xception-50, under 2× acceleration ratio. The baseline network's top-5 accuracy is 92.8%. Our approach outperforms previous approaches. Most structured simplification methods are not effective on the Xception architecture (smaller is better).
4.2.3 Experiments on CIFAR-10

Even though our approach is designed for large datasets, it generalizes well on small datasets. We perform experiments on the CIFAR-10 dataset [25], which is favored by many acceleration researches. It consists of 50k images for training and 10k for testing in 10 classes.

We reproduce ResNet-56, which has an accuracy of 92.8% (for reference, the official ResNet-56 [18] has an accuracy of 93.0%). For 2× acceleration, we follow a similar setting as Sec. 4.2.1 (keep the final stage unchanged, where the spatial size is 8×8). Shown in Table 8, our approach is competitive with the scratch-trained one, without fine-tuning, under 2× speed-up. After fine-tuning, our result is significantly better than Filter pruning [31] and the scratch-trained one.

Solution                                      Increased err.
Filter pruning [31] (fine-tuned, our impl.)   1.3
From scratch                                  1.9
Ours                                          2.0
Ours (fine-tuned)                             1.0
Table 8. 2× speed-up comparisons for ResNet-56 on CIFAR-10; the baseline accuracy is 92.8% (one view). We outperform previous approaches and the scratch-trained counterpart (smaller is better).

5. Conclusion

To conclude, current deep CNNs are accurate but have high inference costs. In this paper, we have presented an inference-time channel pruning method for very deep networks. The reduced CNNs are inference-efficient networks that maintain accuracy and only require off-the-shelf libraries. Compelling speed-ups and accuracy are demonstrated for both VGG Net and ResNet-like networks on ImageNet, CIFAR-10 and PASCAL VOC.

In the future, we plan to involve our approach at training time, instead of inference time only, which may also accelerate the training procedure.

References

[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2262-2270, 2016.
[2] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. arXiv preprint arXiv:1512.08571, 2015.
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.
[4] H. Bagherinezhad, M. Rastegari, and A. Farhadi. LCNN: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
[5] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373-384, 1995.
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248-255. IEEE, 2009.
[10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269-1277, 2014.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[12] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[13] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[14] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pages 1379-1387, 2016.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243-254. IEEE Press, 2016.
[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[17] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[19] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
[25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[27] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[29] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. arXiv preprint arXiv:1506.02515, 2015.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[32] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806-814, 2015.
[33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.
[34] Z. Mariet and S. Sra. Diversity networks. arXiv preprint arXiv:1511.05077, 2015.
[35] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[36] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40-53, 2008.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[39] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163-2175, 2015.
[40] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525-542. Springer, 2016.
[41] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[42] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[47] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[48] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074-2082, 2016.
[49] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[50] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In INTERSPEECH, pages 2365-2369, 2013.
[51] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
[52] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943-1955, 2016.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,261 @@
Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell Ananya Ganesh Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, aganesh, mccallum}@cs.umass.edu
Abstract

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

Consumption                           CO2e (lbs)
Air travel, 1 passenger, NY↔SF        1984
Human life, avg, 1 year               11,023
American life, avg, 1 year            36,156
Car, avg incl. fuel, 1 lifetime       126,000

Training one model (GPU)
NLP pipeline (parsing, SRL)           39
  w/ tuning & experimentation         78,468
Transformer (big)                     192
  w/ neural architecture search       626,155

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.[1]

[1] Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
1 Introduction

Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.

Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited to the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.

To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) Time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.
Consumer          Renew.   Gas    Coal   Nuc.
China             22%      3%     65%    4%
Germany           40%      7%     38%    13%
United States     17%      35%    27%    19%
Amazon-AWS        17%      24%    30%    26%
Google            56%      14%    15%    10%
Microsoft         32%      23%    31%    10%

Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,[4] China[5] and Germany (Burger, 2019).

2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.

We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface[2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.[3]

We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total power p_t required at a given instance during training is given by:

$$p_t=\frac{1.58\,t\,(p_c+p_r+g\,p_g)}{1000} \tag{1}$$

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

$$\mathrm{CO_2e}=0.954\,p_t \tag{2}$$

This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt hour of compute energy used.

[2] nvidia-smi: https://bit.ly/30sGEbi
[3] RAPL power meter: https://bit.ly/2LObQhV
[4] U.S. Dept. of Energy: https://bit.ly/2JTbGnI
[5] China Electricity Council; trans. China Energy Portal: https://bit.ly/2QHE5O3
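To illustrate the bookkeeping behind Eqn. 1 and Eqn. 2, the following is a small Python sketch; the nvidia-smi query flags are standard, but the sampling setup, names and the example numbers are our assumptions rather than the authors' measurement scripts.

import subprocess

PUE = 1.58                 # 2018 global average data-center PUE (Ascierto, 2018)
LBS_CO2_PER_KWH = 0.954    # EPA average lbs CO2 per kWh for U.S. power (EPA, 2018)

def sample_gpu_power_watts():
    # One nvidia-smi sample of the power draw summed over all visible GPUs.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"])
    return sum(float(v) for v in out.decode().split())

def total_kwh(hours, p_c, p_r, p_g, g):
    # Eqn. 1: combined CPU, DRAM and GPU draw (watts), scaled by PUE, in kWh.
    return PUE * hours * (p_c + p_r + g * p_g) / 1000.0

def co2e_lbs(kwh):
    # Eqn. 2: estimated CO2 emissions in pounds.
    return LBS_CO2_PER_KWH * kwh

# Illustrative placeholder draws (watts), not measured values:
print(co2e_lbs(total_kwh(hours=84, p_c=90.0, p_r=30.0, p_g=170.0, g=8)))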
2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.

Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in §4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.

ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pre-trained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).

BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).

GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.[6]

3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.

Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.

[6] Via the authors on Reddit.
[7] GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at $0.43-$0.74/hr; upper bound uses on-demand U.S. resources priced at $1.46-$2.48/hr. We similarly use pre-emptible ($1.46/hr-$2.40/hr) and on-demand ($4.50/hr-$8/hr) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.
Transformer base | P100x8 | 1415.78 | 12 | 27 | 26 | $41-$140
Transformer big | P100x8 | 1515.43 | 84 | 201 | 192 | $289-$981
ELMo | P100x3 | 517.66 | 336 | 275 | 262 | $433-$1472
BERT base | V100x64 | 12,041.51 | 79 | 1507 | 1438 | $3751-$12,571
BERT base | TPUv2x16 | n/a | 96 | n/a | n/a | $2074-$6912
NAS | P100x8 | 1515.43 | 274,120 | 656,347 | 626,155 | $942,973-$3,201,722
NAS | TPUv2x1 | n/a | 32,623 | n/a | n/a | $44,055-$146,848
GPT-2 | TPUv3x32 | n/a | 168 | n/a | n/a | $12,902-$43,008

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD). [7] Power and carbon footprint are omitted for TPUs (marked n/a) due to lack of public information on power draw for this hardware.
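The "Cloud compute cost" column can be approximately reproduced from the reported hours and the hourly rates quoted in footnote 7. The sketch below is illustrative: treating the P100 rates as exactly $0.43/hr (pre-emptible) and $1.46/hr (on-demand) is an assumption consistent with the ranges in the footnote.

```python
# Sketch of the cloud-compute cost bounds reported in Table 3.
def cloud_cost_bounds(wall_clock_hours, num_devices, preemptible_rate, on_demand_rate):
    """Lower bound uses pre-emptible pricing, upper bound uses on-demand pricing."""
    device_hours = wall_clock_hours * num_devices
    return device_hours * preemptible_rate, device_hours * on_demand_rate

# Transformer big: 84 hours on 8 P100 GPUs.
low, high = cloud_cost_bounds(84, 8, preemptible_rate=0.43, on_demand_rate=1.46)
print(f"${low:.0f} - ${high:.0f}")  # roughly $289 - $981, matching the Table 3 row
```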
4 Experimental results

4.1 Cost of training

Table 3 lists CO2 emissions and estimated cost of training the models described in §2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.

Models | Hours | Cloud compute (USD) | Electricity (USD)
1 | 120 | $52-$175 | $5
24 | 2880 | $1238-$4205 | $118
4789 | 239,942 | $103k-$350k | $9870

Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model, (2) a single tune and (3) all models trained during R&D.

4.2 Cost of development: Case study

To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP.

Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs. [8]

The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model. [9] We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.

[8] We approximate cloud compute cost using P100 pricing.
[9] Based on average U.S. cost of electricity of $0.12/kWh.
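The project totals above can be derived from raw job logs with a few lines of aggregation. The log format, the stand-in durations, and the hourly rates below are hypothetical; plugging in the paper's total of 239,942 GPU-hours with the same rates gives roughly $103k-$350k, matching the last row of Table 4.

```python
# Hypothetical sketch: aggregate R&D cost from a list of training-job durations.
job_hours = [0.05, 52.0, 216.0]  # stand-in for the 4789 real job durations (hours)

total_gpu_hours = sum(job_hours)
gpu_years = total_gpu_hours / (24 * 365)

# Pre-emptible vs. on-demand P100 rates, per footnote 7 (assumed exact values).
PRICE_LOW, PRICE_HIGH = 0.43, 1.46  # USD per GPU-hour
cloud_low = total_gpu_hours * PRICE_LOW
cloud_high = total_gpu_hours * PRICE_HIGH

print(f"{total_gpu_hours:.0f} GPU-hours ({gpu_years:.2f} GPU-years), "
      f"cloud cost ${cloud_low:,.0f}-${cloud_high:,.0f}")
```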
5 Conclusions

Authors should report training time and sensitivity to hyperparameters.

Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as re-training on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.

Academic researchers need equitable access to computation resources.

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute. Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.

While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for non-profit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.

Researchers should prioritize computationally efficient hardware and algorithms.

We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist, [10] they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP.

[10] For example, the Hyperopt Python library.
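A minimal random-search loop in the spirit of Bergstra and Bengio (2012) is sketched below. `train_and_evaluate`, the search space, and the budget are hypothetical stand-ins for a real tuning setup; libraries such as Hyperopt provide Bayesian (TPE) alternatives for the same loop.

```python
# Random hyperparameter search as an alternative to exhaustive grid search.
import math
import random

def train_and_evaluate(lr, dropout):
    # Placeholder objective; replace with an actual training/validation run.
    return (math.log10(lr) + 3) ** 2 + (dropout - 0.2) ** 2

def random_search(budget=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        params = {
            "lr": 10 ** rng.uniform(-5, -1),   # log-uniform learning rate
            "dropout": rng.uniform(0.0, 0.5),
        }
        score = train_and_evaluate(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

print(random_search())
```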
Acknowledgements

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References

Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554.

Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.

Gary Cook, Jude Lee, Tamina Tsai, Ada Kong, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.

Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.

Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom), pages 477-484.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959.

David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).

View File

@ -0,0 +1,793 @@
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 1381
Finite-Element Neural Networks for Solving
Differential Equations
Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE
Abstract—The solution of partial differential equations (PDE)
arises in a wide variety of engineering problems. Solutions to most
practical problems use numerical analysis techniques such as fi-
nite-element or finite-difference methods. The drawbacks of these
approaches include computational costs associated with the mod-
eling of complex geometries. This paper proposes a finite-element
neural network (FENN) obtained by embedding a finite-element
model in a neural network architecture that enables fast and ac-
curate solution of the forward problem. Results of applying the
FENN to severalsimpleelectromagnetic forward and inverseprob-
lems are presented. Initial results indicate that the FENN perfor-
mance as a forward model is comparable to that of the conven-
tional finite-element method (FEM). The FENN can also be used
in an iterative approach to solve inverse problems associated with the PDE. Results showing the ability of the FENN to solve the inverse problem given the measured signal are also presented. The parallel nature of the FENN also makes it an attractive solution for parallel implementation in hardware and software.

Index Terms—Finite-element method (FEM), finite-element neural network (FENN), inverse problems.

Manuscript received January 17, 2004; revised April 2, 2005. The authors are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 USA (e-mail: rpradeep@egr.msu.edu; udpal@egr.msu.edu; udpa@egr.msu.edu). Digital Object Identifier 10.1109/TNN.2005.857945

Fig. 1. Iterative inversion method for solving inverse problems.

I. INTRODUCTION

SOLUTIONS of differential equations arise in a wide variety of engineering applications in electromagnetics, signal processing, computational fluid dynamics, etc. These equations are typically solved using either analytical or numerical methods. Analytical solution methods are however feasible only for simple geometries, which limits their applicability. In most practical problems with complex boundary conditions, numerical analysis methods are required in order to obtain a reasonable solution. An example is the solution of Maxwell's equations in electromagnetics. Solutions to Maxwell's equations are used in a variety of applications for calculating the interaction of electromagnetic (EM) fields with different types of media.

Very often, the solution to differential equations is necessary for solving the corresponding inverse problems. Inverse problems in general are ill-posed, lacking continuous dependence of the measurements on the input. This has resulted in the development of a variety of solution techniques ranging from simple calibration procedures to other direct (analytical) and iterative approaches [1]. Iterative methods typically employ a forward model that simulates the underlying physical process (Fig. 1) [2]. An initial estimate of the solution of the inverse problem (represented in Fig. 1) is applied to the forward model, resulting in the corresponding solution to the forward problem. The model output is compared to the measurement using a cost function. If the cost is less than a tolerance, the estimate is used as the desired solution. If not, the estimate is updated to minimize the cost function.

Although finite-element methods (FEMs) [3], [4] are extremely popular for solving differential equations, their major drawback is computational complexity. This problem becomes more acute when three-dimensional (3-D) finite-element models are used in an iterative algorithm for solving the inverse problem. Recently, several authors have suggested the use of neural networks (MLP or RBF networks [5]) for solving differential equations [6]-[9]. In these techniques, a neural network is trained using a large database containing the input data and the solution of the differential equation. The neural network during generalization learns the mapping corresponding to the PDE. Alternatively, in [10], the solution to a differential equation is written as a constant term and an adjustable term with parameters that need to be determined. A neural network is used to determine the optimal values of the parameters. This approach is applicable only to problems with regular boundaries. An extension of the approach to problems with irregular boundaries is given in [11]. Other neural network based differential equation solvers use multilayer perceptron networks or variations on the MLP to approximate the unknown function in a PDE [12]-[14]. A combination of the PDE and boundary conditions is used to construct an objective function that is minimized during the training process.

A major limitation of these approaches is that the network architecture is selected somewhat arbitrarily. A second drawback is that the performance of the neural networks depends on the data used in training and testing. As long as the test data is similar to the training data, the network can interpolate between the training data points to obtain a reasonable prediction. However, when the test signal is no longer similar to the training data, the
network is forced to extrapolate and the performance degrades. Section V draws conclusions from the results and presents
One way around this difficulty is to ensure that the training data- ideas for future work.
base has a diverse set of signals. However, this is difficult to
ensure in practice. Alternatively, we have to design neural net- II. T HE FENN
works that are capable of extrapolation. Extrapolation methods This section briefly describes the FEM and proposes its refor-are discussed extensively in literature [15][18], but the design mulation into a parallel neural network structure. Details aboutof an extrapolation neural network involves several issues par- the FEM can be found in [3] and [4].ticularly for ensuring that the error in the network prediction
stays within reasonable bounds during the extrapolation proce- A. The FEMdure. Consider a typical boundary value problem with the gov-An ideal solution to this problem would be to combine the erning differential equationpower of numerical models with the computational speed of
neural networks, i.e., to embed a numerical model in a neural (1)network structure. One suchfinite-element neural network
(FENN) formulation has been reported by Takeuchi and Kosugi where is a differential operator, is the applied source or
[19]. This approach, based on error minimization, derives the forcing function, and is the unknown quantity. This differen-
neural network using the energy functional resulting from the tial equation can be solved in conjunction with boundary condi-
finite-element formulation. Other reports of FENN combina- tionson theboundary enclosingthedomain .Thevariational
tions are either similar to the Takeuchi method [20], [21] or use formulation used infinite-element analysis determines the un-
Hopfield neural networks to solve the forward problem [22], known by minimizing the functional [3], [4]
[23]. Kalkkuhlet al.[24] provide a description of a FEM-based
approach to NARX modeling that may be interpreted both as (2)
a local model network, as well as a single layer feedforward
network. A slightly different approach to merging numerical with respect to the trial function . The minimization procedure
methods and neural networks is given in [25], where thefi- starts by dividing into small subdomains called elements
nite-difference time domain (FDTD) method is cast in a neural (Fig. 2) and representing in each element by means of basis
network framework for the purpose of solving electromagnetic functions defined over the element
forward problems. The related problem of mesh generation
infinite-element models has also been tackled using neural (3)networks (for instance, [26]). Generally, these networks are
designed to solve the forward problem, and must be modified
to solve inverse problems. where is the unknown solution in element , is the basis
This paper proposes a new approach that embeds afinite-ele- function associated with node in element , is the value
ment model commonly used in the solution of differential equa- of the unknown quantity at node and is the total number of
tions in a neural network. The network, called the FENN, can nodes associated with element . In general, the basis functions
solve the forward problem and can also be used in an itera- (also referred to as interpolation functions or shape functions)
tive algorithm to solve inverse problems. The primary advan- can be linear, quadratic, or of higher order. Typically,finite-el-
tage of this approach is that the FEM is represented in a parallel ement models use either linear or polynomial spline basis func-
form. Thus, it has the potential to alleviate the computational tions.
cost associated with using the FEM in an iterative algorithm The functional within an element is expressed as
for solving inverse problems. More importantly, the FENN does
not need any training, and the computation of the weights is (4)
a one-time process. The proposed approach is also different in
that the neural network architecture developed can be used to
solve the forward and inverse problems. The structure of the By substituting (3) in (4), we obtain the discrete version of the
neural network is also simpler than those reported in the litera- functional within each element
ture, making it easier to implement in parallel in both hardware (5)and software.
The rest of this paper is organized as follows. Section II where is the transpose of a matrix, is the ele-briefly describes the FEM, and derives the proposed FENN. In mental matrix with elements this paper, we focus on the problem of solving typical equa-
tions encountered in electromagnetic nondestructive evaluation (6)(NDE). However, the same concepts can be easily applied
to solve differential equations encountered in otherfields.
Sections III, IV and V present the application of the FENN and is an vector with elements
to solving forward and inverse problems, along with initial
results. A discussion of the advantages and disadvantages of (7)
the proposed FENN architecture is given in Section IV. Finally,
Combining the values in (5) for each of the elements
(8)
where is the global matrix derived from the terms
of the elemental matrices for different elements, and is the
total number of nodes. , also called the stiffness matrix, is a
sparse, banded matrix. Equation (8) is the discrete version of
the functional and can be minimized with respect to the nodal
parameters by taking the derivative of with respect to and
setting it equal to zero, which results in the matrix equation

    (9)

Fig. 2. (a) Schematic representation of domain and boundary. (b) Sample FEM mesh for the domain.
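The assembly and solution step summarized in (8)-(9) can be illustrated with a small self-contained sketch. The explicit symbols were lost in extraction, so the names below (K for the global stiffness matrix, phi for the nodal values, b for the source vector), the 1-D model problem -d/dx(alpha du/dx) = f with linear elements, and the constant source are assumptions chosen for brevity rather than the specific problem used in the paper.

```python
# Minimal 1-D illustration of element-matrix assembly and the global system K*phi = b,
# with Dirichlet conditions imposed by eliminating the constrained rows and columns.
import numpy as np

def assemble_1d(alpha, f, n_elems):
    n_nodes = n_elems + 1
    h = 1.0 / n_elems
    K = np.zeros((n_nodes, n_nodes))
    b = np.zeros(n_nodes)
    for e in range(n_elems):
        # Element stiffness for linear basis functions: (alpha_e / h) * [[1, -1], [-1, 1]]
        ke = (alpha[e] / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
        fe = f * h / 2.0 * np.ones(2)  # consistent load vector for a constant source
        nodes = [e, e + 1]
        K[np.ix_(nodes, nodes)] += ke
        b[nodes] += fe
    return K, b

def solve_dirichlet(K, b, fixed):
    """Eliminate rows/columns of constrained nodes, as the text describes for (9)."""
    free = [i for i in range(len(b)) if i not in fixed]
    rhs = b[free] - K[np.ix_(free, list(fixed))] @ np.array(list(fixed.values()))
    phi = np.zeros(len(b))
    phi[free] = np.linalg.solve(K[np.ix_(free, free)], rhs)
    for i, v in fixed.items():
        phi[i] = v
    return phi

K, b = assemble_1d(alpha=np.ones(10), f=1.0, n_elems=10)
print(solve_dirichlet(K, b, fixed={0: 0.0, 10: 0.0}))
```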
Boundary conditions for these problems are usually of two
types: natural boundary conditions and essential boundary
conditions. Essential boundary conditions (also referred to as
Dirichlet boundary conditions) impose constraints on the value
of the unknown at several nodes. Natural boundary condi-
tions (of which Neumann boundary conditions are a special
case) impose constraints on the change in across a boundary.
Dirichlet boundary conditions are imposed on the functional
minimization (9), by deleting the rows and columns of the
matrix corresponding to the nodes on the Dirichlet boundary
and modifying in (9). Fig. 3. FEM domain discretization using two elements and four nodes.
Natural boundary conditions are applied in the FEM by
adding an additional term to the functional. These boundary This process ensures that natural boundary conditions are im-conditions are then incorporated into the functional and are plicitlyandautomatically satisfiedduring theFEMsolutionpro-satisfied automatically during the solution procedure. As an cedure.example, consider the natural boundary condition represented
by the following equation [3] B. The FENN
on (10) This section describes how thefinite-element model can be
converted intoa parallel network form. Wefocus on solving typ-
where represents the Neumann boundary, is its outward ical inverse problems arising in electromagnetic NDE, but the
normal unit vector, is some constant, and , , and are basicideaisapplicabletootherareas aswell.NDEinverseprob-
known parameters associated with the boundary. Assuming that lems can be formulated as the problem offinding the material
the boundary is made up of segments, we can define properties (such as the conductivity or the permeability) within
boundary matrices and with elements the domain of the problem. Since the domain is discretized in
the FEM method by a large number of elements, the problem
can be posed as one offinding the material properties in each
of these elements. These properties are usually embedded in the
differential operator , or equivalently, in the global matrix .
Thus, in order to be able to iteratively estimate these properties
from the measurements, the material properties need to be sep-
arated out from . This separation is easier to achieve at the
element matrix level. For nodes and in element
(11)
where are basis functions defined over segment and is
the length of the segment. The elements of are added to the
elementsof that correspond tothe nodeson the boundary .
Similarly, the elements of are added to the corresponding
elements of . The global matrix (9) is thus modified as follows
before solving for (13)
where is the parameter representing the material property (12) in element and represents the differential operator at the
Fig. 4. FENN.
element level without embedded in it. Substituting (13) into neurons, corresponding to the members of the global ma-
the functional, we get trix . The output of each group of hidden layer neurons is the
corresponding row vector of . The weights from the input to
the hidden layer are set to the appropriate values of . Each(14) neuron in the hidden layer acts as a summation unit, (equivalent
toasummationfollowedbyalinearactivationfunction[5]).The
If we define outputs of the hidden layer neurons are the elements of the
global matrix as given in (15).
(15) Each group of hidden neurons is connected to one output
neuron (giving a total of output neurons) by a set of weights
, with each element of representing the nodal values .where Note that the set of weights between thefirst group of hidden
neurons and thefirst output neuron are the same as the set of(16)else weights between the second group of hidden neurons and the
second output neuron (as well as between successive groups
of hidden neurons and the corresponding output neuron). Each
output neuron is also a summation unit followed by a linear ac-
tivation function, and the output of each neuron is equal to :
(18)
(17)
where the second part of (18) is obtained by using (15). As an
Equation (17) expresses the functional explicitly in terms of . example, the FENN architecture for a two-element, four-node
The assumption that is constant within each element is im- FEM mesh (Fig. 3) is shown in Fig. 4. In this
plicit in this expression. This assumption is usually satisfied in case, the FENN has two input neurons, 16 hidden layer neurons
problems in NDE where each element in the FEM mesh is de- and four output neurons. Thefigure illustrates the grouping of
fined within the confines of a domain, and at no time does a the hidden layer neurons, as well as the similarity inherent in
single element cross domain boundaries. Furthermore, each el- the weights that connect each group of hidden layer neurons
ement is small enough that minor variations in within an el- to the corresponding output neuron. To simplify thefigure, the
ement may be ignored. Equation (17) can be easily converted weights between the network input and hidden layer neurons
into a parallel network form. The neural network comprises an are depicted by means of vectors (for
input, output and hidden layer. In the general case with el- , 2, 3, 4 and , 2), where the individual weight values
ements and nodes in the FEM mesh, the input layer with are defined as in (16).
network inputs takes the values in each element as input. 1) Boundary Conditions in the FENN: Note that the ele-
The hidden layer has neurons 1 arranged in groups of ments of and in (11) do not depend on the material prop-
1 erties . and need to be added appropriately to the global In this paper, we use the term“neurons”in the FENN (in the hidden and
output layers) to avoid confusion with the nodes in afinite-element mesh. matrix and the source vector as shown in (12). Equation RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1385
Fig. 5. Geometry of mesh for 1-D FEM.
Fig. 6. Flowchart (with example) for designing the FENN for a general PDE.
(12) thus implies that natural boundary conditions can be ap- layer neurons. These weights will be referred to as the clamped
plied in the FENN as bias inputs to the hidden layer neurons weights, while the remaining weights will be referred to as the
that are a part of the boundary, and the corresponding output free weights. An example of these weights is presented later.
neurons. Dirichlet boundary conditions are applied by clamping The FENN architecture was derived without consideration of
the corresponding weights between the hidden layer and output the dimensionality of the problem at hand, and thus can be used
for 1-, 2-, 3-, or higher dimensional problems. The number of
nodes and elements in the FEM mesh dictates the number of
neurons in the different layers. The weights between the input
and hidden layer change depending on node-element connec-
tivity information.
The major drawback of the FENN is the number of neurons
and weights necessary. However, the memory requirements can
be reduced considerably, since most of the weights between the
input and hidden layer are zero. These weights, and the corre-
sponding connections, can be discarded. Similarly, most of the Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b)
elements of the matrix are also zero ( is a banded ma- Problem description using symmetry considerations.
trix). The corresponding neurons in the hidden layer can also
be discarded, reducing memory and computation requirements The network implementation of (23) can be derived as fol-
considerably. Furthermore, the weights between each group of lows. If and values at each element are the inputs to the
hidden layer neurons and the output layer are the same . network, , , , and form the weights
Weight-sharing approaches can be used here to further reduce between the input and hidden layers. The network thus uses
the storage requirements. inputneuronsand hiddenneurons.Thevaluesof ateachof
thenodesareassigned asweightsbetweenthehidden andoutput
C. A 1-D Example layers, and the source is the desired output of this network
Consider the 1-D equation (corresponding to the output neurons). Dirichlet boundary
conditions on are applied as explained earlier.
(19) D. General Case
Fig. 6 shows aflowchart of the general scheme for convertingboundary conditions on the boundary defined by . a differential equation into the FENN structure. An exampleand are constants depending on the material and is the in two dimensions is also provided next to theflowchart. Weapplied source. Laplaces equation and Poissons equation are start with the differential equation and the boundary conditionsspecial cases of this equation. The FENN formulation for this and formulate the FEM using the variational method. This in-problem starts by discretizing the domain of interest with el- volves discretizing the domain of interest with elements andements and nodes. In one dimension, each element is defined nodes, selecting basis functions, writing the functional forby two nodes (Fig. 5). Define basis functions and over each element and obtaining the element matrices and the sourceeach element and let is the value of on node in element vector. The example presented uses the FEM mesh shown in. An example of the basis functions is shown in Fig. 5. Fig. 3, with elements, and nodes, and linearFor these basis functions, i.e., basis functions. The unknown solution to the differential equa-
tion is represented by its values at each of the nodes in the(20) finite-element mesh . The element matrices are then
separated into two parts, with one part dependent on the mate-the element matrices are given by [3] rial properties and while the other is independent of them.
The FENN is then designed to have input neurons,
hidden neurons, and output neurons, where is the number
of material property parameters. In the example under consid-
eration, , since we have two material property parameters(21) ( and ). Thefirst group of input neurons takes in the
values while the second group takes in the values in each ele-
ment. The weights from the input to the hidden layer are set to
the appropriate values of . In the example, since nodes 1, 2,
(22) and 3 are part of element 1 (see Fig. 3), the weights from thefirst
input node to thefirst group of four neurons in the hidden
Here, is the length of element . The global matrix is then layer are given by
constructed by selectively adding the element matrices based
on the nodes that form an element. Specifically, is a sparse
tridiagonal matrix, and its nonzero elements are given by (24)
The last weight is zero since node 4 is not a part of element 1.
Each group of hidden neurons is connected to one output
neuron (giving a total of output neurons) by a set of weights
, with each element of representing the nodal values . The
(23) output of each neuron in the output layer is equal to .
Fig. 8. Forward problem solutions for shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error
between (a) and (b). Thex- andy-axes show the nodes in the FEM discretization of the domain, and thez-axis in (c) shows the error at each of these nodes in volts.
III. FORWARD AND INVERSE PROBLEM FORMULATION USING FENN

The FENN architecture and algorithm lends itself to solving both the forward and inverse problems. The forward problem involves determining the hidden-to-output weights (the nodal values) given the material parameters and the applied source, while the inverse problem involves determining the material parameters given the nodal values and the source. Any optimization approach can be used to solve both these problems. Suppose we define the error between the output of the FENN and the desired output as

    (26)

Then, for a gradient-based approach, the gradient of the error with respect to the free hidden-layer weights is given by

    (27)

Equation (27) can be used to solve the forward problem. Similarly, to solve the inverse problem, the gradients of the error with respect to the material parameters (the inputs of the FENN) are necessary, and are given by

    (28)

    (29)
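The explicit forms of (26)-(29) did not survive extraction. The sketch below therefore assumes the natural squared-error cost for a network whose output is K(alpha)·phi with K(alpha) = sum_e alpha_e W_e, which matches the construction described in Section II; the names W_elems, alpha, phi and b, as well as the learning rates, are assumptions.

```python
# Hedged sketch of gradient-based forward and inverse iterations for a FENN-like model.
import numpy as np

def fenn_output(W_elems, alpha, phi):
    K = sum(a * W for a, W in zip(alpha, W_elems))  # hidden layer: entries of the global matrix
    return K, K @ phi                               # output layer: K * phi

def forward_solve(W_elems, alpha, b, phi0, lr=0.1, steps=2000):
    """Forward problem: fit the nodal values phi for known material parameters alpha."""
    phi = phi0.copy()
    for _ in range(steps):
        K, y = fenn_output(W_elems, alpha, phi)
        residual = b - y
        phi += lr * K.T @ residual                  # negative gradient of 0.5*||b - K phi||^2
    return phi

def inverse_step(W_elems, alpha, phi, b, lr=0.1):
    """Inverse problem: one gradient step on the material parameters alpha."""
    K, y = fenn_output(W_elems, alpha, phi)
    residual = b - y
    grads = np.array([residual @ (W @ phi) for W in W_elems])  # -dE/d(alpha_e)
    return alpha + lr * grads
```

In a fuller implementation the clamped (Dirichlet) weights described in Section II-B would be held fixed during these updates, and the plain gradient steps could be replaced by conjugate-gradient updates, as the paper itself suggests for future work.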
TABLE I
SUMMARY OF PERFORMANCE OF THE FENN A LGORITHM FOR VARIOUS PDE S
For the forward problem, such an approach is equivalent to the Dirichlet boundary, with on the microstrip and on
iterative approaches used to solve for the unknown nodal values the outer boundary [Fig. 7(b)]. Finally, there is no source term
in the FEM [4]. in this example (the source term would correspond to a charge
distribution in the domain of interest), i.e., . In this ex-
IV. R ESULTS ample, we assume that volts and . Further, we
assume that the domain of interest is .A. Forward Model Results The solution to the forward problem is presented in Fig. 8,
The FENN was tested using both 1- and 2-D versions of with the FEM solution using 11 nodes in each direction shown
Poissons equation in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b).
(30) Thesefigures show contours of constant potential. The error be-
tween the FEM and FENN solutions is presented in Fig. 8(c). As
where represents the material property, and is the applied seen from thefigure, the FENN is seen to match the FEM solu-
source. For instance, in electromagnetics may represent the tion accurately, with the peak error at any node on the order of
permittivity while represents the charge density. .
As thefirst example, consider the following 2-D equation Several other examples were also used to test the FENN and
the results are summarized in Table I. Column 1 shows the
(31) PDE used to evaluate the FENN performance, while column 2
shows the boundary conditions used. The analytic solution to
with boundary conditions the problem is indicated in Column 3. The FENN structure and
on (32) the number of iterations for convergence using a gradient de-
scent approach are indicated in Columns 4 and 5, respectively.
and The FENN structure, as explained earlier, has inputs,
hidden neurons and output neurons, where and are the
on (33) number of elements and nodes in the FEM mesh, respectively,
and is the number of hidden neurons, and corresponds to the
This is the governing equation for the shielded microstrip trans- number of nonzero elements in the FEM global matrix . Fi-
mission line problem shown in Fig. 7. The forward problem nally, Columns 6 and 7 present the sum-squared error (SSE) and
computes the electric potential due to the shielded microstrip the maximum error in the solution, respectively, where the er-
shown in Fig. 7(a). The potentials are zero on the shielding con- rors are computed with respect to the analytical solution. These
ductor.Sincethegeometryissymmetric,wecansolvetheequiv- results indicate that the FENN is capable of accurately deter-
alent problem shown in Fig. 7(b), by applying the homogeneous mining the potential . One advantage of the FENN approach
Neumann condition on the plane of symmetry. The inner con- is that the computation of the input-hidden layer weights is a
ductor (microstrip) is held at a constant potential of volts. one-time process, as long as the differential equation does not
Finally, we also assume that the material inside the shielding change. The only changes necessary to solve the different prob-
conductor has a permittivity , where K is a constant. The lems are changes in the input and the desired output .
permittivity in this case corresponds to the material property .
Specifically, and . The homogeneous Neu- B. Inverse Model Results
mann boundary condition is equivalent to setting . TheFENNwasalsousedtosolveseveralsimpleinverseprob-
The microstrip and the shielding conductor correspond to the lems based on (30). In all cases, the objective was to determine
Fig. 9. FENN inversion results for Poissons equation with initial solutions (a) = x . (b) =1+ x .
the value of and for given values of and . Thefirst ex- In order to obtain a unique solution, we need to constrain the
ample is a 1-D problem that involves determining given value of at the boundary as well. Consider the same differen-
and , for the differential equation tial equation as (34), but with and specified as follows:
(34) and
with boundary conditions and . The analyt- (36)
ical solution to this inverse problem is The analytical solution for this equation is .To
and (35) solve this problem, we set and clamp the value of at
As seen from (35), the problem has an infinite number of solu- and as follows: , .
tions and we expect the solution procedure to converge to one The results of the constrained inversion obtained using 11
of these solutions depending on the initial value. nodes and 10 elements in the correspondingfinite-element mesh
Fig. 9(a) and (b) shows two solutions to this inverse problem are shown in Fig. 10. Fig. 10(a) shows the comparison between
for two different initializations (shown using triangles). In both the analytical solution (solid line with squares) and the FENN
cases, the FENN solution (in stars) is seen to match the analyt- result (solid line with stars). The initial value of is shown in
ical solution (squares). The SSE in both cases was on the order thefigure as a dashed line. Fig. 10(b) shows the comparison
of . between the actual and desired forcing function at the FENN 1390 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005
Fig. 10. Constrained inversion result with eleven nodes. (a) Comparison of analytic and simulation results for . (b) Comparison of actual and desired NN outputs.
output. This result indicates that the SSE in the forcing function, weight structure that allows both the forward and inverse prob-
as well as the SSE in the inversion result, is fairly large (0.0148 lemstobesolvedusingsimplegradient-basedalgorithms.Initial
and 0.0197, respectively). The reason for this was traced back results indicate that the proposed FENN algorithm is capable of
to the mesh discretization. Fig. 11 shows the SSE in the output accurately solving both the forward and inverse problems. In
of the FENN and the SSE in the inverse problem solution as a addition, the forward problem solution from the FENN is seen
function of FEM discretization. It is seen that increasing the dis- to exactly match the FEM solution, indicating that the FENN
cretization significantly improves the solution. Similar results represents thefinite-element model exactly in a parallel config-
were observed for other problems. uration.
The major advantage of the FENN is that it represents the
finite-element model in a parallel form, enabling parallel imple-
V. D ISCUSSION AND CONCLUSION mentation in either hardware or software. Further, computing
gradients in the FENN is very simple. This is an advantage in
The FENN is closely related to thefinite-element model used solving bothforward and inverse problems using gradient-based
to solve differential equations. The FENN architecture has a methods. The gradients can also be computed in parallel and
Fig. 11. SSE in FENN output and inversion results as a function of discretization.
the lack of nonlinearities in the neuron activation functions [6] C. A. Jensenet al.,“Inversion of feedforward neural networks: algo-
makes the computation of gradients simpler. A major advantage rithms and applications,”Proc. IEEE, vol. 87, no. 9, pp. 15361549,
of this approach for solving inverse problems is that it avoids 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa,“Neural networkalgorithm for elec-
inverting the global matrix in each iteration. The FENN also tromagnetic NDE signal inversion,”inENDE 2000, Budapest, Hungary,
does not require any training, since most of its weights can be Jun. 2000.
computed in advance and stored. The weights depend on the [8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr.,
and A. P. Ewing,“Automation of SQUID nondestructive evaluation of
governing differential equation and its associated boundary steel plates by neural networks,”IEEE Trans. Appl. Supercond., vol. 9,
conditions, and as long as these two factors do not change, no. 2, pp. 34753478, 1999.
the weights do not change. This is especially an advantage [9] W.Qing, S. Xueqin,Y.Qingxin,and Y.Weili,“Usingwaveletneural net-
works for the optimal design of electromagnetic devices,”IEEE Trans.
in solving inverse problems in electromagnetic NDE. This Magn., vol. 33, no. 2, pp. 19281930, 1997.
approach also reduces the computational effort associated with [10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis,“Artificial neural networks
the network. for solving ordinary and partial differential equations,”IEEE Trans.
Neural Netw., vol. 9, no. 5, pp. 9871000, 1998.
Future work will concentrate on applying the FENN to 3-D [11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou,“Neural-network
electromagnetic NDE problems. The robustness of the approach methods for boundary value problems with irregular boundaries,”IEEE
will also be tested, since the ability of these approaches to in- Trans. Neural Netw., vol. 11, no. 5, pp. 10411049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez,“Neural network
vert practical noisy measurements is important. Furthermore, differential equation and plasma equilibrium solver,”Phys. Rev. Lett.,
the use of better optimization algorithms, like conjugate gra- vol. 75, no. 20, pp. 35943597, 1995.
dient methods, is expected to improve the solution speed. In ad- [13] M. W. M. G. Dissanayake and N. Phan-Thien,“Neural-network-based
approximations for solving partial differential equations,”Commun.
dition, parallel implementation of the FENN in both hardware Numer. Meth. Eng., vol. 10, pp. 195201, 1994.
and software is under investigation. The approach described in [14] R. Masuoka,“Neural networks learning differential data,”IEICE Trans.
this paper is very general in that it can be applied to a variety Inform. Syst., vol. E83-D, no. 6, pp. 12911300, 2000.
[15] D.C.Youla,“Generalizedimagerestorationbythemethodofalternating
of inverse problems infields other than electromagnetic NDE. orthogonal projections,”IEEE Trans. Circuits Syst., vol. CAS-25, no. 9,
Some of these other applications will also be investigated to pp. 694702, 1978.
show the general nature of the proposed method. [16] D. C. Youla and H. Webb,“Image restoration by the method of convex
projections: part I—theory,”IEEE Trans. Med. Imag., vol. MI-1, no. 2,
pp. 8194, 1982.
REFERENCES [17] A. Lent and H. Tuy,“An iterative method for the extrapolation of band-
limitedfunctions,”J.Math.AnalysisandApplicat.,vol.83, pp.554565,
[1] L. Udpa and S. S. Udpa,“Application of signal processing and pattern 1981.
recognition techniques to inverse problems in NDE,”Int. J. Appl. Elec- [18] W. Chen,“A new extrapolation algorithm for band-limited signals using
tromagn. Mechan., vol. 8, pp. 99117, 1997. the regularization method,”IEEE Trans. Signal Process., vol. 41, no. 3,
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. pp. 10481060, 1993.
Sacks,“Iterative algorithms for electromagnetic NDE signal inversion,” [19] J. Takeuchi and Y. Kosugi,“Neural network representation of thefinite
inENDE 97, Reggio Calabria, Italy, Sep. 1416, 1997. element method,”Neural Netw., vol. 7, no. 2, pp. 389395, 1994.
[3] J. Jin,The Finite Element Method in Electromagnetics. New York: [20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady,“Artificial neural net-
Wiley, 1993. work application for material evaluation by electromagnetic methods,”
[4] P. Zhou,Numerical Analysis of Electromagnetic Fields. Berlin, Ger- inProc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 40274032.
many: Springer-Verlag, 1993. [21] G. Xu, G. Littlefair, R. Penson, and R. Callan,“Application of FE-based
[5] S. Haykin,Neural Networks: A Comprehensive Foundation. Upper neural networks to dynamic problems,”inProc. Int. Conf. Neural Infor-
Saddle River, NJ: Prentice-Hall, 1994. mation Processing, vol. 3, 1999, pp. 1039-1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu,“Finite element anal- Lalita Udpa (S84M86SM96) received the
ysis-based Hopfield neural network model for solving nonlinear elec- Ph.D. degree in electrical engineering from Col-
tromagneticfield problems,”inProc. Int. Joint Conf. Neural Networks, orado State University, Fort Collins, in 1986.
vol. 6, 1999, pp. 43994403. She is currently a Professor with the Department
[23] H. Lee and I. S. Kang,“Neural algorithm for solving differential equa- of Electrical and Computer Engineering, Michigan
tions,”J. Computat. Phys., vol. 91, pp. 110131, 1990. State University, East Lansing. She works primarily
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz,“FEM-based neural-network in the broad areas of nondestructive evaluation,
approach to nonlinear modeling with application to longitudinal vehicle signal processing, and biomedical applications. Her
dynamics control,”IEEE Trans. Neural Netw., vol. 10, no. 4, pp. research interests include various aspects of NDE,
885897, 1999. such as development of computational models for
[25] R. K. Mishra and P. S. Hall,“NFDTD concept,”IEEE Trans. Neural the forward problem in NDE, signal and image pro-
Netw., vol. 16, no. 2, pp. 484490, 2005. cessing, pattern recognition and neural networks, and development of solution
[26] D. G. Triantafyllidis and D. P. Labridis,“Afinite-element mesh gener- techniques for inverse problems. Her current projects includefinite-element
ator based on growing neural networks,”IEEE Trans. Neural Netw., vol. modeling of electromagnetic NDE phenomena, application of neural network
13, no. 6, pp. 14821496, 2002. and signal processing algorithms to NDE data, and development of image
processing techniques for the analysis of NDE and biomedical images.
Dr. Udpa is a Member of Eta Kappa Nu and Sigma Xi.
Satish S. Udpa(S82M82SM91F03) received
the B.Tech. degree in 1975 and the Post Graduate
Diplomainelectricalengineeringin1977fromJ.N.T.
University, Hyderabad, India. He received the M.S.
degree in 1980 and the Ph.D. degree in electrical en-
gineering in 1983, both from Colorado State Univer-
sity, Fort Collins.
He has been with Michigan State University, East
Lansing, since 2001 and is currently Acting Dean for
the College of Engineering and a Professor with the
Electrical and Computer Engineering Department.
Prior to joining Michigan State, he was a Professor with Iowa State University,
Ames, from 1990 to 2001 and was associated with the Materials Assessment
Research Group. Prior to joining Iowa State, he was an Associate Professor
with the Department of Electrical Engineering at Colorado State University.
His research interests span the broad area of materials characterization and
nondestructive evaluation (NDE). Work done by him to date in the area includes
an extensive repertoire of forward models for simulating physical processes
underlying several inspection techniques. Coupled with careful experimental
Pradeep Ramuhalli (S92M02) received the work, such forward models can be used for designing new sensors, optimizing
B.Tech. degree from J.N.T. University, Hyderabad, test conditions, estimating the probability of detection, assessing designs for
India, in electronics and communications engi- inspectability and training inverse models for characterizing defects. He has
neering in 1995, and the M.S. and Ph.D. degrees in also been involved in the development of system-, as well as model-based,
electrical engineering from Iowa State University, inverse solutions for defect and material property characterization. His interests
Ames, in 1998 and 2002, respectively. have expanded in recent years to include the development of noninvasive
He is currently an Assistant Professor with the tools for clinical applications. Work done to date in thisfield includes the
Department of Electrical and Computer Engi- development of new electromagnetic-acoustic (EMAT) methods for detecting
neering, Michigan State University, East Lansing. single leg separation failures in artificial heart valves and microwave imaging
His research is in the general area of nondestruc- and ablation therapy systems. He and his research group have been engaged
tive evaluation and materials characterization. His in the design and development of high-performance instrumentation including
research interests include the application of signal and image processing acoustic microscopes and single and multifrequency eddy current NDE instru-
methods, pattern recognition and neural networks for nondestructive evaluation ments. These systems, as well as software packages embodying algorithms
applications, development of model-based solutions for inverse problems in developed by Udpa for defect classification and characterization, have been
NDE, and the development of information fusion algorithms for multimodal licensed to industry.
data fusion. He is a Fellow of the American Society for Nondestructive Testing (ASNT)
Dr. Ramuhalli is a Member of Phi Kappa Phi. and a Fellow of the Indian Society of Nondestructive Testing.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,399 @@
Learning Efficient Convolutional Networks through Network Slimming
Zhuang Liu 1 Jianguo Li 2 Zhiqiang Shen 3 Gao Huang 4 Shoumeng Yan 2 Changshui Zhang 1
1 CSAI, TNList, Tsinghua University 2 Intel Labs China 3 Fudan University 4 Cornell University
{liuzhuangthu, zhiqiangshen0214}@gmail.com,{jianguo.li, shoumeng.yan}@intel.com,
gh349@cornell.edu, zcs@mail.tsinghua.edu.cn
Abstract

The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.
1. Introduction Many works have been proposed to compress large
CNNs or directly learn more efficient CNN models for fast
In recent years, convolutional neural networks (CNNs) inference. These include low-rank approximation [7], net-
have become the dominant approach for a variety of com- work quantization [3, 12] and binarization [28, 6], weight
puter vision tasks, e.g., image classification [22], object pruning [12], dynamic inference [16], etc. However, most
detection [8], semantic segmentation [26]. Large-scale of these methods can only address one or two challenges
datasets, high-end modern GPUs and new network architec- mentioned above. Moreover, some of the techniques require
tures allow the development of unprecedented large CNN specially designed software/hardware accelerators for exe-
models. For instance, from AlexNet [22], VGGNet [31] and cution speedup [28, 6, 12].
GoogleNet [34] to ResNets [14], the ImageNet Classifica- Another direction to reduce the resource consumption of
tion Challenge winner models have evolved from 8 layers large CNNs is to sparsify the network. Sparsity can be im-
to more than 100 layers. posed on different level of structures [2, 37, 35, 29, 25],
This work was done when Zhuang Liu and Zhiqiang Shen were interns which yields considerable model-size compression and in-
at Intel Labs China. Jianguo Li is the corresponding author. ference speedup. However, these approaches generally re-
[Figure 1 diagram: the channels C_i1 ... C_in of the i-th conv-layer with their channel scaling factors (e.g., 1.170, 0.001, 0.290, 0.003, ..., 0.820) in the initial network, and the compact network obtained after pruning the channels with small factors.]
Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity
regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small
scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then
fine-tuned to achieve comparable (or even higher) accuracy as normally trained full network.
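In a modern framework this reuse of the batch normalization parameters needs no extra machinery. The snippet below is a minimal PyTorch illustration (PyTorch is our choice here; the authors' released implementation is in Torch), showing that each BatchNorm2d layer already holds one trainable scaling factor per channel, which is exactly the per-channel factor depicted in Figure 1:

import torch
import torch.nn as nn

# A minimal conv + BN block: the BN weight (gamma) is a vector with one entry
# per output channel and multiplies the normalized activation of that channel,
# so it can serve directly as the channel scaling factor of Figure 1.
block = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)

gamma = block[1].weight      # shape: [32], one scaling factor per channel
print(gamma.shape)           # torch.Size([32])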
In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates channel-level pruning at the following step. The additional regularization term rarely hurts the performance; in fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated by the subsequent fine-tuning of the pruned network. After pruning, the resulting narrower network is much more compact in terms of model size, run-time memory, and computing operations compared to the initial wide network. The above process can be repeated several times, yielding a multi-pass network slimming scheme which leads to an even more compact network.
Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20x model-size compression and 5x reduction in computing operations relative to the original ones, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storing format or computing operations.
2. Related Work
In this section, we discuss related work from five aspects.
Low-rank Decomposition approximates the weight matrix in neural networks with a low-rank matrix using techniques like Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yielding roughly 3x model-size compression, however without notable speed acceleration, since computing operations in a CNN mainly come from the convolutional layers.
Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups and within each group the weight value is shared. In this way only the shared weights and hash indices need to be stored, thus a large amount of storage space can be saved. [12] uses an improved quantization technique in a deep compression pipeline and achieves 35x to 49x compression rates on AlexNet and VGGNet. However, these techniques can neither save run-time memory nor inference time, since during inference the shared weights need to be restored to their original positions.
[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {-1, 1} or {-1, 0, 1}). This yields a large amount of model-size saving, and significant speedup can also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation method usually comes with a moderate accuracy loss.
Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neural networks. The resulting network's weights are mostly zeros, thus the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) instead of the weights.
In [12], there is no guidance for sparsity during training. [32] overcomes this limitation by explicitly imposing a sparse constraint over each weight with additional gate variables, and achieves high compression rates by pruning connections with zero gate values. This method achieves a better compression rate than [12], but suffers from the same drawback.
Structured Pruning / Sparsifying. Recently, [23] proposes to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by randomly deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to a smoother channel pruning process and little accuracy loss.
[37] imposes neuron-level sparsity during training, thus some neurons can be pruned to obtain compact networks. [35] proposes a Structured Sparsity Learning (SSL) method to sparsify different levels of structure (e.g., filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, thus the optimization objective is much simpler.
Since these methods prune or sparsify parts of the network structure (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries (e.g., for sparse computing operations) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits.
Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces sub-modular/super-modular optimization for network architecture search with a given resource budget. Some recent works [38, 1] propose to learn neural architectures automatically with reinforcement learning. The search space of these methods is extremely large, thus one needs to train hundreds of models to distinguish good from bad ones. Network slimming can also be treated as an approach for architecture learning, although the choices are limited to the width of each layer. However, in contrast to the aforementioned methods, network slimming learns the network architecture through only a single training process, which is in line with our goal of efficiency.
3. Network slimming
We aim to provide a simple scheme to achieve channel-level sparsity in deep CNNs. In this section, we first discuss the advantages and challenges of channel-level sparsity, and introduce how we leverage the scaling layers in batch normalization to effectively identify and prune unimportant channels in the network.
Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at different levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained (e.g., weight-level) sparsity gives the highest flexibility and generality and leads to a higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. On the contrary, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, while it is less flexible as some whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNN or fully-connected network (treating each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, which can be efficiently inferenced on conventional CNN platforms.
Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near-zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of about 10% in the number of parameters without suffering from accuracy loss. [35] addresses this problem by enforcing sparsity regularization in the training objective. Specifically, they adopt group LASSO to push all the filter weights corresponding to the same channel towards zero simultaneously during training. However, this approach requires computing the gradients of the additional regularization term with respect to all the filter weights, which is nontrivial. We introduce a simple idea to address the above challenges, and the details are presented below.
Scaling Factors and Sparsity-induced Penalty. Our idea is to introduce a scaling factor gamma for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by

L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)    (1)

where (x, y) denote the training input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(.) is a sparsity-induced penalty on the scaling factors, and \lambda balances the two terms. In our experiment, we choose g(s) = |s|, which is known as the L1-norm and is widely used to achieve sparsity.
Subgradient descent is adopted as the optimization method for the non-smooth L1 penalty term. An alternative option is to replace the L1 penalty with the smooth-L1 penalty [30] to avoid using sub-gradients at non-smooth points.
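As a concrete illustration of this subgradient update, the sketch below adds lambda * sign(gamma) to the gradient of every BN scaling factor after the ordinary backward pass. This is a minimal PyTorch sketch written for this survey copy, not the authors' released Torch code; the model, data and criterion are placeholders, and lambda = 1e-4 is the value the paper reports for VGGNet.

import torch
import torch.nn as nn

def sparsity_step(model, optimizer, criterion, x, y, lam=1e-4):
    """One training step implementing Eq. (1): task loss plus an L1
    subgradient on the BN scaling factors (the per-channel gammas)."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Subgradient of lam * sum(|gamma|), added on top of the task-loss gradient.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.add_(lam * torch.sign(m.weight.detach()))
    optimizer.step()
    return loss.item()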
As pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel, we can directly obtain a narrow network (see Figure 1) without resorting to any special sparse computation packages. The scaling factors act as the agents for channel selection. As they are jointly optimized with the network weights, the network can automatically identify insignificant channels, which can be safely removed without greatly affecting the generalization performance.
Leveraging the Scaling Factors in BN Layers. Batch normalization [19] has been adopted by most modern CNNs as a standard approach to achieve fast convergence and better generalization performance. The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. In particular, a BN layer normalizes the internal activations using mini-batch statistics. Let z_in and z_out be the input and output of a BN layer and B denote the current mini-batch; the BN layer performs the following transformation:

\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta    (2)

where \mu_B and \sigma_B are the mean and standard deviation of the input activations over B, and \gamma and \beta are trainable affine transformation parameters (scale and shift) which provide the possibility of linearly transforming normalized activations back to any scale.
It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the \gamma parameters in BN layers as the scaling factors we need for network slimming. This has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way to learn meaningful scaling factors for channel pruning. 1) If we add scaling layers to a CNN without BN layers, the values of the scaling factors are not meaningful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations; one can obtain the same results by decreasing the scaling factor values while amplifying the weights in the convolution layers. 2) If we insert a scaling layer before a BN layer, the scaling effect of the scaling layer will be completely canceled by the normalization process in BN. 3) If we insert a scaling layer after a BN layer, there are two consecutive scaling factors for each channel.
Figure 2: Flow-chart of the network slimming procedure (train with channel sparsity regularization, prune channels with small scaling factors, fine-tune the pruned network, obtain the compact network). The dotted line is for the multi-pass/iterative scheme.
Channel Pruning and Fine-tuning. After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero (see Figure 1). Then we can prune channels with near-zero scaling factors by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune 70% of channels with lower scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with fewer parameters, less run-time memory, and fewer computing operations.
Pruning may temporarily lead to some accuracy loss when the pruning ratio is high, but this can be largely compensated by the subsequent fine-tuning of the pruned network. In our experiments, the fine-tuned narrow network can even achieve higher accuracy than the original unpruned network in many cases.
Multi-pass Scheme. We can also extend the proposed method from a single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. Specifically, a network slimming procedure results in a narrow network, on which we can again apply the whole training procedure to learn an even more compact model. This is illustrated by the dotted line in Figure 2. Experimental results show that this multi-pass scheme can lead to even better results in terms of compression rate.
Handling Cross Layer Connections and Pre-activation Structure. The network slimming process introduced above can be directly applied to most plain CNN architectures such as AlexNet [22] and VGGNet [31], while some adaptations are required when it is applied to modern networks with cross layer connections and the pre-activation design, such as ResNet [15] and DenseNet [17]. For these networks, the output of a layer may be treated as the input of multiple subsequent layers, in which a BN layer is placed before the convolutional layer. In this case, the sparsity is achieved at the incoming end of a layer, i.e., the layer selectively uses a subset of the channels it receives. To harvest the parameter and computation savings at test time, we need to place a channel selection layer to mask out the insignificant channels we have identified.
4. Experiments
We empirically demonstrate the effectiveness of network slimming on several benchmark datasets. We implement
(a) Test Errors on CIFAR-10
Model Test error (%) Parameters Pruned FLOPs Pruned
VGGNet (Baseline) 6.34 20.04M - 7.97x10^8 -
VGGNet (70% Pruned) 6.20 2.30M 88.5% 3.91x10^8 51.0%
DenseNet-40 (Baseline) 6.11 1.02M - 5.33x10^8 -
DenseNet-40 (40% Pruned) 5.19 0.66M 35.7% 3.81x10^8 28.4%
DenseNet-40 (70% Pruned) 5.65 0.35M 65.2% 2.40x10^8 55.0%
ResNet-164 (Baseline) 5.42 1.70M - 4.99x10^8 -
ResNet-164 (40% Pruned) 5.08 1.44M 14.9% 3.81x10^8 23.7%
ResNet-164 (60% Pruned) 5.27 1.10M 35.2% 2.75x10^8 44.9%
(b) Test Errors on CIFAR-100
Model Test error (%) Parameters Pruned FLOPs Pruned
VGGNet (Baseline) 26.74 20.08M - 7.97x10^8 -
VGGNet (50% Pruned) 26.52 5.00M 75.1% 5.01x10^8 37.1%
DenseNet-40 (Baseline) 25.36 1.06M - 5.33x10^8 -
DenseNet-40 (40% Pruned) 25.28 0.66M 37.5% 3.71x10^8 30.3%
DenseNet-40 (60% Pruned) 25.72 0.46M 54.6% 2.81x10^8 47.1%
ResNet-164 (Baseline) 23.37 1.73M - 5.00x10^8 -
ResNet-164 (40% Pruned) 22.87 1.46M 15.5% 3.33x10^8 33.3%
ResNet-164 (60% Pruned) 23.91 1.21M 29.7% 2.47x10^8 50.6%
(c) Test Errors on SVHN
Model Test Error (%) Parameters Pruned FLOPs Pruned
VGGNet (Baseline) 2.17 20.04M - 7.97x10^8 -
VGGNet (60% Pruned) 2.06 3.04M 84.8% 3.98x10^8 50.1%
DenseNet-40 (Baseline) 1.89 1.02M - 5.33x10^8 -
DenseNet-40 (40% Pruned) 1.79 0.65M 36.3% 3.69x10^8 30.8%
DenseNet-40 (60% Pruned) 1.81 0.44M 56.6% 2.67x10^8 49.8%
ResNet-164 (Baseline) 1.78 1.70M - 4.99x10^8 -
ResNet-164 (40% Pruned) 1.85 1.46M 14.5% 3.44x10^8 31.1%
ResNet-164 (60% Pruned) 1.81 1.12M 34.3% 2.25x10^8 54.9%
Table 1: Results on CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column 1, "60% Pruned" denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratios of parameters and FLOPs are also shown in columns 4 and 6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy can typically be maintained with >=60% channels pruned.
our method based on the publicly available Torch [5] implementation for ResNets by [10]. The code is available at https://github.com/liuzhuang13/slimming.
4.1. Datasets
CIFAR. The two CIFAR datasets [21] consist of natural images with resolution 32x32. CIFAR-10 is drawn from 10 and CIFAR-100 from 100 classes. The train and test sets contain 50,000 and 10,000 images respectively. On CIFAR-10, a validation set of 5,000 images is split from the training set for the search of \lambda (in Equation 1) on each model. We report the final test errors after training or fine-tuning on all training images. A standard data augmentation scheme (shifting/mirroring) [14, 18, 24] is adopted. The input data is normalized using channel means and standard deviations. We also compare our method with [23] on the CIFAR datasets.
SVHN. The Street View House Number (SVHN) dataset [27] consists of 32x32 colored digit images. Following common practice [9, 18, 24] we use all the 604,388 training images, from which we split a validation set of 6,000 images for model selection during training. The test set contains 26,032 images. During training, we select the model with the lowest validation error as the model to be pruned (or the baseline model). We also report the test errors of the models with the lowest validation errors during fine-tuning.
ImageNet. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. We adopt the data augmentation scheme as in [10]. We report the single-center-crop validation error of the final model.
MNIST. MNIST is a handwritten digit dataset containing 60,000 training images and 10,000 test images. To test the effectiveness of our method on a fully-connected network (treating each neuron as a channel with 1x1 spatial size), we compare our method with [35] on this dataset.
4.2. Network Models
On the CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. The VGGNet is originally designed for ImageNet classification; for our experiment a variation of the original VGGNet for the CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).
[Figure 3 bar chart: parameter and FLOP ratios (original = 100%) of the pruned VGGNet, DenseNet-40 and ResNet-164 models.]
Figure 3: Comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are parameter and FLOP ratios between pruned and original models.
On the ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) "VGG-A" network [31] model with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1x1 spatial size.
On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].
4.3. Training, Pruning and Fine-tuning
Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On the CIFAR and SVHN datasets we train using mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet and MNIST datasets, we train our models for 60 and 30 epochs respectively, with a batch size of 256 and an initial learning rate of 0.1 which is divided by 10 after 1/3 and 2/3 fractions of the training epochs. We use a weight decay of 10^-4 and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to be 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to be 1) from [10].
Training with Sparsity. For the CIFAR and SVHN datasets, when training with channel sparse regularization, the hyperparameter \lambda, which controls the tradeoff between empirical loss and sparsity, is determined by a grid search over 10^-3, 10^-4, 10^-5 on the CIFAR-10 validation set. For VGGNet we choose \lambda = 10^-4 and for ResNet and DenseNet \lambda = 10^-5. For VGG-A on ImageNet, we set \lambda = 10^-5. All other settings are kept the same as in normal training.
Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of channels are pruned. The pruning process is implemented by building a new narrower model and copying the corresponding weights from the model trained with sparsity.
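The global percentile threshold described above can be computed in a few lines. The sketch below, again in PyTorch rather than the authors' Torch code, collects the absolute scaling factors from every BN layer, derives a single threshold, and returns per-layer boolean keep-masks; the construction of the narrower network and the weight copying are omitted.

import torch
import torch.nn as nn

def channel_masks(model: nn.Module, prune_ratio: float = 0.6):
    """Global-threshold channel selection: gather |gamma| from every BN layer,
    take the prune_ratio percentile as one threshold for the whole network,
    and return a per-layer boolean mask of the channels to keep."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    values, _ = torch.sort(gammas)
    k = min(int(prune_ratio * gammas.numel()), gammas.numel() - 1)
    threshold = values[k]                       # factors below this are pruned
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}

# A new, narrower model with sum(mask) channels per layer is then built and the
# surviving weights are copied over before fine-tuning, as described above.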
Fine-tuning. After the pruning we obtain a narrower and more compact model, which is then fine-tuned. On the CIFAR, SVHN and MNIST datasets, the fine-tuning uses the same optimization setting as in training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10^-3 for only 5 epochs.
4.4. Results
CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.
Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has >=60% channels pruned while still maintaining similar accuracy to the baseline. The parameter saving can be up to 10x, and the FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as a form of channel selection. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.
Regularization Effect. From Table 1, we can observe that, on ResNet and DenseNet, typically when 40% of channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.
VGG-A                 Baseline      50% Pruned
Params                132.9M        23.2M
Params Pruned         -             82.5%
FLOPs                 4.57x10^10    3.18x10^10
FLOPs Pruned          -             30.4%
Validation Error (%)  36.69         36.66
Table 2: Results on ImageNet.
Model          Test Error (%)  Params Pruned  #Neurons
Baseline       1.43            -              784-500-300-10
Pruned [35]    1.53            83.5%          434-174-78-10
Pruned (ours)  1.49            84.4%          784-100-60-10
Table 3: Results on MNIST.
(a) Multi-pass Scheme on CIFAR-10
Iter  Trained  Fine-tuned  Params Pruned  FLOPs Pruned
1     6.38     6.51        66.7%          38.6%
2     6.23     6.11        84.7%          52.7%
3     5.87     6.10        91.4%          63.1%
4     6.19     6.59        95.6%          77.2%
5     5.96     7.73        98.3%          88.7%
6     7.79     9.70        99.4%          95.7%
(b) Multi-pass Scheme on CIFAR-100
Iter  Trained  Fine-tuned  Params Pruned  FLOPs Pruned
1     27.72    26.52       59.1%          30.9%
2     26.03    26.52       79.2%          46.1%
3     26.49    29.08       89.8%          67.3%
4     28.17    30.59       95.3%          83.0%
5     30.04    36.35       98.3%          93.5%
6     35.91    46.73       99.4%          97.7%
Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. The "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity and of the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.
ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of channels are pruned, the parameter saving is more than 5x, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve
the savings with no accuracy loss on the 1000-class Im- more compact models. On CIFAR-10, the trained modelageNet dataset, where other methods for efficient CNNs achieves the lowest test error in iteration 5. This model[2, 23, 35, 28] mostly report accuracy loss. achieves 20×parameter reduction and 5×FLOP reduction,
MNIST.On MNIST dataset, we compare our method with while still achievinglowertest error. On CIFAR-100, after
the Structured Sparsity Learning (SSL) method [35] in Ta- iteration 3, the test error begins to increase. This is pos-
ble 3. Despite our method is mainly designed to prune sibly due to that it contains more classes than CIFAR-10,
channels in convolutional layers, it also works well in prun- so pruning channels too agressively will inevitably hurt the
ing neurons in fully-connected layers. In this experiment, performance. However, we can still prune near 90% param-
we observe that pruning with a global threshold sometimes eters and near 70% FLOPs without notable accuracy loss.
completely removes a layer, thus we prune 80% of the neu-
rons in each of the two intermediate layers. Our method 5. Analysis
slightly outperforms [35], in that a slightly lower test error There are two crucial hyper-parameters in network slim-is achieved while pruning more parameters. ming, the pruned percentagetand the coefficient of the
We provide some additional experimental results in the sparsity regularization termλ(see Equation 1). In this sec-
supplementary materials, including (1) detailed structure of tion, we analyze their effects in more detail.
a compact VGGNet on CIFAR-10; (2) wall-clock time and Effect of Pruned Percentage. Once we obtain a modelrun-time memory savings in practice. (3) comparison with trained with sparsity regularization, we need to decide whata previous channel pruning method [23]; percentage of channels to prune from the model. If we
4.5. Results for Multi­pass Scheme prune too few channels, the resource saving can be very
limited. However, it could be destructive to the model if
We employ the multi-pass scheme on CIFAR datasets we prune too many channels, and it may not be possible to
using VGGNet. Since there are no skip-connections, prun- recover the accuracy by fine-tuning. We train a DenseNet-
ing away a whole layer will completely destroy the mod- 40 model withλ=10 5 on CIFAR-10 to show the effect of
els. Thus, besides setting the percentile threshold as 50%, pruning a varying percentage of channels. The results are
we also put a constraint that at each layer, at most 50% of summarized in Figure 5.
channels can be pruned. From Figure 5, it can be concluded that the classification
The test errors of models in each iteration are shown in performance of the pruned or fine-tuned models degrade
Table 4. As the pruning process goes, we obtain more and only when the pruning ratio surpasses a threshold. The fine-
[Figure 4 panels: histograms of channel scaling factor values (count vs. value in [0, 0.8]) for lambda = 0, lambda = 10^-5 and lambda = 10^-4.]
Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter lambda). With the increase of lambda, the scaling factors become sparser.
Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with lambda = 10^-5 (test error (%) vs. pruned channels (%), for the baseline, the model trained with sparsity, the pruned model, and the fine-tuned model).
Figure 6: Visualization of how the channel scaling factors change along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10 (channel index vs. epoch). Brighter color corresponds to larger value. The bright lines indicate the "selected" channels, the dark lines indicate channels that can be pruned.
tuning process can typically compensate the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on the channel scaling factors.
Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter lambda in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network with different lambda values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.
It can be observed that with the increase of lambda, the scaling factors are more and more concentrated near zero. When lambda = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When lambda = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as feature selection happening in intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process by a heatmap. Figure 6 shows the magnitude of the scaling factors from one layer in VGGNet along the training process. Each channel starts with equal weight; as the training progresses, the scaling factors of some channels become larger (brighter) while others become smaller (darker).
6. Conclusion
We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20x) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory and computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for efficient inference.
Acknowledgements. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008 / DFG TRR-169.
References
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network
architecture optimization through submodularity and super- [1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neu- modularity.arXiv preprint arXiv:1609.00074, 2016. ral network architectures using reinforcement learning. In [21] A. Krizhevsky and G. Hinton. Learning multiple layers of ICLR, 2017. features from tiny images. InTech Report, 2009. [2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetof sparsity in convolutional neural networks.arXiv preprint classification with deep convolutional neural networks. In arXiv:1702.06257, 2017. NIPS, pages 10971105, 2012. [3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Y. Chen. Compressing neural networks with the hashing Graf. Pruning filters for efficient convnets. arXiv preprint trick. InICML, 2015. arXiv:1608.08710, 2016.
[4] S. Chintala. Training an object classifier in torch-7 on [24] M. Lin, Q. Chen, and S. Yan. Network in network. InICLR,multiple gpus over imagenet. https://github.com/ 2014.soumith/imagenet-multiGPU.torch. [25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Sparse convolutional neural networks. InProceedings of the
matlab-like environment for machine learning. InBigLearn, IEEE Conference on Computer Vision and Pattern Recogni-
NIPS Workshop, number EPFL-CONF-192376, 2011. tion, pages 806814, 2015.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
neural networks with weights and activations constrained to+ networks for semantic segmentation. InCVPR, pages 3431
1 or-1.arXiv preprint arXiv:1602.02830, 2016. 3440, 2015.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer- [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y.
gus. Exploiting linear structure within convolutional net- Ng. Reading digits in natural images with unsupervised fea-
works for efficient evaluation. InNIPS, 2014. ture learning, 2011. InNIPS Workshop on Deep Learning
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- and Unsupervised Feature Learning, 2011.
ture hierarchies for accurate object detection and semantic [28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-
segmentation. InCVPR, pages 580587, 2014. net: Imagenet classification using binary convolutional neu-
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and ral networks. InECCV, 2016.
Y. Bengio. Maxout networks. InICML, 2013. [29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini.
[10] S. Gross and M. Wilber. Training and investigating residual Group sparse regularization for deep neural networks.arXiv
nets. https://github.com/szagoruyko/cifar. preprint arXiv:1607.00485, 2016.
torch. [30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Com- methods for l1 regularization: A comparative study and two
pressing deep neural network with pruning, trained quanti- new approaches. InECML, pages 286297, 2007.
zation and huffman coding. InICLR, 2016. [31] K. Simonyan and A. Zisserman. Very deep convolutional
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights networks for large-scale image recognition. InICLR, 2015.
and connections for efficient neural network. InNIPS, pages [32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse
11351143, 2015. neural networks.CoRR, abs/1611.06694, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into [33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the
rectifiers: Surpassing human-level performance on imagenet importance of initialization and momentum in deep learning.
classification. InICCV, 2015. InICML, 2013.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
for image recognition. InCVPR, 2016. D. Anguelov, D. Erhan, et al. Going deeper with convolu-
tions. InCVPR, pages 19, 2015.[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in [35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learningdeep residual networks. InECCV, pages 630645. Springer, structured sparsity in deep neural networks. InNIPS, 2016.2016. [36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and github.com/szagoruyko/cifar.torch.K. Q. Weinberger. Multi-scale dense convolutional networks
for efficient prediction. arXiv preprint arXiv:1703.09844, [37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards
2017. compact cnns. InECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with rein-[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. forcement learning. InICLR, 2017.Densely connected convolutional networks. InCVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger.
Deep networks with stochastic depth. InECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.

View File

@ -0,0 +1,933 @@
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING
MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom
Learning to Generalize
Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.
................................................ ◗
Introduction
Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output 1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in
the hope that this helps to reduce the errors on new data. for the case of realizable rules they are also independent
How well will the trained network be able to classify an in- of the specific algorithm, as long as the training examples
put that it has not seen before? This performance on new are perfectly learned. Because it is able to cover even bad
data defines the generalization ability of the network. This situations which are unfavorable for improvement of the
ability will be affected by the problem of realizability: The learning process, it is not surprising that this theory may
network may not be sufficiently complex to learn the rule in some cases provide too pessimistic results which are also
completely or there may be ambiguities in classification. too crude to reveal interesting behavior in the intermediate
Here, I concentrate on a second problem arising from the region of the learning curve.
fact that learning will mostly not be exhaustive and the in- In this article, I concentrate mainly on a different ap-
formation about the rule contained in the examples is not proach, which has its origin in statistical physics rather than
complete. Hence, the performance of a network may vary in mathematical statistics, and compare its results with the
from one training set to another. In order to treat the gen- worst-case results. This method aims at studying the typical
eralization ability in a quantitative way, a common model rather than the worst-case behavior and often enables the
assumes that all input patterns, those from the training set exact calculations of the entire learning curve for models of
and the new one on which the network is tested, have a pre- simple networks which have many parameters. Since both
assigned probability distribution (which characterizes the biological and artificial neural networks are composed of
feature that must be classified), and they are produced in- many elements, it is hoped that such an approach may ac-
dependently at random with the same probability distribu- tually reveal some relevant and interesting structures.
tion from the networks environment. Sometimes the prob- At first, it may seem surprising that a problem should
ability distribution used to extract the examples and the simplifywhenthenumberofitsconstituentsbecomeslarge.
classification of these examples is called the rule.The net- However, this phenomenon is well-known for macroscopic
works performance on novel data can now be quantified by physical systems such as gases or liquids which consist of
the so-called generalization error,which is the probability a huge number of molecules. Clearly, it is not possible to
of misclassifying the test input and can be measured by re- study the complete microscopic state of such a system,
peating the same learning experiment many times with dif- which is described by the rapidly fluctuating positions and
ferent data. velocities of all particles. On the other hand, macroscopic
Within such a probabilistic framework, neural networks quantities such as density, temperature, and pressure are
areoftenviewedasstatisticaladaptivemodelswhichshould usually collective properties influenced by all elements. For
give a likely explanation of the observed data. In this frame- such quantities, fluctuations are averaged out in the ther-
work, the learning process becomes mathematically related modynamic limit of a large number of particles and the col-
to a statistical estimation problem for optimal network pa- lective properties become, to some extent, independent of
rameters.Hence,mathematicalstatisticsseemstobeamost themicrostate.Similarly,thegeneralizationabilityofaneu-
appropriate candidate for studying a neural networks be- ral network is a collective property of all the network pa-
havior. In fact, various statistical approaches have been ap- rameters, and the techniques of statistical physics allow, at
plied to quantify the generalization performance. For ex- least for some simple but nontrivial models, for exact com-
ample, expressions for the generalization error have been putations in the thermodynamic limit. Before explaining
obtainedinthelimit,wherethenumberofexamplesislarge these ideas in detail, I provide a short description of feed-
compared to the number of couplings (Seung et al.,1992; forward neural networks.
Amari and Murata, 1993). In such a case, one can expect ................................................that learning is almost exhaustive, such that the statistical ◗
fluctuations of the parameters around their optimal values Artificial Neural Networks
are small. However, in practice the number of parameters is
often large so that the network can be flexible, and it is not Based on highly idealized models of brain function, artifi-
clear how many examples are needed for the asymptotic cial neural networks are built from simple elementary com-
theorytobecomevalid.Theasymptotictheorymayactually puting units, which are sometimes termed neurons after
miss interesting behavior of the so-called learning curve, their biological counterparts. Although hardware imple-
which displays the progress of generalization ability with mentations have become an important research topic, neu-
an increasing amount of training data. ral nets are still simulated mostly on standard computers.
A second important approach, which was introduced Each computing unit of a neural net has a single output and
into mathematical statistics in the 1970s by Vapnik and several ingoing connections which receive the outputs of
Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact other units. To every ingoing connection (labeled by the
bounds for the generalization error which are valid for any index i) a real number is assigned, the synaptic weight w,i
number of training examples. Moreover, they are entirely which is the basic adjustable parameter of the network. To
independent of the underlying distribution of inputs, and compute a units output, all incoming values x are multi- i
[Figure 1a diagram: inputs 0.6, -0.9, 0.8 reach the unit through synapses with weights 1.6, -1.4, -0.1; the weighted sum 1.6 x 0.6 + (-1.4) x (-0.9) + (-0.1) x 0.8 = 2.14 is passed through the activation function plotted below.]
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numeri-
cal values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs
reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which
the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and
step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.
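For concreteness, the computation in Fig. 1a can be written out directly. This is only an illustrative sketch: the signs of the example numbers are inferred from the printed result 2.14 (the minus signs were lost in extraction), and tanh stands in here for the sigmoidal curve.

import math

def unit_output(inputs, weights, activation="step"):
    """Weighted sum of the inputs followed by an activation function (Fig. 1a)."""
    h = sum(w * x for w, x in zip(weights, inputs))
    if activation == "step":          # hard +/-1 classification (green curve)
        return 1 if h >= 0 else -1
    if activation == "sigmoid":       # soft classification in (-1, 1) (red curve)
        return math.tanh(h)
    return h                          # linear output unit (yellow curve)

inputs = [0.6, -0.9, 0.8]             # values from Fig. 1a, signs inferred
weights = [1.6, -1.4, -0.1]
print(round(sum(w * x for w, x in zip(weights, inputs)), 2))   # 2.14
print(unit_output(inputs, weights, "step"))                    # 1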
plied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, sum_i w_i x_i, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between -1 and 1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.
The Perceptron
FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.
The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings w_i, and the output is simply

a = \sum_{i=1}^{N} w_i x_i    [1]

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines. To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights w_i, for i = 1, ..., N). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the
output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign, but we decrease them for the opposite sign. This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of a learning process in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classify correctly all of the examples (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant.
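A minimal NumPy sketch of Eq. [1] and the learning rule just described follows. The toy data, the step size eta and the cycle limit are illustrative choices, not taken from the article; the update w += eta * t * x increases the weights whose input sign agrees with the target and decreases the others, as in Rosenblatt's rule.

import numpy as np

def perceptron_output(w, x):
    """Eq. [1] followed by the step activation: the sign of the weighted sum."""
    return 1 if np.dot(w, x) >= 0 else -1

def rosenblatt_train(X, y, eta=0.1, max_cycles=1000):
    """Present the patterns cyclically; on every misclassification move each
    weight by a fixed amount in the direction of (target * input)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_cycles):
        errors = 0
        for x, t in zip(X, y):
            if perceptron_output(w, x) != t:
                w += eta * t * x
                errors += 1
        if errors == 0:        # all training examples classified correctly
            break
    return w

# Toy linearly separable data (hypothetical, generated by a random teacher)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
teacher = rng.normal(size=5)
y = np.sign(X @ teacher).astype(int)
w = rosenblatt_train(X, y)
print(np.mean([perceptron_output(w, x) == t for x, t in zip(X, y)]))  # 1.0 once converged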
It is often useful to obtain an intuition of a perceptrons xa 1
classification performance by thinking in terms of a geo-
metric picture. We may view the numerical values of the in-
puts as the coordinates of a point in some (usually) high-
dimensional space. The case of two dimensions is shown
in Fig. 2b. A corresponding point is also constructed for the
couplings w.The arrow which points from the origin of the i
coordinate system to this latter point is called the weight
vector or coupling vector. An application of linear algebra
tothecomputationofthenetworkshowsthatthelinewhich
is perpendicular to the coupling vector is the boundary be-
tween inputs belonging to the two different classes. Input
points which are on the same side as the coupling vector are
classified as 1 (the green region in Fig. 2b) and those on
the other side as 1 (red region in Fig. 2b).
Rosenblatts algorithm aims to determine such a line
when it is possible. This picture generalizes to higher di- direction of coupling vectorb
mensions, for which a hyperplane plays the same role of the FIGURE 3 (a) Projection of 200 random points (with ran-
line of the previous two-dimensional example. We can still dom labels) from a 200-dimensional space onto the first two
obtainanintuitivepicturebyprojectingontwo-dimensional coordinate axes (x and x). (b) Projection of the same points 1 2
planes. In Fig. 3a, 200 input patterns with random coordi- onto a plane which contains the coupling vector of a perfectly
nates (randomly labeled red and blue) in a 200-dimensional trained perceptron.
input space are projected on the plane spanned by two arbi-
trary coordinate axes. If we instead use a plane for projec-
tion which contains the coupling vector (determined from tions for small changes of the couplings). Hence, in general,
a variant of Rosenblatts algorithm) we obtain the view in addition to the perfectly learnable perceptron case in
shown in Fig. 3b, in which red and green points are clearly which the final error is zero, minimizing the training error
separated and there is even a gap between the two clouds. is usually a difficult task which could take a large amount of
It is evident that there are cases in which the two sets of computer time. However, in practice, iterative approaches,
points are too mixed and there is no line in two dimensions which are based on the minimization of other smooth cost
(or no hyperplane in higher dimensions which separates functions,areusedtotrainaneuralnetwork(Bishop,1995).
them). In these cases, the rule is too complex to be per- ................................................fectly learned by a perceptron. If this happens, we must at- ◗
tempt to determine the choice of the coupling which mini- Capacity, VC Dimension,
mizesthenumberoferrorsonagivensetofexamples.Here, and Worst-Case Generalization
Rosenblatts algorithm does not work and the problem of
finding the minimum is much more difficult from the algo- As previously shown, perceptrons are only able to realize a
rithmic point. The training error, which is the number of very restricted type of classification rules, the so-called lin-
errorsmadeonthetrainingset,isusuallyanonsmoothfunc- early separable ones. Hence, independently from the issue
tion of the network couplings (i.e., it may have large varia- of finding the best algorithm to learn the rule, one may ask
the following question: In how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning.

In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated, for any set of m input patterns, the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished:

Region in which m/N ≤ 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters).

Region in which m/N > 1: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), if the network is large enough almost all mappings can be learned. If the output labels for each of the m inputs are chosen randomly +1 or -1 with equal probability, the probability of finding a nonrealizable coupling goes to zero exponentially when N goes to infinity at fixed ratio m/N.

Region in which m/N > 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases to zero rapidly and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to exp[-Nf(m/N)], where the function f(α) vanishes for α ≤ 2 and it is positive for α > 2). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size.

FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red).

Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained to learn m examples on the training set?

To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single pattern for test. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5, assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2.

FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible.

This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the type of classifier. It equals N for the perceptron.
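The counting argument by Cover (1965) behind Fig. 4 can be reproduced in a few lines. The sketch below (Python/NumPy-free; not from the article, only an illustration of the well-known formula) evaluates the fraction C(m, N)/2^m of linearly realizable labelings, with C(m, N) = 2 Σ_{k<N} binom(m-1, k) for points in general position, and shows how the drop around m/N = 2 sharpens as N grows.

```python
# Minimal sketch of Cover's (1965) counting result: the fraction of all 2^m
# labelings of m points in general position that a perceptron with N couplings
# (and no bias) can realize.  Illustrative only; function name is made up.
from math import comb

def realizable_fraction(m: int, n: int) -> float:
    """C(m, N) / 2^m with C(m, N) = 2 * sum_{k < N} binom(m - 1, k)."""
    if m <= n:
        return 1.0
    c = 2 * sum(comb(m - 1, k) for k in range(n))
    return c / 2 ** m

for n in (10, 20, 100):
    row = [round(realizable_fraction(int(a * n), n), 3)
           for a in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
    print(f"N = {n:3d}:", row)   # the transition around m/N = 2 sharpens with N
```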
Vapnik and Chervonenkis were able to show that for any training set of size m larger than the VC dimension D_VC, the growth of the number of realizable mappings is bounded by an expression which grows much slower than 2^m (in fact, only like a polynomial in m). They proved that a large difference between training error (i.e., the minimum percentage of errors that is made on the training set) and generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above D_VC. This theorem implies a small expected generalization error when perfect learning of the training set results. The expected generalization error is bounded by a quantity which increases proportionally to D_VC and decreases (neglecting logarithmic corrections in m) inversely proportional to m. Conversely, one can construct a worst-case distribution of input patterns for which a size of the training set larger than D_VC is also necessary for good generalization. The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data.

The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error, as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates by the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which are obtained from an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules.

FIGURE 6 As the complexity of the network varies (i.e., of the number of hidden units, as shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik-Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability.

Typical Scenario: The Approach of Statistical Physics

When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation? As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher-student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern lies between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together, and it decreases to zero when both coincide.

FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively.
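The proportionality between the teacher-student angle and the generalization error is easy to check numerically. The following sketch is not from the article; it assumes a spherically symmetric (Gaussian) input distribution and a hypothetical student obtained by perturbing the teacher, and compares the analytic value ε = θ/π with a Monte Carlo estimate.

```python
# Sketch: for sign(w.x) classifiers and spherically symmetric inputs, the
# disagreement probability between teacher TE and student ST equals
# (angle between TE and ST) / pi.  Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N = 200
TE = rng.standard_normal(N)                 # teacher couplings
ST = TE + 0.7 * rng.standard_normal(N)      # a hypothetical imperfect student

cos = TE @ ST / (np.linalg.norm(TE) * np.linalg.norm(ST))
eps_analytic = np.arccos(cos) / np.pi

X = rng.standard_normal((20_000, N))        # random test patterns
eps_mc = np.mean(np.sign(X @ TE) != np.sign(X @ ST))

print(f"angle/pi = {eps_analytic:.3f}, Monte Carlo estimate = {eps_mc:.3f}")
```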
In the limit when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from each other, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case.

The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then V(ε) would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets. From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of their behavior on the training set, and multiplying it by the probability B that a student with generalization error ε gives m times the correct answers on independent drawings of the input patterns. Since A increases exponentially with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN. α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N → ∞, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms. The first one (which is often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε ≈ 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig. 8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of V(ε) at some value of ε which by definition is the typical generalization error.

FIGURE 8 Logarithm of the average volume of students that have learned m examples and give ε generalization error (green curve). The blue and red curves represent the energetic and entropic contributions, respectively. (The vertical axis shows (1/N) log V(ε).)

The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter term dominates when the number of examples is large. It will later be shown that such a competition can lead to a rich and interesting behavior as the number of examples is varied.
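This competition can be made concrete in the simplest (so-called annealed) approximation for a spherical perceptron, where the entropic term behaves like ln sin(πε) and the energetic term like α ln(1 − ε). Neither expression appears explicitly in this article, so the sketch below should be read only as an illustration of the qualitative picture of Fig. 8, not as the full quenched calculation.

```python
# Annealed sketch of (1/N) log V(eps) ~ entropic + energetic term for a
# spherical perceptron student.  Assumptions (not from the article):
# entropy ~ log sin(pi * eps), energy ~ alpha * log(1 - eps).
import numpy as np

eps = np.linspace(1e-3, 0.5, 500)

def log_volume(alpha: float) -> np.ndarray:
    entropic = np.log(np.sin(np.pi * eps))   # many dissimilar students near eps = 0.5
    energetic = alpha * np.log(1.0 - eps)    # favors students similar to the teacher
    return entropic + energetic

for alpha in (0.5, 2.0, 8.0):
    eps_typical = eps[np.argmax(log_volume(alpha))]
    print(f"alpha = {alpha:4.1f}: typical generalization error ~ {eps_typical:.3f}")
```

The maximizer of the sum plays the role of the typical generalization error and moves toward zero as α grows, which is exactly the competition described above.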
The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al., 1990) of a perceptron obtained by the statistical physics approach (treating the random sampling the proper way) is shown by the red curve of Fig. 9. In contrast to the worst-case predictions of the VC theory, it is possible to have some generalization ability below VC dimension or capacity. As we might have expected, the generalization error decreases monotonically, showing that the more that is learned, the more that is understood. Asymptotically, the error is proportional to N and inversely proportional to m, in agreement with the VC predictions. This may not be true for more complicated networks.

FIGURE 9 Learning curves for typical student perceptrons. α = m/N is the ratio between the number of examples and the coupling number. (The two curves are labeled "continuous couplings" and "discrete couplings.")

Query Learning

Soon after Gardner's pioneering work, it was realized that the approach of statistical physics is closely related to ideas in information theory and Bayesian statistics (Levin et al., 1989; Györgyi and Tishby, 1990; Opper and Haussler, 1991), for which the reduction of an initial uncertainty about the true state of a system (teacher) by observing data is a central topic of interest. The logarithm of the volume of relevant microstates as defined in the previous section is a direct measure for such uncertainty. The moderate progress in generalization ability displayed by the red learning curve of Fig. 9 can be understood by the fact that as learning progresses less information about the teacher is gained from a new random example. Here, the information gain is defined as the reduction of the uncertainty when a new example is learned. The decrease in information gain is due to the increase in the generalization performance. This is plausible because inputs for which the majority of student networks give the correct answer are less informative than those for which a mistake is more likely. The situation changes if the student is free to ask the teacher questions, i.e., if the student can choose highly informative input patterns. For the simple perceptron a fruitful query strategy is to select a new input vector which is perpendicular to the current coupling vector of the student (Kinzel and Ruján, 1990). Such an input is a highly ambiguous pattern because small changes in the student couplings produce different classification answers. For more complicated networks it may be difficult to obtain similar ambiguous inputs by an explicit construction. A general algorithm has been proposed (Seung et al., 1992a) which uses the principle of maximal disagreement in a committee of several students as a selection process for training patterns. Using an appropriate randomized training strategy, different students are generated which all learn the same set of examples. Next, any new input vector is only accepted for training when the disagreement of its classification between the students is maximal. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast.
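A toy version of the perpendicular query strategy is easy to write down. The sketch below is illustrative and not from the article: it trains a perceptron student with the classical perceptron rule, but on queries that have been projected orthogonally to the current student couplings, so that every query is maximally ambiguous for the student.

```python
# Illustrative sketch of query learning for a perceptron (not from the article).
# Each query is made perpendicular to the current student coupling vector.
import numpy as np

rng = np.random.default_rng(1)
N = 100
teacher = rng.standard_normal(N)

def perpendicular_query(w):
    x = rng.standard_normal(N)
    return x - (x @ w) / (w @ w) * w      # project out the student direction

w = rng.standard_normal(N)                # random initial student
for _ in range(500):
    x = perpendicular_query(w)            # maximally ambiguous pattern
    y = np.sign(teacher @ x)              # teacher's label for the query
    if np.sign(w @ x) != y:               # classical perceptron update rule
        w = w + y * x

angle = np.arccos(w @ teacher / (np.linalg.norm(w) * np.linalg.norm(teacher)))
print("generalization error after 500 queries ~", round(angle / np.pi, 3))
```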
Bad Students and Good Students

Although the typical student perceptron has a smooth, monotonically decreasing learning curve, the possibility that some concrete learning algorithm may result in a set of student couplings which are untypical in the sense of our theory cannot be ruled out. For bad students, even nonmonotonic generalization behavior is possible. The problem of a concrete learning algorithm can be made to fit into the statistical physics framework if the algorithm minimizes a certain cost function. Treating the achieved values of the new cost function as a macroscopic constraint, the tools of statistical physics apply again.

As an example, it is convenient to consider a case in which the teacher and the student have a different architecture: In one of the simplest examples one tries to learn a classification problem by interpreting it as a regression problem, i.e., a problem of fitting a continuous function through data points. To be specific, we study the situation in which the teacher network is still given by a perceptron which computes binary valued outputs of the form y = sign(Σ_i w_i x_i) = ±1, but as the student we choose a network with a linear transfer function (the yellow curve in Fig. 1a),

Y = Σ_i w_i x_i,

and try to fit this linear expression to the binary labels of the teacher. If the number of couplings is sufficiently large (larger than the number of examples), the linear function (unlike the sign) is perfectly able to fit arbitrary continuous output values. This linear fit is an attempt to explain the data in a more complicated way than necessary, and the couplings have to be finely tuned in order to achieve this goal. We find that the student trained in such a way does not generalize well (Opper and Kinzel, 1995). In order to compare the classifications of teacher and student on a new random input after training, we have finally converted the student's output into a classification label by taking the sign of its output. As shown in the red curve of Fig. 10, after an initial improvement of performance the generalization error increases again to the random guessing value ε = 0.5 at α = 1. This phenomenon is called overfitting. For α > 1 (i.e., for more data than parameters), it is no longer possible to have a perfect linear fit through the data, but a fit with a minimal deviation from a linear function leads to the second part of the learning curve. ε decreases again and approaches 0 asymptotically for α → ∞. This shows that when enough data are available, the details of the training algorithm are less important.

FIGURE 10 Learning curves for a linear student and for a margin classifier. α = m/N.

The dependence of the generalization performance on the complexity of the assumed data model is well known. If a function class is used that is too complex, data values can be perfectly fitted but the predicted function will be very sensitive to the variations of the data sample, leading to very unreliable predictions on novel inputs. On the other hand, functions that are too simple make the best fit almost insensitive to the data, which prevents us from learning enough from them.

It is also possible to calculate the worst-case generalization ability of perceptron students learning from a perceptron teacher. The largest generalization error is obtained (Fig. 7) when the angle between the coupling vectors of teacher and student is maximized under the constraint that the student learns all examples perfectly. Although it may not be easy to construct a learning algorithm which performs such a maximization in practice, the resulting generalization error can be calculated using the statistical physics approach (Engel and Van den Broeck, 1993). The result is in agreement with the VC theory: There is no prediction better than random guessing below the capacity.

Although the previous algorithms led to a behavior which is worse than the typical one, we now examine the opposite case of an algorithm which does better. Since the generalization ability of a neural network is related to the fact that similar input vectors are mapped onto the same output, one can assume that such a property can be enhanced if the separating gap between the two classes is maximized, which defines a new cost function for an algorithm. This optimal margin perceptron can be practically realized and when applied to a set of data leads to the projection of Fig. 11. As a remarkable result, it can be seen that there is a relatively large fraction of patterns which are located at the gap. These points are called support vectors (SVs). In order to understand their importance for the generalization ability, we make the following gedankenexperiment and assume that all the points which lie outside the gap (the nonsupport vectors) are eliminated from the training set of examples. From the two-dimensional projection of Fig. 11, we may conjecture that by running the maximal margin algorithm on the remaining examples (the SVs) we cannot create a larger gap between the points. Hence, the algorithm will converge to the same separating hyperplane as before. This intuitive picture is actually correct. If the SVs of a training set were known beforehand (unfortunately, they are only identified after running the algorithm), the margin classifier would have to be trained only on the SVs. It would automatically classify the rest of the training inputs correctly.

FIGURE 11 Learning with a margin classifier and m = 300 examples in an N = 150-dimensional space.
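The gedankenexperiment above can be played through numerically. The sketch below is illustrative: it uses scikit-learn's linear SVC with a large penalty as a stand-in for the optimal margin perceptron (an assumption about tooling, not part of the article), trains it once on all examples, retrains it on the support vectors only, and compares the two hyperplanes.

```python
# Illustrative check of the support-vector gedankenexperiment: retraining a
# (nearly) hard-margin linear classifier on its support vectors alone should
# recover essentially the same separating hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N, m = 150, 300
teacher = rng.standard_normal(N)
X = rng.standard_normal((m, N))
y = np.sign(X @ teacher)                       # labels from a teacher perceptron

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin
sv = clf.support_                              # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])

w1, w2 = clf.coef_.ravel(), clf_sv.coef_.ravel()
cos = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
print(f"{len(sv)} support vectors out of {m}; cosine between hyperplanes: {cos:.4f}")
```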
Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability.

The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinearly separable rules can be learned, providing an interesting alternative to multilayer networks.

The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and -1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression about the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher, so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures T we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples.
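A minimal version of such a stochastic training procedure for the Ising perceptron is sketched below. It is illustrative only: single-coupling-flip Metropolis dynamics on the training error, with all sizes, the temperature, and the number of sweeps chosen for readability rather than taken from the article.

```python
# Illustrative Metropolis training of an Ising perceptron (couplings in {-1, +1}).
# A proposed single-coupling flip is accepted if it lowers the number of training
# errors, and otherwise with probability exp(-(increase in errors) / T).
import numpy as np

rng = np.random.default_rng(0)
N, alpha, T = 101, 3.0, 0.5
m = int(alpha * N)

teacher = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(m, N))
y = np.sign(X @ teacher)

def train_errors(w):
    return int(np.sum(np.sign(X @ w) != y))

w = rng.choice([-1, 1], size=N)           # random initial student
err = train_errors(w)
for _ in range(50_000):
    j = rng.integers(N)
    w[j] = -w[j]                          # propose flipping one binary coupling
    new_err = train_errors(w)
    if new_err <= err or rng.random() < np.exp(-(new_err - err) / T):
        err = new_err                     # accept the flip
    else:
        w[j] = -w[j]                      # reject: undo the flip

overlap = (w @ teacher) / N
print(f"training errors: {err}/{m}, teacher overlap R = {overlap:.2f}")
```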
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with ε ≈ 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will always converge to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α (α4 > α3 > α2 > α1).

More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

FIGURE 13 A two-layer network with tree architecture. The arrow indicates the direction of propagation of the information.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units; that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs; that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction. Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter.

FIGURE 14 Learning curves for the parity machine with tree architecture. Each curve represents the generalization error ε as a function of α and is distinguished by the number of hidden units of the network.
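For concreteness, the two prewired output functions can be written down directly. The sketch below is illustrative and not code from the article; it assumes a generic tree layout in which each of K hidden units sees its own block of N/K inputs.

```python
# Illustrative tree committee machine and tree parity machine with K hidden
# units, each wired to its own disjoint block of N/K inputs (no adaptive
# hidden-to-output couplings).
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 150                                  # assumes N % K == 0, K odd
w = rng.standard_normal((K, N // K))           # first-layer couplings
x = rng.standard_normal(N)

blocks = x.reshape(K, N // K)                  # disjoint receptive fields (tree)
hidden = np.sign(np.sum(w * blocks, axis=1))   # +-1 states of the hidden units

committee_output = np.sign(np.sum(hidden))     # majority vote of the hidden units
parity_output = np.prod(hidden)                # parity of the +-1 hidden states
print(hidden, committee_output, parity_output)
```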
In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to -1 acts as exactly the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.

Outlook

The worst-case approach of the VC theory and the typical-case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior and an interpolation to the other extreme, the worst-case scenario, are important subjects of research.

Acknowledgments

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.
References Cited

Amari, S., and Murata, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
Barkai, E., Hansel, D., and Kanter, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
Carnevali, P., and Patarnello, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
Engel, A., and Van den Broeck, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
Gardner, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
Gardner, E., and Derrida, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
Györgyi, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
Györgyi, G., and Tishby, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
Hansel, D., Mato, G., and Meunier, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
Kinzel, W., and Ruján, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
Levin, E., Tishby, N., and Solla, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
Mézard, M., Parisi, G., and Virasoro, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
Monasson, R., and Zecchina, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
Opper, M., and Haussler, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
Opper, M., and Kinzel, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
Saad, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
Schwarze, H., and Hertz, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
Schwarze, H., and Hertz, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
Seung, H. S., Sompolinsky, H., and Tishby, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
Seung, H. S., Opper, M., and Sompolinsky, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
Sompolinsky, H., Tishby, N., and Seung, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
Urbanczik, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
Vallet, F., Cailton, J., and Refregier, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vapnik, V. N., and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

Arbib, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Hertz, J. A., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Minsky, M., and Papert, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
Watkin, T. L. H., Rau, A., and Biehl, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.

Binary file not shown.

BIN
Corpus/MOGRIFIER LSTM.txt Normal file

Binary file not shown.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,662 @@
Movement Pruning:
Adaptive Sparsity by Fine-Tuning
Victor Sanh 1 , Thomas Wolf 1 , Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co;arush@cornell.edu
arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract
Magnitude pruning is a widely used strategy for reducing model size in pure
supervised learning; however, it is less effective in the transfer learning regime that
has become standard for state-of-the-art natural language processing applications.
We propose the use ofmovement pruning, a simple, deterministic first-order weight
pruning method that is more adaptive to pretrained model fine-tuning. We give
mathematical foundations to the method and compare it to existing zeroth- and
first-order pruning methods. Experiments show that when pruning large pretrained
language models, movement pruning shows significant improvements in high-
sparsity regimes. When combined with distillation, the approach achieves minimal
accuracy loss with down to only 3% of the model parameters.
1 Introduction
Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art
performance in applications in natural language processing and related fields. In this setup, a large
model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to
perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and
dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these
large models, and training the models have high environmental costs [Strubell et al., 2019].
Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at
only a small cost of accuracy. Pruning methods, which remove weights based on their importance,
are a particularly simple and effective method for compressing models to be sent to edge devices such
as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high
absolute values, is the most widely used method for weight pruning. It has been applied to a large
variety of architectures in computer vision [Guo et al., 2016], in language processing [Gale et al.,
2019], and more recently has been leveraged as a core component in thelottery ticket hypothesis
[Frankle et al., 2019].
While magnitude pruning is highly effective for standard supervised learning, it is inherently less
useful in the transfer learning regime. In supervised learning, weight values are primarily determined
by the end-task training data. In transfer learning, weight values are mostly predetermined by the
original model and are only fine-tuned on the end task. This prevents these methods from learning to
prune based on the fine-tuning step, or “fine-pruning.”
In this work, we argue that to effectively reduce the size of models for transfer learning, one should
instead usemovement pruning, i.e., pruning approaches that consider the changes in weights during
fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and
high values can be pruned if they shrink during training. This strategy moves the selection criteria
from the 0th to the 1st-order and facilitates greater pruning based on the fine-tuning objective. To
test this approach, we introduce a particularly simple, deterministic version of movement pruning
utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019,
Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of
remaining weights), we observe significant improvements over magnitude pruning and other 1st-order
methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original
BERT performance with only 5% of the encoder's weights on natural language inference (MNLI)
[Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of
the differences between magnitude pruning and movement pruning shows that the two methods lead
to radically different pruned models with movement pruning showing greater ability to adapt to the
end-task.
2 Related Work
In addition to magnitude pruning, there are many other approaches for generic model weight pruning.
Most similar to our approach are methods for using parallel score matrices to augment the weight
matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied for convo-
lutional networks. Differing from our methods, these methods keep the weights of the model fixed
(either from a randomly initialized network or a pre-trained network) and the scores are updated to
find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights.
LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for
deletion. Our method does not require the (possibly costly) computation of second-order derivatives
since the importance scores are obtained simply as the by-product of the standard fine-tuning. Theis
et al. [2018] and Ding et al. [2019] use the absolute value or the square value of the gradient. In
contrast, we found it useful to preserve the direction of movement in our algorithm.
Compressing pretrained language models for transfer learning is also a popular area of study. Other
approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning
[Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model
and targets individual weight. We also show that having a teacher can further improve our approach.
Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train
sparse language models from scratch. This differs from our approach which focuses on the fine-tuning
stage. Finally, another popular compression approach is quantization. Quantization has been applied
to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014]
providing high memory compression rates at the cost of no or little performance. As shown in
previous works [Li et al., 2020, Han et al., 2016] quantization and pruning are complimentary and
can be combined to further improve the performance/size ratio.
3 Background: Score-Based Pruning
We first establish shared notation for discussing different neural network pruning strategies. Let W ∈ R^{n×n} refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores S ∈ R^{n×n}. Given importance scores, each pruning strategy computes a mask M ∈ {0,1}^{n×n}. Inference for an input x becomes a = (W ⊙ M)x, where ⊙ is the Hadamard product. A common strategy is to keep the top-v percent of weights by importance. We define Top_v as a function which selects the v% highest values in S:

    \text{Top}_v(S)_{i,j} = \begin{cases} 1, & S_{i,j} \text{ in top } v\% \\ 0, & \text{o.w.} \end{cases}    (1)

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores S = (|W_{i,j}|)_{1 \le i,j \le n}, and masks M = Top_v(S) (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.
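A score-based Top_v mask as in Eq (1) is only a few lines of tensor code. The sketch below (PyTorch; an illustration, not the authors' released implementation, with magnitude scores S = |W| chosen as an example) keeps the v% highest-scoring entries of a weight matrix.

```python
# Minimal sketch of score-based Top_v masking (Eq 1), here with magnitude scores.
import torch

def topv_mask(scores: torch.Tensor, v: float) -> torch.Tensor:
    """Binary mask keeping the v% highest-scoring entries (v in (0, 100])."""
    k = max(1, int(round(scores.numel() * v / 100.0)))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).to(scores.dtype)

W = torch.randn(768, 768)
S = W.abs()                      # magnitude pruning: scores are |W_ij|
M = topv_mask(S, v=15.0)         # keep 15% of the weights
x = torch.randn(768)
a = (W * M) @ x                  # inference with the pruned layer
print(f"density: {M.mean().item():.3f}")
```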
                   | Magnitude pruning | L0 regularization | Movement pruning | Soft movement pruning
Pruning Decision   | 0th order | 1st order | 1st order | 1st order
Masking Function   | Top_v | Continuous Hard-Concrete | Top_v | Thresholding
Pruning Structure  | Local or Global | Global | Local or Global | Global
Learning Objective | L | L + λ_{l0} E(L0) | L | L + λ_{mvp} R(S)
Gradient Form      | (none) | Gumbel-Softmax | Straight-Through | Straight-Through
Scores S           | |W_{i,j}| | -Σ_t (∂L/∂W_{i,j})^(t) W_{i,j}^(t) f(\bar{S}_{i,j}^(t)) | -Σ_t (∂L/∂W_{i,j})^(t) W_{i,j}^(t) | -Σ_t (∂L/∂W_{i,j})^(t) W_{i,j}^(t)

Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of f for L0 regularization is detailed in Eq (3).
In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level v during training using a cubic sparsity scheduler:

    v^{(t)} = v_f + (v_i - v_f)\left(1 - \frac{t - t_i}{n\Delta t}\right)^3.

The sparsity level at time step t, v^{(t)}, is increased from an initial value v_i (usually 0) to a final value v_f in n pruning steps after t_i steps of warm-up. The model is thus pruned and trained jointly.
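A sketch of the cubic scheduler (a hypothetical helper matching the formula above, not the authors' code) is shown below.

```python
# Cubic sparsity schedule: the pruned fraction goes from v_i to v_f in n pruning
# steps of size dt, after t_i warm-up steps.
def cubic_sparsity(t: int, v_i: float, v_f: float, t_i: int, n: int, dt: int) -> float:
    if t < t_i:
        return v_i
    if t >= t_i + n * dt:
        return v_f
    progress = (t - t_i) / (n * dt)
    return v_f + (v_i - v_f) * (1.0 - progress) ** 3

# e.g. ramp sparsity from 0% to 97% over 10 pruning steps of 1k updates,
# after 2k warm-up steps
for step in (0, 2_000, 5_000, 9_000, 12_000, 20_000):
    print(step, round(cubic_sparsity(step, 0.0, 0.97, 2_000, 10, 1_000), 3))
```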
4 Movement Pruning
Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running
model. In this work, we focus on movement pruning methods where importance is derived from
first-order information. Intuitively, instead of selecting weights that are far from zero, we retain
connections that are moving away from zero during the training process. We consider two versions of
movement pruning: hard and soft.
For (hard) movement pruning, masks are computed using the Top_v function: M = Top_v(S). Unlike magnitude pruning, during training we learn both the weights W and their importance scores S. During the forward pass, we compute for all i: a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k.

Since the gradient of Top_v is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, Top_v is ignored and the gradient goes "straight-through" to S. The approximation of the gradient of the loss L with respect to S_{i,j} is given by

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j    (2)
This implies that the scores of weights are updated, even if these weights are masked in the forward
pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
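A compact PyTorch sketch of this masking scheme is given below. It is an illustrative re-implementation under the straight-through assumption of Eq (2), not the authors' released code; the module and parameter names are made up.

```python
# Sketch of hard movement pruning: Top_v masking in the forward pass, with a
# straight-through gradient to the scores S in the backward pass (Eq 2).
import torch
import torch.nn as nn

class TopVStraightThrough(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, v):
        k = max(1, int(round(scores.numel() * v / 100.0)))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).to(scores.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None       # gradient passes straight through to S

class MovementPrunedLinear(nn.Module):
    def __init__(self, in_features, out_features, v=15.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.scores = nn.Parameter(torch.zeros(out_features, in_features))
        self.v = v

    def forward(self, x):
        mask = TopVStraightThrough.apply(self.scores, self.v)
        return x @ (self.weight * mask).t()

layer = MovementPrunedLinear(768, 768)
out = layer(torch.randn(4, 768))
out.sum().backward()     # both weight.grad and scores.grad are now populated
print(layer.scores.grad.abs().sum().item() > 0)
```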
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter v with a fixed global threshold value τ that controls the binary mask. The mask is calculated as M = (S > τ). In order to control the sparsity level, we add a regularization term R(S) = λ_{mvp} \sum_{i,j} σ(S_{i,j}) which encourages the importance scores to decrease over time.^1 The coefficient λ_{mvp} controls the penalty intensity and thus the sparsity level.

Finally we note that these approaches yield an update similar to L0 regularization based pruning, another movement based pruning approach [Louizos et al., 2017]. Instead of straight-through, L0 uses the hard-concrete distribution, where the mask M is sampled for all i,j with hyperparameters b > 0, l < 0, and r > 1:

    u ∼ U(0,1)
    \bar{S}_{i,j} = σ((\log(u) − \log(1−u) + S_{i,j}) / b)
    Z_{i,j} = (r − l)\bar{S}_{i,j} + l
    M_{i,j} = \min(1, \mathrm{ReLU}(Z_{i,j}))

The expected L0 norm has a closed form involving the parameters of the hard-concrete: E(L0) = \sum_{i,j} σ(S_{i,j} − b \log(−l/r)). Thus, the weights and scores of the model can be optimized in

^1 We also experimented with \sum_{i,j} |S_{i,j}|, but it turned out to be harder to tune while giving similar results.
(a) Magnitude pruning (b) Movement pruning
Figure 1: During fine-tuning (on MNLI), the weights stay close to their pre-trained values which
limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are
plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning
selects weights that are moving away from 0.
an end-to-end fashion to minimize the sum of the training loss L and the expected L0 penalty. A coefficient λ_{l0} controls the L0 penalty and indirectly the sparsity level. Gradients take a similar form:

    \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j f(\bar{S}_{i,j}) \quad \text{where} \quad f(\bar{S}_{i,j}) = \frac{r - l}{b} \bar{S}_{i,j} (1 - \bar{S}_{i,j}) \mathbf{1}\{0 \le Z_{i,j} \le 1\}    (3)

At test time, a non-stochastic estimation of the mask is used: \hat{M} = \min(1, \mathrm{ReLU}((r − l)σ(S) + l)), and weights multiplied by 0 can simply be discarded.
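Soft movement pruning only changes the masking function and the objective. The sketch below is again illustrative, with made-up helper names; note that in the full method the task loss also reaches S through the straight-through trick shown earlier, while this fragment only exercises the thresholded mask and the sigmoid regularizer.

```python
# Sketch of soft movement pruning: a global threshold tau on the scores defines
# the mask, and a regularizer lambda_mvp * sum(sigmoid(S)) controls sparsity.
import torch

def soft_movement_mask(scores: torch.Tensor, tau: float) -> torch.Tensor:
    return (scores > tau).to(scores.dtype)

def soft_movement_regularizer(scores: torch.Tensor, lambda_mvp: float) -> torch.Tensor:
    return lambda_mvp * torch.sigmoid(scores).sum()

S = torch.randn(768, 768, requires_grad=True)
W = torch.randn(768, 768)
x = torch.randn(768)

mask = soft_movement_mask(S, tau=0.0)
task_loss = ((W * mask) @ x).pow(2).mean()     # stand-in for the fine-tuning loss
loss = task_loss + soft_movement_regularizer(S, lambda_mvp=1e-3)
loss.backward()        # here only the regularizer pushes the scores S down;
                       # the straight-through estimator would add the task signal
print(f"density: {mask.mean().item():.3f}")
```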
Table 1 highlights the characteristics of each pruning method. The main differences are in the masking
functions, pruning structure, and the final gradient form.
Method Interpretation  In movement pruning, the gradient of L with respect to W_{i,j} is given by the standard gradient derivation: \frac{\partial L}{\partial W_{i,j}} = \frac{\partial L}{\partial a_i} M_{i,j} x_j. By combining it with Eq (2), we have \frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial W_{i,j}} W_{i,j} (we omit the binary mask term M_{i,j} for simplicity). From the gradient update in Eq (2), S_{i,j} is increasing when \frac{\partial L}{\partial S_{i,j}} < 0, which happens in two cases:

(a) \frac{\partial L}{\partial W_{i,j}} < 0 and W_{i,j} > 0

(b) \frac{\partial L}{\partial W_{i,j}} > 0 and W_{i,j} < 0

It means that during training W_{i,j} is increasing while being positive or is decreasing while being negative. It is equivalent to saying that S_{i,j} is increasing when W_{i,j} is moving away from 0. Inversely, S_{i,j} is decreasing when \frac{\partial L}{\partial S_{i,j}} > 0, which means that W_{i,j} is shrinking towards 0.

While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 (|W_{i,j}|), movement pruning selects the weights which are moving the most away from 0 (S_{i,j}). For this reason, magnitude pruning can be seen as a 0th order method, whereas movement pruning is based on a 1st order signal. In fact, S can be seen as an accumulator of movement: from equation (2), after T gradient updates, we have

    S_{i,j}^{(T)} = -\alpha_S \sum_{t<T} \left(\frac{\partial L}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)}    (4)
Figure 1 shows this difference empirically by comparing weight values during fine-tuning against
their pre-trained value. As observed by Gordon et al. [2020], fine-tuned weights stay close in absolute
value to their initial pre-trained values. For magnitude pruning, this stability around the pre-trained
values implies that we know with high confidence before even fine-tuning which weights will be
pruned as the weights with the smallest absolute value at pre-training will likely stay small and be
pruned. In contrast, in movement pruning, the pre-trained weights do not have such an awareness of
the pruning decision since the selection is made during fine-tuning (moving away from 0), and both
low and high values can be pruned. We posit that this is critical for the success of the approach as it
is able to prune based on the task-specific data, not only the pre-trained value.
5 Experimental Setup
Transfer learning for NLP uses large pre-trained language models that are fine-tuned on target tasks
[Ruder et al., 2019, Devlin et al., 2019, Radford et al., 2019, Liu et al., 2019]. We experiment with task-
specific pruning ofBERT-base-uncased, a pre-trained model that contains roughly 84M parameters.
We freeze the embedding modules and fine-tune the transformer layers and the task-specific head.
We perform experiments on three monolingual (English) tasks, which are common benchmarks for
the recent progress in transfer learning for NLP: question answering (SQuAD v1.1) [Rajpurkar et al.,
2016], natural language inference (MNLI) [Williams et al., 2018], and sentence similarity (QQP)
[Iyer et al., 2017]. The datasets respectively contain 88K, 393K, and 364K training examples. SQuAD
is formulated as a span extraction task, MNLI and QQP are paired sentence classification tasks.
For a given task, we fine-tune the pre-trained model for the same number of updates (between 6
and 10 epochs) across pruning methods 2 . We follow Zhu and Gupta [2018] and use a cubic sparsity
scheduling for Magnitude Pruning (MaP), Movement Pruning (MvP), and Soft Movement Pruning
(SMvP). Adding a few steps of cool-down at the end of pruning empirically improves the performance
especially in high sparsity regimes. The schedule for v is:

    v^{(t)} = v_i                                                     for 0 ≤ t < t_i
    v^{(t)} = v_f + (v_i − v_f) (1 − (t − t_i)/(T − t_i − t_f))^3      for t_i ≤ t < T − t_f        (5)
    v^{(t)} = v_f                                                     otherwise

where t_f is the number of cool-down steps.
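A small helper reproducing this schedule numerically, under the reading of Eq. (5) above (argument names are ours, and the middle branch is assumed to interpolate over T − t_i − t_f steps):

def cubic_sparsity_schedule(step, v_initial, v_final, t_start, t_total, t_cooldown):
    """Cubic schedule for the kept-weights ratio v, with a final cool-down phase."""
    if step < t_start:
        return v_initial
    if step >= t_total - t_cooldown:
        return v_final
    # Cubic interpolation between v_initial and v_final.
    progress = (step - t_start) / (t_total - t_start - t_cooldown)
    return v_final + (v_initial - v_final) * (1.0 - progress) ** 3


# Example: keep 100% of the weights for 2k steps, decay to 10%, hold for the last 1k steps.
values = [cubic_sparsity_schedule(t, 1.0, 0.10, t_start=2000, t_total=10000, t_cooldown=1000)
          for t in range(0, 10001, 2000)]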
We compare our results against several state-of-the-art pruning baselines: Reweighted Proximal
Pruning (RPP) [Guo et al., 2019] combines re-weightedL1 minimization and Proximal Projection
[Parikh and Boyd, 2014] to perform unstructured pruning. LayerDrop [Fan et al., 2020a] leverages
structured dropout to prune models at test time. For RPP and LayerDrop, we report results from
authors. We also compare our method against the mini-BERT models, a collection of smaller BERT
models with varying hyper-parameters [Turc et al., 2019].
6 Results
Figure 2 displays the results for the main pruning methods at different levels of pruning on each
dataset. First, we observe the consistency of the comparison between magnitude and movement
pruning: at low sparsity (more than 70% of remaining weights), magnitude pruning outperforms
all methods with little or no loss with respect to the dense model whereas the performance of
movement pruning methods quickly decreases even for low sparsity levels. However, magnitude
pruning performs poorly with high sparsity, and the performance drops extremely quickly. In contrast,
first-order methods show strong performances with less than 15% of remaining weights.
Table 2 shows the specific model scores for different methods at high sparsity levels. Magnitude
pruning on SQuAD achieves 54.5 F1 with 3% of the weights compared to 73.6 F1 withL0 regular-
ization, 76.3 F1 for movement pruning, and 79.9 F1 with soft movement pruning. These experiments
indicate that in high sparsity regimes, importance scores derived from the movement accumulated
during fine-tuning induce significantly better pruned models compared to absolute values.
Next, we compare the difference in performance between first-order methods. We see that straight-
through based hard movement pruning (MvP) is comparable withL0 regularization (with a significant
gap in favor of movement pruning on QQP). Soft movement pruning (SMvP) consistently outperforms
2 Preliminary experiments showed that increasing the number of pruning steps tended to improve the end
performance
Figure 2: Comparisons between different pruning methods in high sparsity regimes. Soft movement
pruning consistently outperforms other methods in high sparsity regimes. We plot the
performance of the standard fine-tuned BERT along with 95% of its performance.
Table 2: Performance at high sparsity levels. (Soft) movement pruning outperforms current
state-of-the-art pruning methods at different high sparsity levels.

Task (metric)             BERT base fine-tuned   Remaining Weights (%)   MaP         L0 Regu     MvP         Soft MvP
SQuAD - Dev (EM/F1)       80.4/88.1              10%                     67.7/78.5   69.9/80.1   71.9/81.7   71.3/81.5
                                                 3%                      40.1/54.5   61.6/73.6   65.2/76.3   69.6/79.9
MNLI - Dev (acc/MM acc)   84.5/84.9              10%                     77.8/79.0   77.9/78.5   79.3/79.5   80.7/81.2
                                                 3%                      68.9/69.8   75.2/75.6   76.1/76.7   79.0/79.7
QQP - Dev (acc/F1)        91.4/88.4              10%                     78.8/75.1   87.6/81.9   89.1/85.5   90.2/86.8
                                                 3%                      72.1/58.4   86.5/81.1   85.6/81.0   89.2/85.5
hard movement pruning and L0 regularization by a strong margin and yields the strongest performance
among all pruning methods in high sparsity regimes. These comparisons support the fact that even
though movement pruning (and its relaxed version, soft movement pruning) is simpler than L0 regularization,
it still yields stronger performance for the same compute budget.
Finally, movement pruning and soft movement pruning compare favorably to the other baselines, ex-
cept for QQP where RPP is on par with soft movement pruning. Movement pruning also outperforms
the fine-tuned mini-BERT models. This is coherent with [Li et al., 2020]: it is both more efficient and
more effective to train a large model and compress it afterward than training a smaller model from
scratch. We do note though that current hardware does not support optimized inference for sparse
models: from an inference speed perspective, it might often be desirable to use a small dense model
such as mini-BERT over a sparse alternative of the same size.
Distillation further boosts performance. Following previous work, we can further leverage knowledge
distillation [Bucila et al., 2006, Hinton et al., 2014] to boost performance for free in the pruned
domain [Jiao et al., 2019, Sanh et al., 2019], using our baseline fine-tuned BERT-base model as
teacher. The training objective is a linear combination of the training loss and a knowledge distillation
loss on the output distributions. Figure 3 shows the results on SQuAD, MNLI, and QQP for the three
pruning methods boosted with distillation. Overall, we observe that the relative comparisons of the
pruning methods remain unchanged while the performances are strictly increased. Table 3 shows for
instance that on SQuAD, movement pruning at 10% goes from 81.7 F1 to 84.3 F1. When combined
with distillation, soft movement pruning yields the strongest performances across all pruning methods
and studied datasets: it reaches 95% of BERT-base with only a fraction of the weights in the encoder
(5% on SQuAD and MNLI).

Figure 3: Comparisons between different pruning methods augmented with distillation. Distillation
improves the performance across all pruning methods and sparsity levels.

Table 3: Distillation-augmented performances for selected high sparsity levels. All pruning methods
benefit from the distillation signal, further enhancing the performance vs. model size trade-off.

Task (metric)             BERT base fine-tuned   Remaining Weights (%)   MaP         L0 Regu     MvP         Soft MvP
SQuAD - Dev (EM/F1)       80.4/88.1              10%                     70.2/80.1   72.4/81.9   75.6/84.3   76.6/84.9
                                                 3%                      45.5/59.6   65.5/75.9   67.5/78.0   72.9/82.4
MNLI - Dev (acc/MM acc)   84.5/84.9              10%                     78.3/79.3   78.7/79.8   80.1/80.4   81.2/81.8
                                                 3%                      69.4/70.6   76.2/76.5   76.5/77.4   79.6/80.2
QQP - Dev (acc/F1)        91.4/88.4              10%                     79.8/65.0   88.1/82.8   89.7/86.2   90.5/87.1
                                                 3%                      72.4/57.8   87.1/82.0   86.1/81.5   89.3/85.6

Figure 4: (a) Distribution of remaining weights. (b) Scores and weights learned by movement pruning.
Magnitude pruning and movement pruning lead to pruned models with radically different weight distributions.
7 Analysis
Movement pruning is adaptive. Figure 4a compares the distribution of the remaining weights for
the same matrix of a model pruned at the same sparsity using magnitude and movement pruning. We
observe that by definition, magnitude pruning removes all the weights that are close to zero, ending
up with two clusters. In contrast, movement pruning leads to a smoother distribution, which covers
the whole interval except for values close to 0.
Figure 4b displays each individual weight against its associated importance score in movement
pruning. We plot pruned weights in grey. We observe that movement pruning induces no simple
relationship between the scores and the weights. Both weights with high absolute value or low
absolute value can be considered important. However, high scores are systematically associated with
non-zero weights (and thus the “v-shape”). This is coherent with the interpretation we gave to the
scores (section 4): a high scoreSindicates that during fine-tuning, the associated weight moved away
from 0 and is thus non-null.
Local and global masks perform similarly. We study the influence of the locality of the pruning
decision. While local Top-v selects the v% most important weights matrix by matrix, global Top-v
uncovers non-uniform sparsity patterns in the network by selecting the v% most important weights in
the whole network.

Figure 5: Comparison of local and global selections of weights on SQuAD at different sparsity levels.
For magnitude and movement pruning, local and global Top-v perform similarly at all levels of sparsity.

Figure 6: Remaining weights per layer in the Transformer. Global magnitude pruning tends to prune
layers uniformly. Global 1st order methods allocate the weight to the lower layers while heavily
pruning the highest layers.

Previous work has shown that a non-uniform sparsity across layers is crucial to
the performance in high sparsity regimes [He et al., 2018]. In particular, Mallya and Lazebnik [2018]
found that the sparsity tends to increase with the depth of the network layer.
Figure 5 compares the performance of local selection (matrix by matrix) against global selection
(all the matrices) for magnitude pruning and movement pruning. Despite being able to find a
global sparsity structure, we found that global did not significantly outperform local, except in high
sparsity regimes (2.3 F1 points of difference with 3% of remaining weights for movement pruning).
Even though the distillation signal boosts the performance of pruned models, the end performance
difference between local and global selections remains marginal.
Figure 6 shows the remaining weights percentage obtained per layer when the model is pruned until
10% with global pruning methods. Global weight pruning is able to allocate sparsity non-uniformly
through the network, and it has been shown to be crucial for the performance in high sparsity regimes
[He et al., 2018]. We notice that except for global magnitude pruning, all the global pruning methods
tend to allocate a significant part of the weights to the lowest layers while heavily pruning in the
highest layers. Global magnitude pruning tends to prune similarly to local magnitude pruning, i.e.,
uniformly across layers.
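The distinction between local and global selection boils down to where the Top-v threshold is computed. A short illustrative sketch (function names are ours, not from the paper):

import torch


def local_topv_masks(score_matrices, keep_ratio):
    """Local Top-v: keep the top `keep_ratio` fraction of scores inside each matrix."""
    masks = []
    for s in score_matrices:
        k = max(1, int(keep_ratio * s.numel()))
        threshold = torch.topk(s.flatten(), k).values.min()
        masks.append((s >= threshold).float())
    return masks


def global_topv_masks(score_matrices, keep_ratio):
    """Global Top-v: one threshold over all matrices, so per-matrix sparsity can differ."""
    all_scores = torch.cat([s.flatten() for s in score_matrices])
    k = max(1, int(keep_ratio * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    return [(s >= threshold).float() for s in score_matrices]


scores = [torch.randn(64, 64), torch.randn(64, 256)]
local = local_topv_masks(scores, keep_ratio=0.10)     # ~10% kept in every matrix
global_ = global_topv_masks(scores, keep_ratio=0.10)  # 10% kept overall, unevenly distributed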
8 Conclusion
We consider the case of pruning of pretrained models for task-specific fine-tuning and compare
zeroth- and first-order pruning methods. We show that a simple method for weight pruning based on
straight-through gradients is effective for this task and that it adapts using a first-order importance
score. We apply this movement pruning to a transformer-based architecture and empirically show that
our method consistently yields strong improvements over existing methods in high-sparsity regimes.
The analysis demonstrates how this approach adapts to the fine-tuning regime in a way that magnitude
pruning cannot. In future work, it would also be interesting to leverage group-sparsity inducing
penalties [Bach et al., 2011] to remove entire columns or filters. In this setup, we would associate a
score to a group of weights (a column or a row for instance). In the transformer architecture, it would
give a systematic way to perform feature selection and remove entire columns of the embedding
matrix.
References
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer.ArXiv, abs/1910.10683, 2019.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep
learning in nlp. InACL, 2019.
Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for
efficient neural network. InNIPS, 2015.
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. InICLR, 2016.
Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. InNIPS,
2016.
Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks.ArXiv,
abs/1902.09574, 2019.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket
hypothesis at scale.ArXiv, abs/1903.01611, 2019.
Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients
through stochastic neurons for conditional computation.ArXiv, abs/1308.3432, 2013.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. InNAACL, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. InNIPS, 2017.
Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through
l0 regularization. InICLR, 2017.
Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for
sentence understanding through inference. InNAACL, 2018.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for
machine comprehension of text. InEMNLP, 2016.
Arun Mallya and Svetlana Lazebnik. Piggyback: Adding multiple tasks to a single, fixed network by
learning to mask.ArXiv, abs/1801.06519, 2018.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari.
What's hidden in a randomly weighted neural network? In CVPR, 2020.
Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. InNIPS, 1989.
Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon: Extensions and
performance comparisons. InNIPS, 1993.
Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with
dense networks and fisher pruning.ArXiv, abs/1801.05787, 2018.
Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Ji Liu, and Jungong Han. Global sparse
momentum sgd for pruning very deep neural networks. InNeurIPS, 2019.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
bert: smaller, faster, cheaper and lighter. InNeurIPS EMC2 Workshop, 2019.
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling
task-specific knowledge from bert into simple neural networks.ArXiv, abs/1903.12136, 2019.
Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with
structured dropout. InICLR, 2020a.
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? InNeurIPS,
2019.
Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and
multiple languages: lottery tickets in rl and nlp. InICLR, 2020.
Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou,
and Armand Joulin. Training with quantization noise for extreme model compression.ArXiv,
abs/2004.07320, 2020b.
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert.ArXiv,
abs/1910.06188, 2019.
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional
networks using vector quantization.ArXiv, abs/1412.6115, 2014.
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gon-
zalez. Train large, then compress: Rethinking model size for efficient training and inference of
transformers.ArXiv, abs/2002.11794, 2020.
Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model
compression. InICLR, 2018.
Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Compressing bert: Studying the effects of
weight pruning on transfer learning.ArXiv, abs/2002.08307, 2020.
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in
natural language processing. InNAACL, 2019.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach.ArXiv, abs/1907.11692, 2019.
Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First quora dataset release: Question pairs, 2017.
URLhttps://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lian Lin, and Yanzhi Wang. Reweighted proximal
pruning for large-scale language representation.ArXiv, abs/1909.12486, 2019.
Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1:127–239, 2014.
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better:
The impact of student initialization on knowledge distillation.ArXiv, abs/1908.08962, 2019.
Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. InKDD, 2006.
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
InNIPS, 2014.
Xiaoqi Jiao, Y. Yin, Lifeng Shang, Xin Jiang, Xusong Chen, Linlin Li, Fang Wang, and Qun Liu.
Tinybert: Distilling bert for natural language understanding.ArXiv, abs/1909.10351, 2019.
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model
compression and acceleration on mobile devices. InECCV, 2018.
Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity
through convex optimization.Statistical Science, 27, 09 2011. doi: 10.1214/12-STS394.
A Appendices
A.1 Guarantees on the decrease of the training loss
As the scores are updated, the relative order of the importances is likely shuffled, and some connections
will be replaced by more important ones. Under certain conditions, we are able to formally prove that
as these replacements happen, the training loss is guaranteed to decrease. Our proof is adapted from
[Ramanujan et al., 2020] to consider the case of a fine-tunable W.
We suppose that (a) the training loss L is smooth and admits a first-order Taylor development
everywhere it is defined and (b) the learning rate of W (α_W > 0) is small. We define the TopK
function as the analog of the Top-v function, where k is an integer instead of a proportion. We first
consider the case where k = 1 in the TopK masking, meaning that only one connection is remaining
(and the other weights are deactivated/masked). Let's denote W_{i,j} this sole remaining connection at
step t. Following Eq (1), it means that ∀ 1 ≤ u,v ≤ n, S_{u,v}^{(t)} ≤ S_{i,j}^{(t)}.
We suppose that at step t+1, connections are swapped and the only remaining connection at step
t+1 is (k,l). We have:

    At t:     ∀ 1 ≤ u,v ≤ n,  S_{u,v}^{(t)} ≤ S_{i,j}^{(t)}
    At t+1:   ∀ 1 ≤ u,v ≤ n,  S_{u,v}^{(t+1)} ≤ S_{k,l}^{(t+1)}        (6)

Eq (6) gives the following inequality: S_{k,l}^{(t+1)} − S_{k,l}^{(t)} ≥ S_{i,j}^{(t+1)} − S_{i,j}^{(t)}. After re-injecting the gradient
update in Eq (2), we have:

    −α_S (∂L/∂a_k) W_{k,l}^{(t)} x_l ≥ −α_S (∂L/∂a_i) W_{i,j}^{(t)} x_j        (7)
Moreover, the conditions in Eq (6) lead to the following inferences: a_i^{(t)} = W_{i,j}^{(t)} x_j and
a_k^{(t+1)} = W_{k,l}^{(t+1)} x_l.
Since α_W is small, ||(a_i^{(t+1)}, a_k^{(t+1)}) − (a_i^{(t)}, a_k^{(t)})||_2 is also small. Because the training loss L is
smooth, we can write the 1st order Taylor development of L in point (a_i^{(t)}, a_k^{(t)}):

    L(a_i^{(t+1)}, a_k^{(t+1)}) − L(a_i^{(t)}, a_k^{(t)})
      ≈ (∂L/∂a_k)(a_k^{(t+1)} − a_k^{(t)}) + (∂L/∂a_i)(a_i^{(t+1)} − a_i^{(t)})
      = (∂L/∂a_k) W_{k,l}^{(t+1)} x_l − (∂L/∂a_i) W_{i,j}^{(t)} x_j
      = ((∂L/∂a_k) W_{k,l}^{(t+1)} x_l − (∂L/∂a_k) W_{k,l}^{(t)} x_l) + ((∂L/∂a_k) W_{k,l}^{(t)} x_l − (∂L/∂a_i) W_{i,j}^{(t)} x_j)        (8)
      = (∂L/∂a_k) x_l (W_{k,l}^{(t+1)} − W_{k,l}^{(t)}) + ((∂L/∂a_k) W_{k,l}^{(t)} x_l − (∂L/∂a_i) W_{i,j}^{(t)} x_j)
The first term is null because of inequalities (6): connection (k,l) is masked at step t, so its weight
receives no gradient and W_{k,l}^{(t+1)} = W_{k,l}^{(t)}. The second term is negative because of inequality
(7). Thus L(a_i^{(t+1)}, a_k^{(t+1)}) ≤ L(a_i^{(t)}, a_k^{(t)}): when connection (k,l) becomes more important than
(i,j), the connections are swapped and the training loss decreases between steps t and t+1.
Similarly, we can generalize the proof to a set E = {(a_i, b_i), (c_i, d_i), i ≤ N} of N swapping
connections.
We note that this proof is not specific to the TopK masking function. In fact, we can extend the proof
using the Threshold masking function M := (S ≥ τ) [Mallya and Lazebnik, 2018]. Inequalities
(6) are still valid and the proof stays unchanged.
Last, we note that these guarantees do not hold if we consider the absolute value of the scores |S_{i,j}| (as
is done in Ding et al. [2019] for instance). We prove it by contradiction. If it were the case, it
would also be true in one specific case: the negative threshold masking function (M := (S < τ) where
τ < 0).
We suppose that at step t+1, the only remaining connection (i,j) is replaced by (k,l):

    At t:     ∀ 1 ≤ u,v ≤ n,  S_{i,j}^{(t)} ≤ S_{u,v}^{(t)}
    At t+1:   ∀ 1 ≤ u,v ≤ n,  S_{k,l}^{(t+1)} ≤ S_{u,v}^{(t+1)}        (9)
The inequality on the gradient update becomes: −α_S (∂L/∂a_k) W_{k,l}^{(t)} x_l < −α_S (∂L/∂a_i) W_{i,j}^{(t)} x_j, and
following the same development as in Eq (8), we have L(a_i^{(t+1)}, a_k^{(t+1)}) − L(a_i^{(t)}, a_k^{(t)}) ≥ 0: the loss increases.
We proved by contradiction that the guarantees on the decrease of the loss do not hold if we consider
the absolute value of the score as a proxy for importance.

View File

@ -0,0 +1,150 @@
Network Pruning
As one of the earliest works in network pruning, Yann LeCun's Optimal Brain
Damage (OBD) paper has been cited in many of the papers.
Some research focuses on module network designs. "These models, such as
SqueezeNet , MobileNet  and Shufflenet, are basically made up of low resolutions
convolution with lesser parameters and better performance."
Many recent papers I've read emphasize structured pruning (or sparsifying) as a
compression and regularization method, as opposed to other techniques such as
non-structured pruning (weight sparsifying and connection pruning), low rank
approximation and vector quantization (references to these approaches can be
found in the related work sections of the following papers). 
Difference between structured and non-structured pruning:
"Non-structured pruning aims to remove single parameters that have little
influence on the accuracy of networks". For example, L1-norm regularization on
weights is noted as non-structured pruning, since it's basically a weight
sparsifying method, i.e. it removes single parameters.
The term 'structure' refers to a structured unit in the network. So instead of
pruning individual weights or connections, structured pruning targets neurons,
filters, channels, layers etc. But the general implementation idea is the same as
penalizing individual weights: introducing a regularization term (mostly in the
form of L1-norm) to the loss function to penalize (sparsify) structures.
I focused on structured pruning and read through the following papers:
1. Structured Pruning of Convolutional Neural Networks via L1
Regularization (August 2019)
"(...) network pruning is useful to remove redundant parameters, filters,
channels or neurons, and address the over-fitting issue."
Provides a good review of previous work on non-structured and structured
pruning.
"This study presents a scheme to prune filters or neurons of fully-connected
layers based on L1 regularization to zero out the weights of some filters or
neurons."
Didn't quite understand the method and implementation. There are two key
elements: mask and threshold. "(...) the problem of zeroing out the values of
some filters can be transformed to zero some mask." || "Though the proposed
method introduces mask, the network topology will be preserved because the mask can be absorbed into weight." || "Here the mask value cannot be
completely zeroed in practical application, because the objective function (7) is
non-convex and the global optimal solution may not be obtained. A strategy is
adopted in the proposed method to solve this problem. If the order of
magnitude of the mask value is small enough, it can be considered almost as
zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...)
The average value of the product of the mask and the weight is used to
determine whether the mask is exactly zero or not."
From what I understand they use L1 norm in the loss function to penalize
useless filters through penalizing masks. And a threshold value is introduced
to determine when the mask is small enough to be considered zero. 
They test on MNIST (models: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet-
32)
2. Learning Efficient Convolutional Networks through Network Slimming (August
2017) + Git repo
"Our approach imposes L1 regular- ization on the scaling factors in batch
normalization (BN) layers, thus it is easy to implement without introducing any
change to existing CNN architectures. Pushing the values of BN scaling factors
towards zero with L1 regularization enables us to identify insignificant channels
(or neurons), as each scaling factor corresponds to a specific con- volutional
channel (or a neuron in a fully-connected layer)."
They provide a good insight on advantages and disadvantages of other
computation reduction methods such as low rank approximation, vector
quantization etc. 
I believe here they use the word 'channel' to refer to filters (?).
"Our idea is introducing a scaling factor γ for each channel, which is multiplied
to the output of that channel. Then we jointly train the network weights and
these scaling factors, with sparsity regularization imposed on the latter. Finally
we prune those channels with small factors, and fine-tune the pruned network.
" --> so instead of 'mask' they use the 'scaling factor' and impose regularization
on that, but the idea is very similar.
"The way BN normalizes the activations motivates us to design a simple and
efficient method to incorporates the channel-wise scaling factors. Particularly,
BN layer normalizes the internal activa- tions using mini-batch statistics." || "
(...) we can directly leverage the γ parameters in BN layers as the scaling factors
we need for network slim- ming. It has the great advantage of introducing no
overhead to the network." They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40),
ImageNet (model: VGG-A) and MNIST (model: Lenet)
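A toy PyTorch sketch of how I understand the recipe (my own illustration, not the authors' code): add an L1 penalty on every BatchNorm scaling factor (gamma) to the task loss, then threshold the small factors after training to decide which channels to prune.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
)

def bn_scaling_l1(model, strength=1e-4):
    # L1 penalty on the BatchNorm scaling factors (gamma); pushing a gamma to zero
    # effectively switches off the corresponding channel.
    return strength * sum(m.weight.abs().sum()
                          for m in model.modules() if isinstance(m, nn.BatchNorm2d))

x = torch.randn(2, 3, 8, 8)
task_loss = model(x).mean()                 # stand-in for the real task loss
loss = task_loss + bn_scaling_l1(model)
loss.backward()

# After training, channels whose |gamma| falls below a threshold become pruning candidates.
with torch.no_grad():
    gammas = torch.cat([m.weight.abs() for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, 0.5)          # e.g. target pruning 50% of the channels
    num_prunable = int((gammas < threshold).sum())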
3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo
" (...) we propose Structured Sparsity Learning (SSL) method to directly learn a
compressed structure of deep CNNs by group Lasso regularization during the
training. SSL is a generic regularization to adaptively adjust multiple structures
in DNN, including structures of filters, channels, and filter shapes within each
layer, and structure of depth beyond the layers." || " (...) offering not only well-
regularized big models with improved accuracy but greatly accelerated
computation."
 "Here W represents the collection of all weights in the DNN; ED(W) is the loss
on data; R(·) is non-structured regularization applying on every weight, e.g., L2-
norm; and Rg(·) is the structured sparsity regularization on each layer. Because
Group Lasso can effectively zero out all weights in some groups [14][15], we
adopt it in our SSL. The regularization of group Lasso on a set of weights w can
be represented as R_g(w) = Σ_{g=1}^{G} ||w^(g)||_2 (the sum of the Euclidean norms of the weight
groups), where w(g) is a group of partial weights in w and G is the total number of
groups. " || "In SSL, the learned “structure” is decided by the way of splitting
groups of w(g). We investigate and formulate the filter-wise, channel-wise,
shape-wise, and depth-wise structured sparsity (...)"
They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet-
20) and ImageNet (model:AlexNet)
The authors also provide a visualization of filters after pruning, showing that
only important detectors of patterns remain after pruning.
In conclusions: "Moreover, a variant of SSL can be performed as structure
regularization to improve classification accuracy of state-of-the-art DNNs."
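A minimal sketch of a filter-wise group Lasso term as I understand it from the paper (my own toy code; the layer, input, and penalty strength are made up):

import torch
import torch.nn as nn

def filterwise_group_lasso(conv, strength=1e-4):
    # Group Lasso over output filters: sum of the L2 norms of each filter's weights.
    # Driving a whole group's norm to zero removes that filter, which is the
    # "structured" part of structured sparsity.
    group_norms = conv.weight.flatten(start_dim=1).norm(p=2, dim=1)   # one norm per output filter
    return strength * group_norms.sum()

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(4, 16, 8, 8)
data_loss = conv(x).pow(2).mean()                  # stand-in for E_D(W)
loss = data_loss + filterwise_group_lasso(conv)    # L = E_D(W) + R_g(W)
loss.backward()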
4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015)
"After an initial training phase, we remove all connections whose weight is
lower than a threshold. This pruning converts a dense, fully-connected layer to
a sparse layer." || "We then retrain the sparse network so the remaining
connections can compensate for the connections that have been removed. The
phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network
connectivity in addition to the weights (...)"
Although the description above implies the pruning was done only for FC
layers, they also do pruning on convolutional layers - although they don't
provide much detail on this in the methods. But there's this statement when
they explain retraining: "(...) we fix the parameters for CONV layers and only
retrain the FC layers after pruning the FC layers, and vice versa.". The results
section also shows that convolutional layer connections were also
pruned on the tested models.
They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and
ImageNet (models: AlexNet, VGG-16)
The authors provide a visualization of the sparsity patterns of neurons after
pruning (for an FC layer) which shows that pruning can detect visual attention
regions.
The method used in this paper targets individual parameters (weights) to
prune. So, technically this should be considered as a non-structured pruning
method. However, the reason I think this is referenced as a structured pruning
method is that if all connections of a neuron are pruned (i.e. all input and output
weights were below the threshold), the neuron itself will be removed from the
network:  "After pruning connections, neurons with zero input connections or
zero output connections may be safely pruned."
SIDENOTE: They touch on the use of global average pooling instead of fully
connected layers in CNNs: "There have been other attempts to reduce the
number of parameters of neural networks by replacing the fully connected
layer with global average pooling."
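A toy sketch of that two-level effect (my own illustration, with made-up sizes and threshold): prune individual weights below a magnitude threshold, then check which neurons end up with no surviving input or output connections.

import torch

w_in = torch.randn(100, 50)     # weights into a hidden layer of 100 units
w_out = torch.randn(10, 100)    # weights out of that hidden layer
threshold = 0.8

mask_in = (w_in.abs() >= threshold).float()
mask_out = (w_out.abs() >= threshold).float()
w_in_pruned, w_out_pruned = w_in * mask_in, w_out * mask_out

# A hidden unit with no surviving input or no surviving output connection can be removed.
dead = (mask_in.sum(dim=1) == 0) | (mask_out.sum(dim=0) == 0)
print(f"{int(dead.sum())} of 100 hidden units can be safely removed")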
5. Many more can be picked from the references of these papers.
There's a paper on Bayesian compression for Deep Learning from 2017. Their
hypothesis is: "By employing sparsity inducing priors for hidden units (and not
individual weights) we can prune neurons including all their ingoing and outgoing
weights." However, the method is mathematically heavy and the related work
references are quite old (1990s, 2000s).

File diff suppressed because it is too large Load Diff

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@ -0,0 +1,535 @@
The 4 Research Techniques to Train Deep Neural Network Models More Efficiently
James Le · Oct 29, 2019 · 9 min read
Deep learning and unsupervised feature learning have shown
great promise in many practical applications. State-of-the-art
performance has been reported in several domains, ranging
from speech recognition and image recognition to text
processing and beyond.
It's also been observed that increasing the scale of deep
learning—with respect to numbers of training examples, model
parameters, or both—can drastically improve accuracy. These
results have led to a surge of interest in scaling up the training
and inference algorithms used for these models and in
improving optimization techniques for both.
The use of GPUs is a significant advance in recent years that
makes the training of modestly-sized deep networks practical.
A known limitation of the GPU approach is that the training
speed-up is small when the model doesn't fit in a GPU's
memory (typically less than 6 gigabytes).
To use a GPU effectively, researchers often reduce the size of
the dataset or parameters so that CPU-to-GPU transfers are not
a significant bottleneck. While data and parameter reduction
work well for small problems (e.g. acoustic modeling for speech
recognition), they are less attractive for problems with a large
number of examples and dimensions (e.g., high-resolution
images).
In the previous post, we talked about 5 different algorithms for efficient deep learning inference.
In this article, we'll discuss the upper right part of the quadrant on the left. What are the best
research techniques to train deep neural networks more efficiently?
1 — Parallelization Training
Let's start with parallelization. As the figure below shows, the
number of transistors keeps increasing over the years. But
single-threaded performance and frequency are plateauing in
recent years. Interestingly, the number of cores is increasing.
So what we really need to know is how to parallelize the
problem to take advantage of parallel processing. There are a
lot of opportunities to do that in deep neural networks.
For example, we can do data parallelism: feeding 2 images
into the same model and running them at the same time. This
does not affect latency for any single input. It doesn't make it
shorter, but it makes the batch size larger. It also requires
coordinated weight updates during training.
For example, in Jeff Dean's paper “Large Scale Distributed Deep
Networks,” there's a parameter server (as a master) and a
couple of model workers (as slaves) running their own pieces of
training data and updating the gradient to the master.
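As a rough present-day illustration of data parallelism (a sketch, not the setup described in that paper), PyTorch's nn.DataParallel replicates a model across the visible GPUs and splits each batch between the replicas:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

# Replicate the model on every visible GPU; each replica gets a slice of the batch and the
# gradients are gathered back on the default device. Falls back to a single device otherwise.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

images = torch.randn(32, 3, 64, 64, device=device)          # this batch is split across replicas
labels = torch.randint(0, 10, (32,), device=device)
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()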
Another idea is model parallelism — splitting up the model
and distributing each part to different processors or different
threads. For example, imagine we want to run convolution in
the image below by doing a 6-dimension “for” loop. What we
can do is cut the input image by 2x2 blocks, so that each
thread/processor handles 1/4 of the image. Also, we can
parallelize the convolutional layers by the output or input
feature map regions, and the fully-connected layers by the
output activation.
...
2 — Mixed Precision Training
Larger models usually require more compute and memory
resources to train. These requirements can be lowered by using
reduced precision representation and arithmetic.
Performance (speed) of any program, including neural network
training and inference, is limited by one of three factors:
arithmetic bandwidth, memory bandwidth, or latency.
Reduced precision addresses two of these limiters. Memory
bandwidth pressure is lowered by using fewer bits to store the
same number of values. Arithmetic time can also be lowered on
processors that offer higher throughput for reduced precision
math. For example, half-precision math throughput in recent
GPUs is 2× to 8× higher than for single-precision. In addition
to speed improvements, reduced precision formats also reduce
the amount of memory required for training.
Modern deep learning training systems use a single-precision
(FP32) format. In their paper “Mixed Precision Training,”
researchers from NVIDIA and Baidu addressed training with
reduced precision while maintaining model accuracy.
Specifically, they trained various neural networks using the
IEEE half-precision format (FP16). Since FP16 format has a
narrower dynamic range than FP32, they introduced three
techniques to prevent model accuracy loss: maintaining a
master copy of weights in FP32, loss-scaling that minimizes
gradient values becoming zeros, and FP16 arithmetic with
accumulation in FP32.
Using these techniques, they
demonstrated that a wide
variety of network
architectures and
applications can be trained
to match the accuracy of
FP32 training. Experimental
results include convolutional
and recurrent network
architectures, trained for classification, regression, and
generative tasks.
Applications include image classification, image generation,
object detection, language modeling, machine translation, and
speech recognition. The proposed methodology requires no
changes to models or training hyperparameters.
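Those three ingredients map closely onto the automatic mixed precision utilities in current frameworks. The following is a hedged PyTorch sketch (not NVIDIA/Baidu's original setup; the model and hyperparameters are placeholders):

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(512, 10).to(device)                   # parameters stay in FP32 (the "master copy")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)     # loss scaling keeps tiny gradients from flushing to zero

x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):      # FP16 math where it is safe, FP32 elsewhere
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                              # unscales gradients, then updates the FP32 weights
    scaler.update()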
3 — Model Distillation
Model distillation refers to the idea of model compression by
teaching a smaller network exactly what to do, step-by-step,
using a bigger, already-trained network. The soft labels refer
to the output feature maps by the bigger network after every
convolution layer. The smaller network is then trained to learn
the exact behavior of the bigger network by trying to replicate
its outputs at every level (not just the final loss).
The method was first proposed by Bucila et al., 2006 and
generalized by Hinton et al., 2015. In distillation, knowledge is
transferred from the teacher model to the student by
minimizing a loss function in which the target is the
distribution of class probabilities predicted by the teacher
model. That is — the output of a softmax function on the
teacher model's logits.
So how do teacher-student
networks exactly work?
The highly-complex teacher
network is first trained
separately using the
complete dataset. This step
requires high computational
performance and thus can
only be done offline (on
high-performing GPUs).
While designing a student network, correspondence needs
to be established between intermediate outputs of the
student network and the teacher network. This
correspondence can involve directly passing the output of a
layer in the teacher network to the student network, or
performing some data augmentation before passing it to the
student network.
Next, the data are forward-passed through the teacher
network to get all intermediate outputs, and then data
augmentation (if any) is applied to the same.
Finally, the outputs from the teacher network are back-
propagated through the student network so that the student
network can learn to replicate the behavior of the teacher
network.
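A minimal sketch of the soft-target part of this training objective (the temperature and mixing weight are illustrative values, not taken from any specific paper):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Mix the usual supervised loss with a KL term on temperature-softened teacher outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                  # rescale so both terms keep comparable gradient magnitude
    return alpha * hard + (1.0 - alpha) * soft

student_logits = torch.randn(8, 5)
teacher_logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)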
...
4 — Dense-Sparse-Dense Training
The research paper “Dense-Sparse-Dense Training for Deep
Neural Networks” was published back in 2017 by researchers
from Stanford, NVIDIA, Baidu, and Facebook. Applying Dense-
Sparse-Dense (DSD) takes 3 sequential steps:
Dense: Normal neural net training…business as usual. It's
notable that even though DSD acts as a regularizer, the
usual regularization methods such as dropout and weight
regularization can be applied as well. The authors don't
mention batch normalization, but it would work as well.
Sparse: We regularize the
network by removing
connections with small
weights. From each layer in
the network, a percentage of
the layer's weights that are
closest to 0 in absolute value is selected to be pruned. This
means that they are set to 0 at each training iteration. It's
worth noting that the pruned weights are selected only
once, not at each SGD iteration. Eventually, the network
recovers the pruned weights' knowledge and condenses it in
the remaining ones. We train this sparse net until
convergence.
Dense: First, we re-enable the pruned weights from the
previous step. The net is again trained normally until
convergence. This step increases the capacity of the model.
It can use the recovered capacity to store new knowledge.
The authors note that the learning rate should be 1/10th of
the original. Since the model is already performing well, the
lower learning rate helps preserve the knowledge gained in
the previous step.
Removing pruning in the dense step allows the training to
escape saddle points to eventually reach a better minimum.
This lower minimum corresponds to improved training and
validation metrics.
Saddle points are areas in the multidimensional space of the
model that might not be a good solution but are hard to escape
from. The authors hypothesize that the lower minimum is
achieved because the sparsity in the network moves the
optimization problem to a lower-dimensional space. This space
is more robust to noise in the training data.
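A compressed sketch of the three DSD phases (layer sizes, pruning fraction, and function names are illustrative, not from the paper):

import torch
import torch.nn as nn

def make_dsd_masks(model, prune_fraction=0.3):
    """Sparse step: for each weight matrix, mask the fraction of weights closest to zero."""
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() < 2:                         # skip biases
                continue
            k = int(prune_fraction * p.numel())
            threshold = p.abs().flatten().kthvalue(k).values if k > 0 else -1.0
            masks[name] = (p.abs() > threshold).float()
    return masks

def apply_masks(model, masks):
    """Re-zero the pruned weights after every optimizer step during the Sparse phase."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
masks = make_dsd_masks(model, prune_fraction=0.3)   # Sparse: train, calling apply_masks() after each step
apply_masks(model, masks)
# Dense (re-dense) phase: stop applying the masks and keep training at ~1/10th of the learning rate.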
The authors tested DSD on image classification (CNN), caption
generation (RNN), and speech recognition (LSTM). The
proposed method improved accuracy across all three tasks. It's
quite remarkable that DSD works across domains.
DSD improved all CNN models tested — ResNet50, VGG,
and GoogLeNet. The improvement in absolute top-1
accuracy was respectively 1.12%, 4.31%, and 1.12%. This
corresponds to a relative improvement of 4.66%, 13.7%,
and 3.6%. These results are remarkable for such finely-
tuned models!
DSD was applied to
NeuralTalk, an amazing
model that generates a
description from an image.
To verify that the Dense-
Sparse-Dense method works
on an LSTM, the CNN part of
Neural Talk is frozen. Only
the LSTM layers are trained. Very high (80% deducted by
the validation set) pruning was applied at the Sparse step.
Still, this gives the Neural Talk BLEU score an average
relative improvement of 6.7%. It's fascinating that such a
minor adjustment produces this much improvement.
Applying DSD to speech recognition (Deep Speech 1)
achieves an average relative improvement of Word Error
Rate of 3.95%. On a similar but more advanced Deep
Speech 2 model Dense-Sparse-Dense is applied iteratively
two times. On the first iteration, pruning 50% of the
weights, then 25% of the weights are pruned. After these
two DSD iterations, the average relative improvement is
6.5%.
Conclusion
I hope that I've managed to explain these research techniques
for efficient training of deep neural networks in a transparent
way. Work on this post allowed me to grasp how novel and
clever these techniques are. A solid understanding of these
approaches will allow you to incorporate them into your model
training procedure when needed.
...

View File

@ -0,0 +1,678 @@
The State of Sparsity in Deep Neural Networks
Trevor Gale*¹† Erich Elsen*² Sara Hooker¹†
* Equal contribution. † This work was completed as part of the Google AI Residency. ¹ Google Brain ² DeepMind. Correspondence to: Trevor Gale <tgale@google.com>.
arXiv:1902.09574v1 [cs.LG] 25 Feb 2019

Abstract

We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. Additionally, we repeat the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.

1. Introduction

Deep neural networks achieve state-of-the-art performance in a variety of domains including image classification (He et al., 2016), machine translation (Vaswani et al., 2017), and text-to-speech (van den Oord et al., 2016; Kalchbrenner et al., 2018). While model quality has been shown to scale with model and dataset size (Hestness et al., 2017), the resources required to train and deploy large neural networks can be prohibitive. State-of-the-art models for tasks like image classification and machine translation commonly have tens of millions of parameters, and require billions of floating-point operations to make a prediction for a single input sample.

Sparsity has emerged as a leading approach to address these challenges. By sparsity, we refer to the property that a subset of the model parameters have a value of exactly zero². With zero valued weights, any multiplications (which dominate neural network computation) can be skipped, and models can be stored and transmitted compactly using sparse matrix formats. It has been shown empirically that deep neural networks can tolerate high levels of sparsity (Han et al., 2015; Narang et al., 2017; Ullrich et al., 2017), and this property has been leveraged to significantly reduce the cost associated with the deployment of deep neural networks, and to enable the deployment of state-of-the-art models in severely resource constrained environments (Theis et al., 2018; Kalchbrenner et al., 2018; Valin & Skoglund, 2018).

² The term sparsity is also commonly used to refer to the proportion of a neural network's weights that are zero valued. Higher sparsity corresponds to fewer weights, and smaller computational and storage requirements. We use the term in this way throughout this paper.

Over the past few years, numerous techniques for inducing sparsity have been proposed and the set of models and datasets used as benchmarks has grown too large to reasonably expect new approaches to explore them all. In addition to the lack of standardization in modeling tasks, the distribution of benchmarks tends to slant heavily towards convolutional architectures and computer vision tasks, and the tasks used to evaluate new techniques are frequently not representative of the scale and complexity of real-world tasks where model compression is most useful. These characteristics make it difficult to come away from the sparsity literature with a clear understanding of the relative merits of different approaches.

In addition to practical concerns around comparing techniques, multiple independent studies have recently proposed that the value of sparsification in neural networks has been misunderstood (Frankle & Carbin, 2018; Liu et al., 2018). While both papers suggest that sparsification can be viewed as a form of neural architecture search, they disagree on what is necessary to achieve this. Specifically, Liu et al.
(2018) re-train learned sparse topologies with a random weight initialization, whereas Frankle & Carbin (2018) posit that the exact random weight initialization used when the sparse architecture was learned is needed to match the test set performance of the model sparsified during optimization.

In this paper, we address these ambiguities to provide a strong foundation for future work on sparsity in neural networks. Our main contributions: (1) We perform a comprehensive evaluation of variational dropout (Molchanov et al., 2017), l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. To the best of our knowledge, we are the first to apply variational dropout and l0 regularization to models of this scale. While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget. (2) Through insights gained from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning. (3) We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with pruning as part of the optimization process. (4) We open-source our code, model checkpoints, and results of all hyperparameter settings to establish rigorous baselines for future work on model compression and sparsification³.

³ https://bit.ly/2ExE8Yj

2. Sparsity in Neural Networks

We briefly provide a non-exhaustive review of proposed approaches for inducing sparsity in deep neural networks.

Simple heuristics based on removing small magnitude weights have demonstrated high compression rates with minimal accuracy loss (Ström, 1997; Collins & Kohli, 2014; Han et al., 2015), and further refinement of the sparsification process for magnitude pruning techniques has increased achievable compression rates and greatly reduced computational complexity (Guo et al., 2016; Zhu & Gupta, 2017).

Many techniques grounded in Bayesian statistics and information theory have been proposed (Dai et al., 2018; Molchanov et al., 2017; Louizos et al., 2017b;a; Ullrich et al., 2017). These methods have achieved high compression rates while providing deep theoretical motivation and connections to classical sparsification and regularization techniques.

Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).

Reinforcement learning has also been applied to automatically prune weights and convolutional filters (Lin et al., 2017; He et al., 2018), and a number of techniques have been proposed that draw inspiration from biological phenomena, and derive from evolutionary algorithms and neuromorphic computing (Guo et al., 2016; Bellec et al., 2017; Mocanu et al., 2018).

A key feature of a sparsity inducing technique is if and how it imposes structure on the topology of sparse weights. While unstructured weight sparsity provides the most flexibility for the model, it is more difficult to map efficiently to parallel processors and has limited support in deep learning software packages. For these reasons, many techniques focus on removing whole neurons and convolutional filters, or impose block structure on the sparse weights (Liu et al., 2017; Luo et al., 2017; Gray et al., 2017). While this is practical, there is a trade-off between achievable compression levels for a given model quality and the level of structure imposed on the model weights. In this work, we focus on unstructured sparsity with the expectation that it upper bounds the compression-accuracy trade-off achievable with structured sparsity techniques.

3. Evaluating Sparsification Techniques at Scale

As a first step towards addressing the ambiguity in the sparsity literature, we rigorously evaluate magnitude-based pruning (Zhu & Gupta, 2017), sparse variational dropout (Molchanov et al., 2017), and l0 regularization (Louizos et al., 2017b) on two large-scale deep learning applications: ImageNet classification with ResNet-50 (He et al., 2016), and neural machine translation (NMT) with the Transformer on the WMT 2014 English-to-German dataset (Vaswani et al., 2017). For each model, we also benchmark a random weight pruning technique, representing the lower bound of compression-accuracy trade-off any method should be expected to achieve.

Here we briefly review the four techniques and introduce our experimental framework. We provide a more detailed overview of each technique in Appendix A.
3.1. Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. For our experiments, we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model_pruning library [4]. This technique allows masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user-specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach used by Han et al. (2015), while requiring less hyperparameter tuning relative to the technique proposed by Guo et al. (2016).

Table 1. Constant hyperparameters for all Transformer experiments. More details on the standard configuration for training the Transformer can be found in Vaswani et al. (2017).

    Hyperparameter          Value
    dataset                 translate_wmt_ende_packed
    training iterations     500000
    batch size              2048 tokens
    learning rate schedule  standard transformer_base
    optimizer               Adam
    sparsity range          50% - 98%
    beam search             beam size 4; length penalty 0.6

[4] https://bit.ly/2T8hBGn
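For concreteness, here is a minimal NumPy sketch of the gradual, sorting-based magnitude pruning loop described above. It assumes the cubic sparsity ramp of Zhu & Gupta (2017); the helper names (`sparsity_schedule`, `prune_step`) and the toy loop are ours, not the TensorFlow model_pruning API.

```python
import numpy as np

def sparsity_schedule(step, begin, end, final_sparsity, initial_sparsity=0.0):
    # Cubic ramp from Zhu & Gupta (2017): sparsity grows from initial_sparsity
    # at `begin` to final_sparsity at `end`, then stays constant.
    if step < begin:
        return initial_sparsity
    if step >= end:
        return final_sparsity
    progress = (step - begin) / float(end - begin)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def prune_step(weights, target_sparsity):
    # Sorting-based threshold: mask the smallest-magnitude fraction of weights.
    # Because the mask is recomputed from the dense weights (which keep receiving
    # gradient updates in the real library), previously masked weights can reactivate.
    k = int(round(target_sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Toy usage: prune a single weight matrix to 90% sparsity between steps 200 and 800.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for step in range(0, 1001, 100):                      # pruning applied every 100 steps
    s = sparsity_schedule(step, begin=200, end=800, final_sparsity=0.9)
    mask = prune_step(w, s)
    effective_w = w * mask                            # what the forward pass would use
```

In the actual library the mask is applied inside the layer's forward pass while the underlying variable continues to receive gradients, which is what allows reactivation.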
3.2. Variational Dropout

Variational dropout was originally proposed as a re-interpretation of dropout training as variational inference, providing a Bayesian justification for the use of dropout in neural networks and enabling useful extensions to the standard dropout algorithms like learnable dropout rates (Kingma et al., 2015). It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions (Molchanov et al., 2017).

Variational dropout performs variational inference to learn the parameters of a fully-factorized Gaussian posterior over the weights under a log-uniform prior. In the standard formulation, we apply a local reparameterization to move the sampled noise from the weights to the activations, and then apply the additive noise reparameterization to further reduce the variance of the gradient estimator. Under this parameterization, we directly optimize the mean and variance of the neural network parameters. After training a model with variational dropout, the weights with the highest learned dropout rates can be removed to produce a sparse model.

3.3. l0 Regularization

l0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.

3.4. Random Pruning Baseline

For our experiments, we also include a random sparsification procedure adapted from the magnitude pruning technique of Zhu & Gupta (2017). Our random pruning technique uses the same sparsity schedule, but differs by selecting the weights to be pruned at each step at random rather than based on magnitude, and does not allow pruned weights to reactivate. This technique is intended to represent a lower bound of the accuracy-sparsity trade-off curve.

3.5. Experimental Framework

For magnitude pruning, we used the TensorFlow model_pruning library. We implemented variational dropout and l0 regularization from scratch. For variational dropout, we verified our implementation by reproducing the results from the original paper. To verify our l0 regularization implementation, we applied our weight-level code to Wide ResNet (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 and replicated the training FLOPs reduction and accuracy results from the original publication. Verification results for variational dropout and l0 regularization are included in Appendices B and C. For random pruning, we modified the TensorFlow model_pruning library to randomly select weights as opposed to sorting them based on magnitude.

For each model, we kept the number of training steps constant across all techniques and performed extensive hyperparameter tuning. While magnitude pruning is relatively simple to apply to large models and achieves reasonably consistent performance across a wide range of hyperparameters, variational dropout and l0 regularization are much less well understood. To our knowledge, we are the first to apply these techniques to models of this scale. To produce a fair comparison, we did not limit the amount of hyperparameter tuning we performed for each technique. In total, our results encompass over 4000 experiments.
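To make the contrast with the random baseline of Section 3.4 explicit, here is a small NumPy sketch (ours, reusing `sparsity_schedule` from the magnitude pruning sketch above): weights to prune are chosen uniformly at random and, unlike magnitude pruning, never reactivate.

```python
import numpy as np

def random_prune_step(mask, target_sparsity, rng):
    # Grow the pruned set to `target_sparsity`, choosing new victims uniformly
    # at random from the currently unpruned entries. The mask only shrinks,
    # so pruned weights never reactivate (Section 3.4).
    flat = mask.ravel().copy()
    target_pruned = int(round(target_sparsity * flat.size))
    extra = target_pruned - int(flat.size - flat.sum())
    if extra > 0:
        alive = np.flatnonzero(flat)
        flat[rng.choice(alive, size=extra, replace=False)] = 0.0
    return flat.reshape(mask.shape)

rng = np.random.default_rng(0)
mask = np.ones((64, 64))
for step in range(0, 1001, 100):
    s = sparsity_schedule(step, begin=200, end=800, final_sparsity=0.9)
    mask = random_prune_step(mask, s, rng)
```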
Figure 1. Sparsity-BLEU trade-off curves for the Transformer. Top: Pareto frontiers for each of the four sparsification techniques applied to the Transformer. Bottom: All experimental results with each technique. Despite the diversity of approaches, the relative performance of all three techniques is remarkably consistent. Magnitude pruning notably outperforms more complex techniques for high levels of sparsity.

Figure 2. Average sparsity in Transformer layers. Distributions calculated on the top performing model at 90% sparsity for each technique. l0 regularization and variational dropout are able to learn non-uniform distributions of sparsity, while magnitude pruning induces user-specified sparsity distributions (in this case, uniform).

4. Sparse Neural Machine Translation

We adapted the Transformer (Vaswani et al., 2017) model for neural machine translation to use these four sparsification techniques, and trained the model on the WMT 2014 English-German dataset. We sparsified all fully-connected layers and embeddings, which make up 99.87% of all of the parameters in the model (the other parameters coming from biases and layer normalization). The constant hyperparameters used for all experiments are listed in Table 1. We followed the standard training procedure used by Vaswani et al. (2017), but did not perform checkpoint averaging. This setup yielded a baseline BLEU score of 27.29 averaged across five runs.

We extensively tuned the remaining hyperparameters for each technique. Details on the hyperparameters we explored, and the settings that produced the best models, can be found in Appendix D.

4.1. Sparse Transformer Results & Analysis

All results for the Transformer are plotted in Figure 1. Despite the vast differences in these approaches, the relative performance of all three techniques is remarkably consistent. While l0 regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models. While all techniques were able to outperform the random pruning technique, randomly removing weights produces surprisingly reasonable results, which is perhaps indicative of the model's ability to recover from damage during optimization.

What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights for each layer. This is in stark contrast to variational dropout and l0 regularization, where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a non-uniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and l0 regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.

Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both l0 regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules, and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, l0 regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.

It is also important to note that these results maintain a constant number of training steps across all techniques, and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than l0 regularization and variational dropout respectively. While the standard Transformer training scheme produces excellent results for machine translation, it has been shown that training the model for longer can improve its performance by as much as 2 BLEU (Ott et al., 2018). Thus, when compared for a fixed training cost, magnitude pruning has a distinct advantage over these more complicated techniques.
Table 2. Constant hyperparameters for all ResNet-50 experiments.

    Hyperparameter          Value
    dataset                 ImageNet
    training iterations     128000
    batch size              1024 images
    learning rate schedule  standard
    optimizer               SGD with Momentum
    sparsity range          50% - 98%
5. Sparse Image Classification

To benchmark these four sparsity techniques on a large-scale computer vision task, we integrated each method into ResNet-50 and trained the model on the ImageNet large-scale image classification dataset. We sparsified all convolutional and fully-connected layers, which make up 99.79% of all of the parameters in the model (the other parameters coming from biases and batch normalization).

Figure 3. Sparsity-accuracy trade-off curves for ResNet-50. Top: Pareto frontiers for variational dropout, magnitude pruning, and random pruning applied to ResNet-50. Bottom: All experimental results with each technique. We observe large variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50. Magnitude pruning and variational dropout achieve comparable performance for most sparsity levels, with variational dropout achieving the best results for high sparsity levels.

The hyperparameters we used for all experiments are listed in Table 2. Each model was trained for 128000 iterations with a batch size of 1024 images, stochastic gradient descent with momentum, and the standard learning rate schedule (see Appendix E.1). This setup yielded a baseline top-1 accuracy of 76.69% averaged across three runs. We trained each model with 8-way data parallelism across 8 accelerators. Due to the extra parameters and operations required for variational dropout, the model was unable to fit into device memory in this configuration. For all variational dropout experiments, we used a per-device batch size of 32 images and scaled the model over 32 accelerators.

5.1. ResNet-50 Results & Analysis

Figure 3 shows results for magnitude pruning, variational dropout, and random pruning applied to ResNet-50. Surprisingly, we were unable to produce sparse ResNet-50 models with l0 regularization that did not significantly damage model quality. Across hundreds of experiments, our models were either able to achieve full test set performance with no sparsification, or sparsification with test set performance akin to random guessing. Details on all hyperparameter settings explored are included in Appendix E.

This result is particularly surprising given the success of l0 regularization on Transformer. One nuance of the l0 regularization technique of Louizos et al. (2017b) is that the model can have varying sparsity levels between the training and test-time versions of the model. At training time, a parameter with a dropout rate of 10% will be zero 10% of the time when sampled from the hard-concrete distribution. However, under the test-time parameter estimator, this weight will be non-zero [5]. Louizos et al. (2017b) reported results applying l0 regularization to a wide residual network (WRN) (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset, and noted that they observed small accuracy loss at as low as an 8% reduction in the number of parameters during training. Applying our weight-level l0 regularization implementation to WRN produces a model with comparable training-time sparsity, but with no sparsity in the test-time parameters. For models that achieve test-time sparsity, we observe significant accuracy degradation on CIFAR-10. This result is consistent with our observation for l0 regularization applied to ResNet-50 on ImageNet.

The variation in performance for variational dropout and l0 regularization between Transformer and ResNet-50 is striking. While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind l0 regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces models on par with or better than magnitude pruning, and that l0 regularization is not able to produce sparse models at all.

[5] The fraction of time a parameter is set to zero during training depends on other factors, e.g. the β parameter of the hard-concrete distribution. However, the general point holds: the training and test-time sparsities are not necessarily equivalent, and there exists some dropout rate threshold below which a weight that is sometimes zero during training will be non-zero at test-time.
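To make footnote [5] concrete, a rough back-of-the-envelope calculation of ours (not from the paper), using the parameterization reproduced in Appendix A.3 and the default hard-concrete parameters from Appendix D.3 (β = 2/3, γ = −0.1, ζ = 1.1): a gate that is zero 10% of the time during training satisfies sigmoid(β log(−γ/ζ) − log α) = 0.1, which gives log α ≈ 0.60. The deterministic test-time estimator then yields ẑ = min(1, max(0, sigmoid(log α)(ζ − γ) + γ)) ≈ min(1, max(0, 0.645 × 1.2 − 0.1)) ≈ 0.67, so the weight is retained with a clearly non-zero gate at test time even though it was frequently zeroed during training.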
Figure 4. Average sparsity in ResNet-50 layers. Distributions calculated on the top performing model at 95% sparsity for each technique. Variational dropout is able to learn non-uniform distributions of sparsity, decreasing sparsity in the input and output layers that are known to be disproportionately important to model quality.

Figure 5. Sparsity-accuracy trade-off curves for ResNet-50 with modified sparsification scheme. Altering the distribution of sparsity across the layers and increasing training time yield significant improvement for magnitude pruning.

Variational dropout achieved particularly notable results in the high sparsity range, maintaining a top-1 accuracy over 70% with less than 4% of the parameters of a standard ResNet-50.

The distribution of sparsity across different layer types in the best variational dropout and magnitude pruning models at 95% sparsity is plotted in Figure 4. While we kept sparsity constant across all layers for magnitude and random pruning, variational dropout significantly reduces the amount of sparsity induced in the first and last layers of the model.

It has been observed that the first and last layers are often disproportionately important to model quality (Han et al., 2015; Bellec et al., 2017). In the case of ResNet-50, the first convolution comprises only .037% of all the parameters in the model. At 98% sparsity the first layer has only 188 non-zero parameters, for an average of less than 3 parameters per output feature map. With magnitude pruning uniformly sparsifying each layer, it is surprising that it is still able to achieve any test set performance at all with so few parameters in the input convolution.

While variational dropout is able to learn to distribute sparsity non-uniformly across the layers, it comes at a significant increase in resource requirements. For ResNet-50 trained with variational dropout we observed a greater than 2x increase in memory consumption. When scaled across 32 accelerators, ResNet-50 trained with variational dropout completed training in 9.75 hours, compared to ResNet-50 with magnitude pruning finishing in 12.50 hours on only 8 accelerators. Scaled to a 4096 batch size and 32 accelerators, ResNet-50 with magnitude pruning can complete the same number of epochs in just 3.15 hours.

5.2. Pushing the Limits of Magnitude Pruning

Given that a uniform distribution of sparsity is suboptimal, and given the significantly smaller resource requirements for applying magnitude pruning to ResNet-50, it is natural to wonder how well magnitude pruning could perform if we were to distribute the non-zero weights more carefully and increase training time.

To understand the limits of the magnitude pruning heuristic, we modify our ResNet-50 training setup to leave the first convolutional layer fully dense, and only prune the final fully-connected layer to 80% sparsity. This heuristic is reasonable for ResNet-50, as the first layer makes up a small fraction of the total parameters in the model and the final layer makes up only .03% of the total FLOPs. While tuning the magnitude pruning ResNet-50 models, we observed that the best models always started and ended pruning during the third learning rate phase, before the second learning rate drop. To take advantage of this, we increase the number of training steps by 1.5x by extending this learning rate region. Results for ResNet-50 trained with this scheme are plotted in Figure 5.

With these modifications, magnitude pruning outperforms variational dropout at all but the highest sparsity levels while still using fewer resources. However, variational dropout's performance in the high sparsity range is particularly notable. With very low amounts of non-zero weights, we find it likely that the model's performance on the test set is closely tied to the precise allocation of weights across the different layers, and that variational dropout's ability to learn this distribution enables it to better maintain accuracy at high sparsity levels. This result indicates that efficient sparsification techniques that are able to learn the distribution of sparsity across layers are a promising direction for future work.

It's also worth noting that these changes produced models at 80% sparsity with a top-1 accuracy of 76.52%, only .17% off our baseline ResNet-50 accuracy and .41% better than the results reported by He et al. (2018), without the
extra complexity and computational requirements of their
reinforcement learning approach. This represents a new
state-of-the-art sparsity-accuracy trade-off for ResNet-50
trained on ImageNet.
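As a purely illustrative rendering of the modified scheme from Section 5.2, the per-layer sparsity targets could be expressed as a simple lookup; the layer names below are hypothetical placeholders, not the actual variable names in the ResNet-50 implementation.

```python
def per_layer_sparsity(layer_name, global_target):
    # Modified magnitude pruning scheme (Section 5.2): keep the first
    # convolution dense, prune the final fully-connected layer to a fixed 80%,
    # and prune every other layer to the global target.
    if layer_name == "initial_conv":   # ~.037% of parameters, disproportionately important
        return 0.0
    if layer_name == "final_dense":    # final classifier, only ~.03% of total FLOPs
        return 0.80
    return global_target

targets = {name: per_layer_sparsity(name, 0.95)
           for name in ["initial_conv", "block1/conv1", "block4/conv3", "final_dense"]}
```

Extending the third learning rate region by 1.5x, as described above, is the other half of the recipe.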
6. Sparsification as Architecture Search
While sparsity is traditionally thought of as a model com-
pression technique, two independent studies have recently
suggested that the value of sparsification in neural net-
works is misunderstood, and that once a sparse topology
is learned it can be trained from scratch to the full perfor-
mance achieved when sparsification was performed jointly
with optimization.
Frankle & Carbin (2018) posited that over-parameterized neural networks contain small, trainable subsets of weights, deemed "winning lottery tickets". They suggest that sparsity inducing techniques are methods for finding these sparse topologies, and that once found, the sparse architectures can be trained from scratch with the same weight initialization that was used when the sparse architecture was learned. They demonstrated that this property holds across different convolutional neural networks and multi-layer perceptrons trained on the MNIST and CIFAR-10 datasets.

Liu et al. (2018) similarly demonstrated this phenomenon for a number of activation sparsity techniques on convolutional neural networks, as well as for weight level sparsity learned with magnitude pruning. However, they demonstrate this result using a random initialization during re-training.

Figure 6. Scratch and lottery ticket experiments with magnitude pruning. Top: results with Transformer. Bottom: results with ResNet-50. Across all experiments, training from scratch using a learned sparse architecture is unable to reproduce the performance of models trained with sparsification as part of the optimization process.

The implications of being able to train sparse architectures from scratch once they are learned are large: once a sparse topology is learned, it can be saved and shared as with any other neural network architecture. Re-training can then be done fully sparse, taking advantage of sparse linear algebra to greatly accelerate time-to-solution. However, the combination of these two studies does not clearly establish how this potential is to be realized.

Beyond the question of whether or not the original random weight initialization is needed, both studies only explore convolutional neural networks (and small multi-layer perceptrons in the case of Frankle & Carbin (2018)). The majority of experiments in both studies also limited their analyses to the MNIST, CIFAR-10, and CIFAR-100 datasets. While these are standard benchmarks for deep learning models, they are not indicative of the complexity of real-world tasks where model compression is most useful. Liu et al. (2018) do explore convolutional architectures on the ImageNet dataset, but only at two relatively low sparsity levels (30% and 60%). They also note that weight level sparsity on ImageNet is the only case where they are unable to reproduce the full accuracy of the pruned model.

To clarify the questions surrounding the idea of sparsification as a form of neural architecture search, we repeat the experiments of Frankle & Carbin (2018) and Liu et al. (2018) on ResNet-50 and Transformer. For each model, we explore the full range of sparsity levels (50% - 98%) and compare to our well-tuned models from the previous sections.

6.1. Experimental Framework

The experiments of Liu et al. (2018) encompass taking the final learned weight mask from a magnitude pruning model, randomly re-initializing the weights, and training the model with the normal training procedure (i.e., learning rate, number of iterations, etc.). To account for the presence of sparsity at the start of training, they scale the variance of the initial weight distribution by the number of non-zeros in the matrix. They additionally train a variant where they increase the number of training steps (up to a factor of 2x) such that the re-trained model uses approximately the same number of FLOPs during training as a model trained with sparsification as part of the optimization process. They refer to these two experiments as "scratch-e" and "scratch-b" respectively.

Frankle & Carbin (2018) follow a similar procedure, but use the same weight initialization that was used when the sparse weight mask was learned and do not perform the longer training time variant.
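The following sketch shows one way the "scratch" re-initialization could look under our reading of the description above: fresh random weights whose variance is rescaled for the reduced fan-in implied by the learned mask. The exact initializer and scaling direction used by Liu et al. (2018) may differ; `base_std` and the 1/density scaling are assumptions for illustration only.

```python
import numpy as np

def scratch_reinit(mask, rng, base_std=0.05):
    # Re-initialize a pruned layer for "scratch" training: draw fresh weights,
    # then rescale the standard deviation so the variance accounts for the
    # fraction of non-zero connections (i.e. the effective fan-in). base_std
    # stands in for whatever dense initializer the layer would normally use.
    density = mask.sum() / mask.size                  # fraction of weights kept by the mask
    std = base_std / np.sqrt(max(density, 1e-8))
    return rng.normal(scale=std, size=mask.shape) * mask   # only the learned topology is trainable

rng = np.random.default_rng(0)
mask = (rng.random((256, 256)) < 0.1).astype(np.float64)   # a 90%-sparse mask for illustration
w0 = scratch_reinit(mask, rng)
```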
For our experiments, we repeat the scratch-e, scratch-b and lottery ticket experiments with magnitude pruning on Transformer and ResNet-50. For scratch-e and scratch-b, we also train variants that do not alter the initial weight distribution. For the Transformer, we re-trained five replicas of the best magnitude pruning hyperparameter settings at each sparsity level and saved the weight initialization and final sparse weight mask. For each of the five learned weight masks, we train five identical replicas for the scratch-e, scratch-b, scratch-e with augmented initialization, scratch-b with augmented initialization, and the lottery ticket experiments. For ResNet-50, we followed the same procedure with three re-trained models and three replicas at each sparsity level for each of the five experiments. Figure 6 plots the averages and min/max of all experiments at each sparsity level [6].

6.2. Scratch and Lottery Ticket Results & Analysis

Across all of our experiments, we observed that training from scratch using a learned sparse architecture is not able to match the performance of the same model trained with sparsification as part of the optimization process.

Across both models, we observed that doubling the number of training steps did improve the quality of the results for the scratch experiments, but was not sufficient to match the test set performance of the magnitude pruning baseline. As sparsity increased, we observed that the deviation between the models trained with magnitude pruning and those trained from scratch increased. For both models, we did not observe a benefit from using the augmented weight initialization for the scratch experiments.

For ResNet-50, we experimented with four different learning rate schemes for the scratch-b experiments. We found that scaling each learning rate region to double the number of epochs produced the best results by a wide margin. These results are plotted in Figure 6. Results for the ResNet-50 scratch-b experiments with the other learning rate variants are included with our release of hyperparameter tuning results.

For the lottery ticket experiments, we were not able to replicate the phenomenon observed by Frankle & Carbin (2018). The key difference between our experiments is the complexity of the tasks and scale of the models, and it seems likely that this is the main factor contributing to our inability to train these architectures from scratch.

For the scratch experiments, our results are consistent with the negative result observed by Liu et al. (2018) for ImageNet and ResNet-50 with unstructured weight pruning. By replicating the scratch experiments at the full range of sparsity levels, we observe that the quality of the models degrades relative to the magnitude pruning baseline as sparsity increases. For unstructured weight sparsity, it seems likely that the phenomenon observed by Liu et al. (2018) was produced by a combination of low sparsity levels and small-to-medium sized tasks. We'd like to emphasize that this result is only for unstructured weight sparsity, and that prior work (Liu et al., 2018) provides strong evidence that activation pruning behaves differently.

7. Limitations of This Study

Hyperparameter exploration. For all techniques and models, we carefully hand-tuned hyperparameters and performed extensive sweeps encompassing thousands of experiments over manually identified ranges of values. However, the number of possible settings vastly outnumbers the set of values that can be practically explored, and we cannot eliminate the possibility that some techniques significantly outperform others under settings we did not try.

Neural architectures and datasets. Transformer and ResNet-50 were chosen as benchmark tasks to represent a cross section of large-scale deep learning tasks with diverse architectures. We can't exclude the possibility that some techniques achieve consistently high performance across other architectures. More models and tasks should be thoroughly explored in future work.

8. Conclusion

In this work, we performed an extensive evaluation of three state-of-the-art sparsification techniques on two large-scale learning tasks. Notwithstanding the limitations discussed in Section 7, we demonstrated that complex techniques shown to yield state-of-the-art compression on small datasets perform inconsistently, and that simple heuristics can achieve comparable or better results on a reduced computational budget. Based on insights from our experiments, we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 with only magnitude pruning and highlight promising directions for research in sparsity inducing techniques.

Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression. As such, we open-source our code, checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work.

[6] Two of the 175 Transformer experiments failed to train from scratch at all and produced BLEU scores less than 1.0. We omit these outliers in Figure 6.
Acknowledgements

We would like to thank Benjamin Caine, Jonathan Frankle, Raphael Gontijo Lopes, Sam Greydanus, and Keren Gu for helpful discussions and feedback on drafts of this paper.

References

Bellec, G., Kappel, D., Maass, W., and Legenstein, R. A. Deep Rewiring: Training Very Sparse Deep Networks. CoRR, abs/1711.05136, 2017.

Collins, M. D. and Kohli, P. Memory Bounded Deep Convolutional Networks. CoRR, abs/1412.1442, 2014. URL http://arxiv.org/abs/1412.1442.

Dai, B., Zhu, C., and Wipf, D. P. Compressing Neural Networks using the Variational Information Bottleneck. CoRR, abs/1802.10399, 2018.

Frankle, J. and Carbin, M. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. CoRR, abs/1803.03635, 2018. URL http://arxiv.org/abs/1803.03635.

Gray, S., Radford, A., and Kingma, D. P. Block-sparse GPU kernels. https://blog.openai.com/block-sparse-gpu-kernels/, 2017.

Guo, Y., Yao, A., and Chen, Y. Dynamic Network Surgery for Efficient DNNs. In NIPS, 2016.

Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both Weights and Connections for Efficient Neural Network. In NIPS, pp. 1135-1143, 2015.

Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, pp. 164-171. Morgan Kaufmann, 1992.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770-778, 2016.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815-832, 2018.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017.

Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2415-2424, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal Brain Damage. In NIPS, pp. 598-605. Morgan Kaufmann, 1989.

Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In NIPS, pp. 2178-2188, 2017.

Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2755-2763, 2017.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the Value of Network Pruning. CoRR, abs/1810.05270, 2018.

Louizos, C., Ullrich, K., and Welling, M. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 3290-3300, 2017a.

Louizos, C., Welling, M., and Kingma, D. P. Learning Sparse Neural Networks through L0 Regularization. CoRR, abs/1712.01312, 2017b.

Luo, J., Wu, J., and Lin, W. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5068-5076, 2017.

Mitchell, T. J. and Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.

Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science. Nature Communications, 2018.

Molchanov, D., Ashukha, A., and Vetrov, D. P. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2498-2507, 2017.

Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning. CoRR, abs/1611.06440, 2016.

Narang, S., Diamos, G. F., Sengupta, S., and Elsen, E. Exploring Sparsity in Recurrent Neural Networks. CoRR, abs/1704.05119, 2017.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pp. 1-9, 2018.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pp. 1278-1286. JMLR.org, 2014.

Ström, N. Sparse Connection and Pruning in Large Dynamic Artificial Neural Networks. In EUROSPEECH, 1997.

Theis, L., Korshunova, I., Tejani, A., and Huszár, F. Faster gaze prediction with dense networks and Fisher pruning. CoRR, abs/1801.05787, 2018. URL http://arxiv.org/abs/1801.05787.

Ullrich, K., Meeds, E., and Welling, M. Soft Weight-Sharing for Neural Network Compression. CoRR, abs/1702.04008, 2017.

Valin, J. and Skoglund, J. LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR, abs/1810.11846, 2018. URL http://arxiv.org/abs/1810.11846.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, pp. 125, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000-6010, 2017.

Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.

Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017. URL http://arxiv.org/abs/1710.01878.
The State of Sparsity in Deep Neural Networks: Appendix

A. Overview of Sparsity Inducing Techniques

Here we provide a more detailed review of the three sparsity techniques we benchmarked.

A.1. Magnitude Pruning

Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training. Many variants have been proposed (Collins & Kohli, 2014; Han et al., 2015; Guo et al., 2016; Zhu & Gupta, 2017), with the key differences lying in when weights are removed, whether weights should be sorted to remove a precise proportion or thresholded based on a fixed or decaying value, and whether or not weights that have been pruned still receive gradient updates and have the potential to return after being pruned.

Han et al. (2015) use iterative magnitude pruning and re-training to progressively sparsify a model. The target model is first trained to convergence, after which a portion of weights are removed and the model is re-trained with these weights fixed to zero. This process is repeated until the target sparsity is achieved. Guo et al. (2016) improve on this approach by allowing masked weights to still receive gradient updates, enabling the network to recover from incorrect pruning decisions during optimization. They achieve higher compression rates and interleave pruning steps with gradient update steps to avoid expensive re-training. Zhu & Gupta (2017) similarly allow gradient updates to masked weights, and make use of a gradual sparsification schedule with sorting-based weight thresholding to maintain accuracy while achieving a user-specified level of sparsification.

It's worth noting that magnitude pruning can easily be adapted to induce block or activation level sparsity by removing groups of weights based on their p-norm, average, max, or other statistics. Variants have also been proposed that maintain a constant level of sparsity during optimization to enable accelerated training (Mocanu et al., 2018).

A.2. Variational Dropout

Consider the setting of a dataset D of N i.i.d. samples (x, y) and a standard classification problem where the goal is to learn the parameters w of the conditional probability p(y | x, w). Bayesian inference combines some initial belief over the parameters w in the form of a prior distribution p(w) with observed data D into an updated belief over the parameters in the form of the posterior distribution p(w | D). In practice, computing the true posterior using Bayes' rule is computationally intractable, and good approximations are needed. In variational inference, we optimize the parameters θ of some parameterized model q_θ(w) such that q_θ(w) is a close approximation to the true posterior distribution p(w | D) as measured by the Kullback-Leibler divergence between the two distributions. The divergence of our approximate posterior from the true posterior is minimized in practice by maximizing the variational lower-bound

    L(\theta) = -D_{KL}(q_\theta(w) \,\|\, p(w)) + L_D(\theta),
    where L_D(\theta) = \sum_{(x,y) \in D} \mathbb{E}_{q_\theta(w)}[\log p(y \mid x, w)].

Using the Stochastic Gradient Variational Bayes (SGVB) (Kingma et al., 2015) algorithm to optimize this bound, L_D(θ) reduces to the standard cross-entropy loss, and the KL divergence between our approximate posterior and prior over the parameters serves as a regularizer that enforces our initial belief about the parameters w.

In the standard formulation of variational dropout, we assume the weights are drawn from a fully-factorized Gaussian approximate posterior:

    w_{ij} \sim q_\theta(w_{ij}) = \mathcal{N}(\theta_{ij}, \alpha_{ij}\theta_{ij}^2)

where θ and α are neural network parameters. For each training step, we sample weights from this distribution and use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) to differentiate the loss w.r.t. the parameters through the sampling operation. Given that the weights are normally distributed, the distribution of the activations B after a linear operation like matrix multiplication or convolution is also Gaussian and can be calculated in closed form [7]:

    q_\theta(b_{mj} \mid A) \sim \mathcal{N}(\gamma_{mj}, \delta_{mj}),
    with \gamma_{mj} = \sum_{i=1}^{K} a_{mi}\theta_{ij} and \delta_{mj} = \sum_{i=1}^{K} a_{mi}^2 \alpha_{ij} \theta_{ij}^2,

where a_{mi} ∈ A are the inputs to the layer.

[7] We ignore correlation in the activations, as is done by Molchanov et al. (2017).
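A minimal NumPy sketch (ours) of sampling activations from the closed-form Gaussian above, which is the basis of the local reparameterization trick discussed next; the function and variable names, and details such as the small epsilon, are illustrative assumptions.

```python
import numpy as np

def sample_activations(a, theta, log_alpha, rng):
    # b ~ N(gamma, delta) with gamma = A @ theta and
    # delta = A^2 @ (alpha * theta^2), per the equations above.
    alpha = np.exp(log_alpha)
    gamma = a @ theta
    delta = (a ** 2) @ (alpha * theta ** 2)
    eps = rng.standard_normal(gamma.shape)
    return gamma + np.sqrt(delta + 1e-8) * eps   # one independent sample per example

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 32))                 # batch of 8 inputs with K = 32 features
theta = 0.1 * rng.standard_normal((32, 16))      # posterior means
log_alpha = np.full((32, 16), -3.0)              # small per-weight dropout rates to start
b = sample_activations(a, theta, log_alpha, rng)
```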
Thus, rather than sample weights, we can directly sample the activations at each layer. This step is known as the local reparameterization trick, and was shown by Kingma et al. (2015) to reduce the variance of the gradients relative to the standard formulation, in which a single set of sampled weights must be shared for all samples in the input batch for efficiency. Molchanov et al. (2017) showed that the variance of the gradients could be further reduced by using an additive noise reparameterization, where we define a new parameter

    \sigma_{ij}^2 = \alpha_{ij} \theta_{ij}^2.

Under this parameterization, we directly optimize the mean and variance of the neural network parameters.

Under the assumption of a log-uniform prior on the weights w, the KL divergence component of our objective function, D_{KL}(q_\theta(w_{ij}) \,\|\, p(w_{ij})), can be accurately approximated (Molchanov et al., 2017):

    D_{KL}(q_\theta(w_{ij}) \,\|\, p(w_{ij})) \approx -k_1\,\sigma(k_2 + k_3 \log \alpha_{ij}) + 0.5 \log(1 + \alpha_{ij}^{-1}) + k_1,
    k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695.

After training a model with variational dropout, the weights with the highest α values can be removed. For all their experiments, Molchanov et al. (2017) removed weights with log α larger than 3.0, which corresponds to a dropout rate greater than 95%. Although they demonstrated good results, it is likely that the optimal α threshold varies across different models and even different hyperparameter settings of the same model. We address this question in our experiments.

A.3. l0 Regularization

To optimize the l0-norm, we reparameterize the model weights θ as the product of a weight and a random variable drawn from the hard-concrete distribution:

    \theta_j = \tilde{\theta}_j z_j,
    where z_j = \min(1, \max(0, \bar{s})), \quad \bar{s} = s(\zeta - \gamma) + \gamma,
    s = \mathrm{sigmoid}\big((\log u - \log(1 - u) + \log \alpha_j)/\beta\big), \quad u \sim U(0, 1).

In this formulation, the α parameter that controls the position of the hard-concrete distribution (and thus the probability that z_j is zero) is optimized with gradient descent. β, γ, and ζ are fixed parameters that control the shape of the hard-concrete distribution: β controls the curvature or temperature of the hard-concrete probability density function, and γ and ζ stretch the distribution s.t. z_j takes value 0 or 1 with non-zero probability.

On each training iteration, z_j is sampled from this distribution and multiplied with the standard neural network weights. The expected l0-norm L_C can then be calculated using the cumulative distribution function of the hard-concrete distribution and optimized directly with stochastic gradient descent:

    L_C = \sum_{j=1}^{|\theta|} \big(1 - Q_{\bar{s}_j}(0 \mid \phi)\big) = \sum_{j=1}^{|\theta|} \mathrm{sigmoid}\Big(\log \alpha_j - \beta \log \frac{-\gamma}{\zeta}\Big).

At test-time, Louizos et al. (2017b) use the following estimate for the model parameters:

    \theta = \tilde{\theta} \odot \hat{z},
    \hat{z} = \min\big(1, \max\big(0, \mathrm{sigmoid}(\log \alpha)(\zeta - \gamma) + \gamma\big)\big).

Interestingly, Louizos et al. (2017b) showed that their objective function under the l0 penalty is a special case of a variational lower-bound over the parameters of the network under a spike and slab (Mitchell & Beauchamp, 1988) prior.
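The hard-concrete machinery above is compact enough to sketch directly. The following NumPy fragment is our own illustration (using the default β, γ, ζ reported in Appendix D.3) of the training-time gate sample, the expected-L0 penalty, and the deterministic test-time estimator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_gates(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    # Training-time sample of the hard-concrete gates z in [0, 1].
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # Expected number of non-zero gates: the penalty L_C above.
    return np.sum(sigmoid(log_alpha - beta * np.log(-gamma / zeta)))

def test_time_gates(log_alpha, gamma=-0.1, zeta=1.1):
    # Deterministic estimator z_hat used at evaluation time.
    return np.clip(sigmoid(log_alpha) * (zeta - gamma) + gamma, 0.0, 1.0)

rng = np.random.default_rng(0)
log_alpha = rng.normal(size=(4, 4))
z = sample_gates(log_alpha, rng)      # multiplied elementwise with the weights during training
penalty = expected_l0(log_alpha)      # added to the loss, weighted by the l0-norm coefficient
z_hat = test_time_gates(log_alpha)    # fixed gates applied at test time
```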
B. Variational Dropout Implementation Verification

To verify our implementation of variational dropout, we applied it to LeNet-300-100 and LeNet-5-Caffe on MNIST and compared our results to the original paper (Molchanov et al., 2017). We matched our hyperparameters to those used in the code released with the paper [8]. All results are listed in Table 3.

Table 3. Variational Dropout MNIST Reproduction Results.

    Network         Experiment                          Sparsity (%)   Accuracy (%)
    LeNet-300-100   original (Molchanov et al., 2017)   98.57          98.08
                    ours (log α = 3.0)                  97.52          98.42
                    ours (log α = 2.0)                  98.50          98.40
                    ours (log α = 0.1)                  99.10          98.13
    LeNet-5-Caffe   original (Molchanov et al., 2017)   99.60          99.25
                    ours (log α = 3.0)                  99.29          99.26
                    ours (log α = 2.0)                  99.50          99.25

Our baseline LeNet-300-100 model achieved a test set accuracy of 98.42%, slightly higher than the baseline of 98.36% reported in (Molchanov et al., 2017). Applying our variational dropout implementation to LeNet-300-100 with these hyperparameters produced a model with 97.52% global sparsity and 98.42% test accuracy. The original paper produced a model with 98.57% global sparsity and 98.08% test accuracy. While our model achieves .34% higher test accuracy with 1% lower sparsity, we believe the discrepancy is mainly due to differences in our software packages: the authors of (Molchanov et al., 2017) used Theano and Lasagne for their experiments, while we use TensorFlow.

Given that our model achieves the highest accuracy, we can decrease the log α threshold to trade accuracy for more sparsity. With a log α threshold of 2.0, our model achieves 98.5% global sparsity with a test set accuracy of 98.40%. With a log α threshold of 0.1, our model achieves 99.1% global sparsity with 98.13% test set accuracy, exceeding the sparsity and accuracy of the originally published results.

On LeNet-5-Caffe, our implementation achieved a global sparsity of 99.29% with a test set accuracy of 99.26%, versus the originally published results of 99.6% sparsity with 99.25% accuracy. Lowering the log α threshold to 2.0, our model achieves 99.5% sparsity with 99.25% test accuracy.

[8] https://github.com/ars-ashuha/variational-dropout-sparsifies-dnn

C. l0 Regularization Implementation Verification

The original l0 regularization paper uses a modified version of the proposed technique for inducing group sparsity in models, so our weight-level implementation is not directly comparable. However, to verify our implementation we trained a Wide ResNet (WRN) (Zagoruyko & Komodakis, 2016) on CIFAR-10 and compared results to those reported in the original publication for group sparsity.

As done by Louizos et al. (2017b), we apply l0 to the first convolutional layer in the residual blocks (i.e., where dropout would normally be used). We use the weight decay formulation for the re-parameterized weights, and scale the weight decay coefficient to maintain the same initial length scale of the parameters. We use the same batch size of 128 samples and the same initial log α, and train our model on a single GPU.

Our baseline WRN-28-10 implementation trained on CIFAR-10 achieved a test set accuracy of 95.45%. Using our l0 regularization implementation and an l0-norm weight of .0003, we trained a model that achieved 95.34% accuracy on the test set while achieving a consistent training-time FLOPs reduction comparable to that reported by Louizos et al. (2017b). The floating-point operations (FLOPs) required to compute the forward pass over the course of training WRN-28-10 with l0 are plotted in Figure 7.

Figure 7. Forward pass FLOPs for WRN-28-10 trained with l0 regularization. Our implementation achieves FLOPs reductions comparable to those reported in Louizos et al. (2017b).

During our re-implementation of the WRN experiments from Louizos et al. (2017b), we identified errors in the original publication's FLOP calculations that caused the number of floating-point operations in WRN-28-10 to be miscalculated. We've contacted the authors, and hope to resolve this issue to clarify their performance results.

D. Sparse Transformer Experiments

D.1. Magnitude Pruning Details

For our magnitude pruning experiments, we tuned four key hyperparameters: the starting iteration of the sparsification process, the ending iteration of the sparsification process, the frequency of pruning steps, and the combination of other regularizers (dropout and label smoothing) used during training. We trained models with 7 different target sparsities: 50%, 60%, 70%, 80%, 90%, 95%, and 98%. At each of these sparsity levels, we tried pruning frequencies of 1000 and 10000 steps. During preliminary experiments we identified that the best settings for the training step to stop pruning at were typically closer to the end of training. Based on this insight, we explored every possible combination of start and end points for the sparsity schedule in increments of 100000 steps with an ending step of 300000 or greater.

By default, the Transformer uses dropout with a dropout rate of 10% on the input to the encoder, decoder, and before each layer, and performs label smoothing with a smoothing parameter of .1. We found that decreasing these other regularizers produced higher quality models in the mid to high sparsity range. For each hyperparameter combination, we tried three different regularization settings: standard label smoothing and dropout, label smoothing only, and no regularization.

D.2. Variational Dropout Details

For the Transformer trained with variational dropout, we extensively tuned the coefficient for the KL divergence component of the objective function to find models that achieved high accuracy with sparsity levels in the target range. We found that KL divergence weights in the range [0.1/N, 1/N], where N is the number of samples in the training set, produced models in our target sparsity range.
(Molchanov et al., 2017) noted difficulty training some models from scratch with variational dropout, as large portions of the model adopt high dropout rates early in training before the model can learn a useful representation from the data. To address this issue, they use a gradual ramp-up of the KL divergence weight, linearly increasing the regularizer coefficient until it reaches the desired value.

For our experiments, we explored using a constant regularizer weight, linearly increasing the regularizer weight, and also increasing the regularizer weight following the cubic sparsity function used with magnitude pruning. For the linear and cubic weight schedules, we tried each combination of possible start and end points in increments of 100000 steps. For each hyperparameter combination, we also tried the three different combinations of dropout and label smoothing as with magnitude pruning. For each trained model, we evaluated the model with 11 log α thresholds in the range [0, 5]. For all experiments, we initialized all log σ² parameters to the constant value -10.

D.3. l0 Regularization Details

For Transformers trained with l0 regularization, we similarly tuned the coefficient for the l0-norm in the objective function. We observed that much higher magnitude regularization coefficients were needed to produce models with the same sparsity levels relative to variational dropout. We found that l0-norm weights in the range [1/N, 100/N] produced models in our target sparsity range.

For all experiments, we used the default settings for the parameters of the hard-concrete distribution: β = 2/3, γ = -0.1, and ζ = 1.1. We initialized the log α parameters to 2.197, corresponding to a 10% dropout rate.

For each hyperparameter setting, we explored the three regularizer coefficient schedules used with variational dropout and each of the three combinations of dropout and label smoothing.

D.4. Random Pruning Details

We identified in preliminary experiments that random pruning typically produces the best results by starting and ending pruning early and allowing the model to finish the rest of the training steps with the final sparse weight mask. For our experiments, we explored all hyperparameter combinations that we explored with magnitude pruning, and also included start/end pruning step combinations with an end step of less than 300000.

E. Sparse ResNet-50

E.1. Learning Rate

For all experiments, we used the learning rate scheme used by the official TensorFlow ResNet-50 implementation [9]. With our batch size of 1024, this includes a linear ramp-up for 5 epochs to a learning rate of .4, followed by learning rate drops by a factor of 0.1 at epochs 30, 60, and 80.

[9] https://bit.ly/2Wd2Lk0

E.2. Magnitude Pruning Details

For magnitude pruning on ResNet-50, we trained models with a target sparsity of 50%, 70%, 80%, 90%, 95%, and 98%. At each sparsity level, we tried starting pruning at steps 8k, 20k, and 40k. For each potential starting point, we tried ending pruning at steps 68k, 76k, and 100k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k steps and explored training with and without label smoothing. During preliminary experiments, we observed that removing weight decay from the model consistently caused significant decreases in test accuracy. Thus, for all hyperparameter combinations, we left weight decay on with the standard coefficient.

For a target sparsity of 98%, we observed that very few hyperparameter combinations were able to complete training without failing due to numerical issues. Out of all the hyperparameter configurations we tried, only a single model was able to complete training without erroring from the presence of NaNs. As explained in the main text, at high sparsity levels the first layer of the model has very few non-zero parameters, leading to instability during training and low test set performance. Pruned ResNet-50 models with the first layer left dense did not exhibit these issues.

E.3. Variational Dropout Details

For variational dropout applied to ResNet-50, we explored the same combinations of start and end points for the KL divergence weight ramp-up as we did for the start and end points of magnitude pruning. For all Transformer experiments, we did not observe a significant gain from using a cubic KL divergence weight ramp-up schedule, and thus only explored the linear ramp-up for ResNet-50. For each combination of start and end points for the KL divergence weight, we explored 9 different coefficients for the KL divergence loss term: .01/N, .03/N, .05/N, .1/N, .3/N, .5/N, 1/N, 10/N, and 100/N.

Contrary to our experience with Transformer, we found ResNet-50 with variational dropout to be highly sensitive to the initialization of the log σ² parameters. With the standard setting of -10, we couldn't match the baseline accuracy, and with an initialization of -20 our models achieved
good test performance but no sparsity. After some experimentation, we were able to produce good results with an initialization of -15.

While with Transformer we saw a reasonable amount of variance in test set performance and sparsity for the same model evaluated at different log α thresholds, we did not observe the same phenomenon for ResNet-50. Across a range of log α values, we saw consistent accuracy and nearly identical sparsity levels. For all of the results reported in the main text, we used a log α threshold of 0.5, which we found to produce slightly better results than the standard threshold of 3.0.

E.4. l0 Regularization Details

For l0 regularization, we explored four different initial log α values corresponding to dropout rates of 1%, 5%, 10%, and 30%. For each dropout rate, we extensively tuned the l0-norm weight to produce models in the desired sparsity range. After identifying the proper range of l0-norm coefficients, we ran experiments with 20 different coefficients in that range. For each combination of these hyperparameters, we tried all four combinations of other regularizers: standard weight decay and label smoothing, only weight decay, only label smoothing, and no regularization. For weight decay, we used the formulation for the reparameterized weights provided in the original paper, and followed their approach of scaling the weight decay coefficient based on the initial dropout rate to maintain a constant length-scale between the l0 regularized model and the standard model.

Across all of these experiments, we were unable to produce ResNet models that achieved a test set performance better than random guessing. For all experiments, we observed that training proceeded reasonably normally until the l0-norm loss began to drop, at which point the model incurred severe accuracy loss. We include the results of all hyperparameter combinations in our data release.

Additionally, we tried a number of tweaks to the learning process to improve the results, to no avail. We explored training the model for twice the number of epochs, training with much higher initial dropout rates, modifying the β parameter for the hard-concrete distribution, and a modified test-time parameter estimator.

E.5. Random Pruning Details

For random pruning on ResNet-50, we shifted the set of possible start and end points for pruning earlier in training relative to those we explored for magnitude pruning. At each of the sparsity levels tried with magnitude pruning, we tried starting pruning at steps 0, 8k, and 20k. For each potential starting point, we tried ending pruning at steps 40k, 68k, and 76k. For every hyperparameter setting, we tried pruning frequencies of 2k, 4k, and 8k and explored training with and without label smoothing.

E.6. Scratch-B Learning Rate Variants

For the scratch-b (Liu et al., 2018) experiments with ResNet-50, we explored four different learning rate schemes for the extended training time (2x the default number of epochs).

The first learning rate scheme we explored was uniformly scaling each of the five learning rate regions to last for double the number of epochs. This setup produced the best results by a wide margin. We report these results in the main text.

The second learning rate scheme was to keep the standard learning rate, and maintain the final learning rate for the extra training steps, as is common when fine-tuning deep neural networks. The third learning rate scheme was to maintain the standard learning rate, and continually drop the learning rate by a factor of 0.1 every 30 epochs. The last scheme we explored was to skip the learning rate warm-up, and drop the learning rate by 0.1 every 30 epochs. This learning rate scheme is closest to the one used by Liu et al. (2018). We found that this scheme underperformed relative to the scaled learning rate scheme with our training setup.

Results for all learning rate schemes are included with the released hyperparameter tuning data.
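As a closing illustration, here is a small helper of our own (not the official implementation referenced in E.1) that reproduces the stepwise schedule from E.1 and, with `scale=2`, the uniformly stretched variant that E.6 reports as the best scratch-b scheme.

```python
def resnet50_lr(epoch, base_lr=0.4, warmup_epochs=5, drops=(30, 60, 80), scale=1.0):
    # E.1 schedule: linear warm-up to base_lr over 5 epochs, then 10x drops at
    # epochs 30, 60 and 80. scale=2 stretches every region to twice as many
    # epochs, which is the scratch-b variant that worked best (E.6).
    warmup = warmup_epochs * scale
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    lr = base_lr
    for d in drops:
        if epoch >= d * scale:
            lr *= 0.1
    return lr

standard = [resnet50_lr(e) for e in range(90)]                # the default schedule
scratch_b = [resnet50_lr(e, scale=2.0) for e in range(180)]   # doubled regions for scratch-b
```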

Binary file not shown.