From 266e37164213108bdefd9af561450df5cb63596f Mon Sep 17 00:00:00 2001 From: Eduardo Cueto Mendoza Date: Fri, 14 Aug 2020 17:25:50 -0600 Subject: [PATCH] More documents for Corpus --- Corpus/CORPUS.txt | 4936 +++++++++++++++++ Corpus/MOGRIFIER LSTM.txt | Bin 56526 -> 0 bytes ...on for Deep Neural Networks - Yu Cheng.txt | 1145 ---- ...uning Adaptive Sparsity by Fine-Tuning.txt | 662 --- Corpus/Network Pruning notes.txt | 150 - ...h towards Efficient Deep Architectures.txt | Bin 40392 -> 0 bytes Corpus/Optimal Brain - Le Cun.txt | 1985 ------- Corpus/PLUG AND PLAY LANGUAGE MODELS.txt | Bin 195719 -> 0 bytes ... for Natural Language Processing Tasks.txt | Bin 88077 -> 0 bytes ...hout access to training or testing dat.txt | Bin 81616 -> 0 bytes ...y iteratively conserving synaptic flow.txt | Bin 63305 -> 0 bytes 11 files changed, 4936 insertions(+), 3942 deletions(-) delete mode 100644 Corpus/MOGRIFIER LSTM.txt delete mode 100644 Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt delete mode 100644 Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt delete mode 100644 Corpus/Network Pruning notes.txt delete mode 100644 Corpus/Network Trimming_ A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures.txt delete mode 100644 Corpus/Optimal Brain - Le Cun.txt delete mode 100644 Corpus/PLUG AND PLAY LANGUAGE MODELS.txt delete mode 100644 Corpus/Predicting Performance for Natural Language Processing Tasks.txt delete mode 100644 Corpus/Predicting trends in the quality of state-of-the-art neural networkswithout access to training or testing dat.txt delete mode 100644 Corpus/Pruning neural networks without any databy iteratively conserving synaptic flow.txt diff --git a/Corpus/CORPUS.txt b/Corpus/CORPUS.txt index 9e04a6f..851f739 100644 --- a/Corpus/CORPUS.txt +++ b/Corpus/CORPUS.txt @@ -16585,4 +16585,4940 @@ BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springe HERTZ,J.A.,KROGH,A.,andPALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Red.wood City, CA. MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA. WATKIN, T. L. H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499. +<> <> <> + + +<> <> <> +Model Compression and Acceleration for Deep Neural Networks The principles, progress, and challenges + +In recent years, deep neural networks (DNNs) have received increased attention, have been applied to different applications, and achieved dramatic accuracy improvements in many tasks. These works rely on deep networks with millions or even billions of parameters, and the availability of graphics process.ing units (GPUs) with very high computation capability plays a key role in their success. For example, Krizhevsky et al. [1] achieved breakthrough results in the 2012 ImageNet Challenge using a network containing 60 million parameters with five convolutional layers and three fully connected layers. Usually, it takes two to three days to train the whole model on the ImagetNet data set with an NVIDIA K40 machine. In another example, the top face-verification results from the Labeled Faces in the Wild (LFW) data set were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally connected, and fully connected layers [2], [3]. It is also very time-consuming to train such a model to obtain a reasonable performance. 
In architectures that only rely on fully connected layers, the number of parameters can grow to billions [4].

Introduction

As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications such as online learning and incremental learning. In addition, recent years witnessed significant progress in virtual reality, augmented reality, and smart wearable devices, creating unprecedented opportunities for researchers to tackle fundamental challenges in deploying deep-learning systems to portable devices with limited resources [e.g., memory, central processing units (CPUs), energy, bandwidth]. Efficient deep-learning methods can have a significant impact on distributed systems, embedded devices, and field-programmable gate arrays (FPGAs) for artificial intelligence (AI). For example, the residual network-50 (ResNet-50) [5], which has 50 convolutional layers, needs more than 95 megabytes of memory for storage and numerous floating-point multiplications to process each image. After discarding some redundant weights, the network still works as usual while saving more than 75% of the parameters and 50% of the computation time.
For devices like cell phones and FPGAs with only a few megabytes of resources, how to compact the models used on them is also important.
Achieving these goals calls for joint solutions from many disciplines, including but not limited to machine learning, optimization, computer architecture, data compression, indexing, and hardware design.
In this article, we review recent works on compressing and accelerating DNNs, which have attracted much attention from the deep-learning community and have already achieved significant progress in past years.
We classify these approaches into four categories:
1) Parameter pruning and sharing: The parameter pruning and sharing-based methods explore the redundancy in the model parameters and try to remove the redundant and noncritical ones.
2) Low-rank factorization: Low-rank factorization-based techniques use matrix/tensor decomposition to estimate the informative parameters of the deep convolutional neural networks (CNNs).
3) Transferred/compact convolutional filters: The transferred/compact convolutional filters-based approaches design special structural convolutional filters to reduce the storage and computation complexity.
4) Knowledge distillation (KD): The KD methods learn a distilled model and train a more compact neural network to reproduce the output of a larger network.
In Table 1, we briefly summarize these four types of methods. Generally, the parameter pruning and sharing, low-rank factorization, and KD approaches can be used in DNNs with fully connected layers and convolutional layers, achieving comparable performances. On the other hand, methods using transferred/compact filters are designed for models with convolutional layers only. Low-rank factorization and transferred/compact filters-based approaches provide an end-to-end pipeline and can be easily implemented in a CPU/GPU environment, which is straightforward, while parameter pruning and sharing use different methods such as vector quantization, binary coding, and sparse constraints to perform the task; it usually takes several steps to achieve the goal.
Regarding training protocols, models based on parameter pruning/sharing and low-rank factorization can be extracted from pretrained ones or trained from scratch, while the transferred/compact filter and KD models can only support training from scratch. These methods are independently designed and complement each other. For example, transferred layers and parameter pruning and sharing can be used together, and model quantization and binarization can be used together with low-rank approximations to achieve further speedup. We will describe the details of each theme and their properties, strengths, and drawbacks in the following sections.

Table 1. Summary of the four types of methods (theme, description, applications, details).
Parameter pruning and sharing: reducing redundant parameters that are not sensitive to the performance. Applications: convolutional layer and fully connected layer. Details: robust to various settings, can achieve good performance, can support both training from scratch and pretrained models.
Low-rank factorization: using matrix/tensor decomposition to estimate the informative parameters. Applications: convolutional layer and fully connected layer. Details: standardized pipeline, easily implemented, can support both training from scratch and pretrained models.
Transferred/compact convolutional filters: designing special structural convolutional filters to save parameters. Applications: only for the convolutional layer. Details: algorithms are dependent on applications, usually achieve good performance, only support training from scratch.
KD: training a compact neural network with distilled knowledge of a large model. Applications: convolutional layer and fully connected layer. Details: model performances are sensitive to applications and network structure, only support training from scratch.

Parameter pruning and sharing

An early work that showed that network pruning is effective in reducing the network complexity and addressing the overfitting problem is [6]. Since then, it has been widely studied to compress DNN models, trying to remove parameters that are not crucial to the model performance. These techniques can be further classified into three categories: model quantization and binarization, parameter sharing, and structural matrix.

Quantization and binarization

Network quantization compresses the original network by reducing the number of bits required to represent each weight. Gong et al. [6] and Wu et al. [7] applied k-means scalar quantization to the parameter values. Vanhoucke et al. [8] showed that 8-bit quantization of the parameters can result in significant speedup with minimal loss of accuracy. The work in [9] used 16-bit fixed-point representation in stochastic rounding-based CNN training, which significantly reduced memory usage and floating-point operations with little loss in classification accuracy.
The method proposed in [10] first pruned the unimportant connections and retrained the sparsely connected networks. Then it quantized the link weights using weight sharing, and finally applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it starts by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network is retrained to learn the final weights for the remaining sparse connections. This work achieves the state-of-the-art performance among all parameter quantization-based methods.
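To make the three-stage pipeline of [10] concrete, the following sketch (an illustration only, not the authors' implementation; the sparsity level, cluster count, and random weights are arbitrary assumptions) applies magnitude-based pruning followed by k-means weight sharing to a single weight matrix in NumPy. The resulting codebook and per-weight cluster indices are what the Huffman coding stage would then compress.

import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    # Stage 1: zero out the smallest-magnitude weights until roughly
    # `sparsity` of them are removed, keeping a mask of survivors.
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k] if k < flat.size else np.inf
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def kmeans_share(weights, mask, n_clusters=16, n_iters=20):
    # Stage 2: 1-D k-means over the surviving weights; every weight is
    # replaced by its cluster centroid, so only the small codebook plus
    # per-weight cluster indices need to be stored.
    vals = weights[mask]
    centroids = np.linspace(vals.min(), vals.max(), n_clusters)  # linear initialization
    for _ in range(n_iters):
        assign = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            members = vals[assign == c]
            if members.size:
                centroids[c] = members.mean()
    quantized = np.zeros_like(weights)
    quantized[mask] = centroids[assign]
    return quantized, centroids, assign  # `assign` is what Huffman coding would compress further

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256)).astype(np.float32)
    W_pruned, mask = magnitude_prune(W, sparsity=0.9)
    W_shared, codebook, codes = kmeans_share(W_pruned, mask, n_clusters=16)
    print("surviving fraction:", float(mask.mean()), "codebook size:", codebook.size)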
The work in [11] showed that the Hessian weight could be used to measure the importance of network parameters, and proposed to minimize the average Hessian-weighted quantization error when clustering network parameters. A novel quantization framework was introduced in [12], which reduced the precision of network weights to ternary values.
In the extreme case of a 1-bit representation of each weight, i.e., binary weight neural networks, there are also many works that directly train CNNs with binary weights; for instance, BinaryConnect [13], BinaryNet [14], and XNOR-Networks [15]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [16] showed that networks trained with backpropagation could be robust against specific weight distortions, including binary weights.

Drawbacks

However, the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of these binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss. To address this issue, the work in [17] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [18] significantly reduced the time spent on floating-point multiplication in the training stage by stochastically binarizing weights and converting the multiplications in the hidden state to sign changes.

Pruning and sharing

Network pruning and sharing has been used both to reduce network complexity and to address the overfitting issue. An early approach to pruning was biased weight decay [19]. The optimal brain damage [20] and the optimal brain surgeon [21] methods reduced the number of connections based on the Hessian of the loss function, and their works suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. Those methods supported training from scratch.
A recent trend in this direction is to prune redundant, noninformative weights in a pretrained CNN model. For example, Srinivas and Babu [22] explored the redundancy among neurons and proposed a data-free pruning method to remove redundant neurons. Han et al. [23] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [24] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights, and then used Huffman coding to encode the quantized weights. In [25], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re)training procedure. It is worth noting that the aforementioned pruning schemes typically produce connection pruning in CNNs.
There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are

<
> + +FIGURE 1. The three-stage compression method proposed in [10]: pruning, quantization, and encoding. The input is the original model, and the output is the compression model. + +nels, or even layers. In filter-level pruning, all of the afore.mentioned works used l_2-norm regularizers. The work in [29] used l1-norm to select and prune unimportant filters. + +Drawbacks + +There are some potential issues of the pruning and sharing works. First, pruning with l1 or l2 regularization requires more iterations to converge. Furthermore, all pruning criteria require manual setup of sensitivity for layers, which demands fine-tuning of the parameters and could be cumbersome for some applications. +Designing the structural matrix +In architectures that contain only fully connected layers, the number of parameters can grow up to billions [4]. Thus, it is critical to explore this redundancy of parameters in fully connected layers, which is often the bottleneck in terms of memory consumption. These network layers use the nonlinear transforms +<>, where v () is an element-wise nonlinear operator, x is the input vector, and M is the mn matrix of <> parameters. When M is a large general dense matrix, the cost of storing mn parameters and computing matrix-vector products in Om( n) time. Thus, an intuitive way to prune parameters is to impose x as a parameterized structural matrix. An mn matrix +<> that can be described using much fewer parameters than mn is called a structured matrix. Typically, the structure should not only reduce the memory cost but also dramatically accelerate the inference and training stage via fast matrix-vector multiplication and gradient computations. +Following this direction, the work in [30] proposed a sim.ple and efficient approach based on circulant projections, while maintaining competitive error rates. Given a vector +<>, a circulant matrix <> is defined as Thus the memory cost becomes <> instead of <>. +<> This circulant structure also enables the use of fast Fourier transform (FFT) to speed up the computation. Given a d-dimensional vector r, the 1-layer circulant neural network in (1) has time complexity of <>. +In [31], a novel adaptive fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The adaptive fastfood translation invariant property form matrix <> was defined as of the representations to input image, which is the key <>. (2) to the success of training +due to exploring the very deep models without +Here, <> are random diago-SG and severe overfitting. + + + +nal matrices. <> is a random permutation matrix and H denotes the Walsh-Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the adaptive fastfood transform reduces the storage and the computational costs from <> to <> and from <> to +<> And <>, respectively. +The work in [32] showed the effectiveness of the new notion of parsimony in the theory of structured matrices. Their pro.posed method can be extended to various other structured matrix classes, including block and multilevel Toeplitz-like [33] matrices related to multidimensional convolution [34]. + +Drawbacks + +One potential problem of this kind of approach is that the structural constraint will cause loss in accuracy since the constraint might bring bias to the model. On the other hand, how to find a proper structural matrix is difficult. There is no theoretical way from which to derive it. 
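To illustrate why the circulant structure of [30] is attractive, here is a minimal sketch (under the textbook definition of a circulant matrix, not the authors' code) showing that the matrix-vector product C(r) x can be computed with the FFT in O(d log d) time while storing only the d-dimensional vector r.

import numpy as np

def circulant_matvec(r, x):
    # Compute C(r) @ x where C(r) is the circulant matrix whose first column is r.
    # Because C(r) @ x is a circular convolution, it equals IFFT(FFT(r) * FFT(x)),
    # so the cost is O(d log d) and only the vector r is ever stored.
    return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))

def circulant_dense(r):
    # Materialize C(r) explicitly, only to check the fast version against it.
    d = len(r)
    return np.stack([np.roll(r, i) for i in range(d)], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    r, x = rng.normal(size=d), rng.normal(size=d)
    fast = circulant_matvec(r, x)
    slow = circulant_dense(r) @ x
    print(np.allclose(fast, slow))  # True: same product, O(d) memory instead of O(d^2)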
+Low-rank factorization and sparsity +As convolution operations constitute the bulk of all computations in CNNs, simplifying the convolution layer would have a direct impact on the overall speedup. The convolution kernels in a typical CNN is a four-dimensional tensor. The key observation is that there might be a significant amount of redundancy in the tensor. Ideas based on tensor decomposition seem to be a particularly promising way to remove the redundancy. Regarding to the fully connected layer, it can be viewed as a two-dimensional (2-D) matrix and the low-rankness can also help. +Using low-rank filters to accelerate convolution has a long history. Typical examples include high-dimensional discrete cosine transform (DCT) and wavelet systems constructed from one-dimensional (1-D) DCT transform and 1-D wave.lets, respectively, using tensor products. In the context of dictionary learning, Rigamonti et al. [35] suggested learning separable 1-D filters. In [36], a few low-rank approximation + +<> + +and clustering schemes for the convolutional kernels were +proposed. They achieved 2# speedup for a single convolutional layer with 1% drop in classification accuracy. The work in [37] suggested using different tensor decomposition +schemes, reporting a 45. # speedup with 1% drop in accuracy + +<> + +case. For the scheme in [39], the decomposition always exists and can achieve better performance than general CP. Table 2 lists a performance comparison of both methods. The actual speedup and compression rates are used to mea.sure the performances. We can see that the BN version can achieve slightly bet.ter performance while the CP version +gives higher compression rates. + +Original Framework Low-Rank + +Note that the fully connected layers + +<
> + +FIGURE 2. A typical framework of the low-rank regularization method. (a) is the original convolutional +layer, and (b) is the low-rank constraint convolutional layer with rank-K. +in text recognition. In both works, the approximation was done layer by layer. After one layer was approximated by the low-rank filters, the parameters of that layer were fixed, and the layers above were fine-tuned based on a reconstruction error criterion. These are typical low-rank methods for compressing 2-D convolutional layers, which is described in Figure 2. In [38], canonical polyadic (CP) decomposition of the kernel tensors was proposed. Their work used nonlinear least squares to compute the CP decomposition, which was also based on the tensor decomposition idea. In [39], a new algorithm for computing the low-rank tensor decomposition and a new method for training low-rank constrained CNNs from scratch were proposed. It used batch normalization (BN) to transform the activations of the internal hidden units, and it was shown to be an effective way to deal with the exploding or vanishing gradients. +In principle, both the CP decomposition scheme and the decomposition scheme in [39] (BN low-rank) can be used to train CNNs from scratch. For the CP decomposition, finding the best low-rank approximation is an ill-posed problem, and the best rank-K approximation may not exist in the general in fully connected layers. For instance, + +Misha et al. [40] reduced the number of dynamic parameters in deep models using the low-rank method. Reference [41] explored a low-rank matrix factorization of the final weight layer in a DNN for acoustic modeling. +Drawbacks +Low-rank approaches are straightforward for model compression and acceleration. The idea complements recent advances in deep learning such as dropout, rectified units, and maxout. However, the implementation is not that easy since it involves a decomposition operation, which is computationally expensive. Another issue is that current methods perform low-rank approximation layer by layer, and thus cannot perform global parameter compression, which is important as different layers hold different information. Finally, factorization requires extensive model retraining to achieve convergence when com.pared to the original model. +Transferred/compact convolutional filters +CNNs are parameter-efficient due to exploring the translation invariant property of the representations to input image, which is the key to the success of training very deep models without severe overfitting. Although a strong theory is currently missing, a large amount of empirical evidence sup.ports the notion that both the translation invariant property and convolutional weight-sharing are important for good predictive performance. The idea of using transferred convolutional filters to compress CNN models is motivated by recent works in [42], which introduced the equivariant group theory. Let x be an input, <> be a network or layer, and <> be the transform matrix. The concept of equivariance is defined as + +<>, (3) + +which says that transforming the input x by the transform <> and then passing it through the network or layer <> should give the same result as first mapping x through the network and then transforming the representation. Note that, + +Model + +AlexNet BN low-rank CP low-rank VGG-16 BN low-rank CP low-rank GoogleNet BN low-rank CP low-rank + +<> + +in [42], the transforms <> and Tl ()$ are not necessarily the same as they operate on different objects. 
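A small numerical illustration of the equivariance property in (3) may help: with g taken to be a circular 1-D convolution and the transforms T and T' both taken to be a circular shift (a special case chosen only for intuition, in which the two transforms happen to coincide), applying the transform before or after the layer gives the same result.

import numpy as np

def conv1d_circular(w, x):
    # Circular 1-D convolution of filter w with signal x, standing in for a layer g.
    d = len(x)
    return np.array([sum(w[k] * x[(i - k) % d] for k in range(len(w))) for i in range(d)])

def shift(x, s):
    # The transform T (and T'): a circular shift by s positions.
    return np.roll(x, s)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=16)
    w = rng.normal(size=5)
    s = 3
    lhs = conv1d_circular(w, shift(x, s))   # g(T(x))
    rhs = shift(conv1d_circular(w, x), s)   # T'(g(x))
    print(np.allclose(lhs, rhs))            # True: convolution is translation-equivariant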
According to this theory, it is reasonable to apply the transform to layers or filters <> to compress the whole network models. From empirical observation, deep CNNs also benefit from using a large set of convolutional filters by applying a certain transform <> to a small set of base filters since it acts as a regularizer for the model. +Following this trend, there are many recent works proposed to build a convolutional layer from a set of base filters [42] [45]. What they have in common is that the transform <> lies in the family of functions that only operate in the spatial +domain of the convolutional filters. For example, the work in [44] found that the lower convolution layers of CNNs learned redundant filters to extract both positive and negative phase information of an input signal, and defined <> to be the simple negation function +- +<>. (4) +Here, Wx is the basis convolutional filter +- +and Wx is the filter consisting of the shifts whose activation is opposite to that of Wx and selected after max-pooling operation. By doing this, the work in [44] can easily achieve 2# compression rate on all the convolutional layers. It is also shown that the negation transform acts as a strong regularizer to improve the classification accuracy. The intuition is that the learning algorithm with pair-wise positive-negative constraint can lead to useful convolutional filters instead of redundant ones. +In [45], it was observed that magnitudes of the responses from convolutional kernels had a wide diversity of pattern representations in the network, and it was not proper to discard weaker signals with a single threshold. Thus, a multibias nonlinearity activation function was proposed to generate more patterns in the feature space at low computational cost. The transform <> was define as + +<>, (5) + +where d were the multibias factors. The work in [46] considered a combination of rotation by a multiple of 90% and horizontal/vertical flipping with + +<>, (6) + +where WTi was the transformation matrix that rotated the original filters with angle i ! {90,180,270}. In [42], the transform was generalized to any angle learned from data, and i was directly obtained from data. Both [46] and [42] can achieve good classification performance. +Reference [43] defined <> as the set of translation functions applied to 2-D filters + +<>, (7) + +The basic idea of KD is to distill knowledge from a Drawbacks large teacher model into There are several issues that need to be +a small one by learning addressed for approaches that apply transfer information to convolutional filters. First, the class distributions these methods can achieve competitive performance for wide/flat architectures (like + +output by the teacher via softened softmax. + +where <> denoted the translation of the first operand by <> +xy along its spatial dimensions, with proper zero padding at borders to maintain the shape. The proposed framework can be used to 1) improve the classification accuracy as a regularized version of maxout networks and 2) to achieve parameter efficiency by flexibly varying their architectures to compress networks. +Table 3 briefly compares the performance of different methods with transferred convolutional filters, using VGG-Net (16 layers) as the baseline model. The results are report.ed on the CIFAR-10 and CIFAR-100 data sets with top-five error rates. It is observed that they can achieve reduction in +parameters with little or no drop in classification accuracy. 
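As a hedged sketch of the negation transform in (4), the snippet below doubles a small filter bank by concatenating each base filter with its negation, so only half of the effective filters need to be stored or learned; the layer shapes and the naive forward pass are illustrative assumptions rather than the exact setup of [44].

import numpy as np

def negation_filter_bank(base_filters):
    # Given base filters of shape (n, k, k), return 2n filters: each W and its negation -W.
    # Only the n base filters are stored or learned; the other half is derived, which is
    # the 2x compression of the lower convolutional layers discussed above.
    return np.concatenate([base_filters, -base_filters], axis=0)

def conv2d_valid(image, filt):
    # Naive "valid" 2-D correlation of one filter with one single-channel image.
    H, W = image.shape
    k = filt.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * filt)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(4, 3, 3))           # 4 learned base filters
    bank = negation_filter_bank(base)           # 8 effective filters, 4 stored
    img = rng.normal(size=(8, 8))
    responses = np.stack([np.maximum(conv2d_valid(img, f), 0) for f in bank])  # ReLU responses
    print(bank.shape, responses.shape)          # (8, 3, 3) (8, 6, 6)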
+VGGNet) but not narrow/special ones (like +GoogleNet and ResNet). Second, the trans.fer assumptions sometimes are too strong to guide the algorithm, making the results unstable on some data sets. +Using a compact filter for convolution can directly reduce the computation cost. The key idea is to replace the loose and overparametric filters with compact blocks to improve the speed, which significantly accelerate CNNs on several benchmarks. Decomposing 33 convolution into two 1x1 convolutions was used in [47], which achieved state-of-the-art acceleration performance on object recognition. SqueezeNet +[48] was proposed to replace 33# convolution with 1x1 convolution, which created a compact neural network with approximately 50 fewer parameters and comparable accuracy when compared to AlexNet. +KD +To the best of our knowledge, exploiting knowledge transfer to compress model was first proposed by Caruana et al. [49]. They trained a compressed model with pseudo-data labeled by an ensemble of strong classifiers and reproduced the output of the original larger network. However, their work is limited to shal.low models. The idea has been recently adopted in [50] as KD to compress deep and wide networks into shallower ones, where + +<
> + +the compressed model mimicked the function learned by the complex model. The basic idea of KD is to distill knowledge from a large teacher model into a small one by learning the class distributions output by the teacher via softened softmax. +The work in [51] introduced a KD compression framework, which eased the training of deep networks by following a student-teacher paradigm, in which the student was penalized according to a softened version of the teacher's output. The framework compressed an ensemble of deep networks (teacher) into a stu.dent network of similar depth. To do so, the student was trained to predict the output of the teacher, as well as the true classifica.tion labels. Despite its simplicity, KD demonstrates promising results in various image classification tasks. The work in [52] +aimed to address the network compression problem by taking advantage of depth neural networks. It proposed an approach to train thin and deep networks, called FitNets, to compress wide and shallower (but still deep) networks. The method was rooted in KD and extended the idea to allow for thinner and deeper student models. To learn from the intermediate representations of the teacher +network, FitNet made the student mimic the full feature maps of the teacher. However, such assumptions are too strict since the capacities of teacher and student may differ greatly. In certain circumstances, FitNet may adversely affect the performance and convergence. All the aforementioned methods are validated on the MNIST, CIFAR-10, CIFAR-100, SVHN, and AFLW benchmark data sets, and simulation results show that these methods match or outperform the teacher's performance, while requiring notably fewer parameters and multiplications. +There are several extensions along this direction of distillation knowledge. The work in [53] trained a parametric student model to approximate a Monte Carlo teacher. The proposed framework used online training and used DNNs for the student model. Different from previous works, which represented the knowledge using the softened label probabilities, [54] represented the knowledge by using the neurons in the higher hidden layer, which preserved as much information as the label probabilities, but are more compact. The work in [55] accelerated the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. The techniques are based on the concept of function-preserving transformations between neural net.work specifications. Zagoruyko et al. [56] proposed attention transfer to relax the assumption of FitNet. They transferred the attention maps that are summaries of the full activations. + +Drawbacks + +KD-based approaches can make deeper models thinner and help significantly reduce the computational cost. However, there are a few disadvantages. One of them is that KD can only be applied to classification tasks with softmax loss function, which hinders its usage. Another drawback is that the model assumptions sometimes are too strict to make the performance competitive with other types of approaches. + +Other types of approaches + +We first summarize the works utilizing attention-based methods. Note that attention-based systems [57] can reduce computations significantly by learning to selectively focus or �attend to� a few, task-relevant input regions. 
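Referring back to the KD framework of [51] described above, the following sketch shows one common form of the softened-softmax objective: a cross-entropy term against the teacher's temperature-softened class distribution combined with the usual hard-label loss. The temperature, the mixing weight, and the toy logits are assumptions of this illustration, not values taken from the cited works.

import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax; a larger T gives a softer class distribution.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # alpha * soft-target cross-entropy at temperature T + (1 - alpha) * hard-label loss.
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    soft_loss = -(p_teacher * log_p_student_T).sum(axis=-1).mean() * (T ** 2)  # T^2 keeps the term's scale comparable
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p_student[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = 3.0 * rng.normal(size=(5, 10))    # pretend teacher logits: 5 samples, 10 classes
    student = rng.normal(size=(5, 10))
    labels = rng.integers(0, 10, size=5)
    print(distillation_loss(student, teacher, labels))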
The work in [57] introduced the dynamic capacity network that combined two types of modules: the small subnetworks with low capacity, and the large ones with high capacity. The low-capacity subnetworks were active on the whole input to first find the task-relevant areas in the input, and then the attention mechanism was used to direct the high-capacity subnetworks to focus on the task-relevant regions in the input. By doing this, the size of the CNN model could be significantly reduced. + +Following this direction, the work in to measure the quality some important neurons. It proposed a new + + + + + +and acceleration are the + +The standard criteria [58] introduced the conditional computation idea, which only computes the gradient for of model compression type of general-purpose neural network component: a sparsely gated mixture-of-experts compression and the (MoE) layer. The MoE consisted of a number speedup rates. of experts, each a simple feed-forward neural +network, and a trainable gating network that selected a sparse combination of the experts to process each input. In [59], dynamic DNNs (D2NNs) were introduced, which were a type of feed-forward DNN that selected and executed a subset of D2NN neurons based on the input. +There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling[43], [60]. Network architectures, such as GoogleNet or network in network, can achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e., reusing features learned on the ImageNet data set and applying them to new tasks, is more difficult with this approach. This problem was noted by Szegedy et al. [60] and motivated them to add a linear layer on top of their networks to enable transfer learning. +The work in [61] targeted the ResNet-based model with a spatially varying computation time, called stochastic depth, which enabled the seemingly contradictory setup to train short networks and used deep networks at test time. It started with very deep networks and, while during training, for each mini-batch, randomly dropped a subset of layers and bypassed them with the identity function. This model is end-to-end trainable, deterministic, and can be viewed as a black-box feature extractor. Following this direction, the work in [62] proposed a pyramidal residual network with stochastic depth. +Other approaches to reduce the convolutional overheads include using FFT-based convolutions [63] and fast convolution using the Winograd algorithm [64]. Those works only aim to speedup the computation but not reduce the memory storage. + +Benchmarks, evaluation, and databases + +In the past five years, the deep-learning community has made great efforts in benchmark models. One of the most well-known models used in compression and acceleration for CNNs is Alexnet [1], which occasionally has been used for assessing the performance of compression. Other popular standard models include LeNets [65], All-CNN-nets [66], and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is + + +Proposing some general/ unified approaches is one direction that can be taken regarding the use of CNNs in small platforms. +about how to choose different compression approaches and possible challenges/solutions in this area. + +General suggestions + +There is no golden rule to measure which one of the four kinds of approaches is the best. 
How +a convolutional network that has two convolutional layers and two fully connected layers. Recently, more state-of-the-art architectures are used as baseline models in many works, including network in networks [67], VGGNets [68], and ResNets [69]. Table 4 summarizes the baseline mod.els commonly used in several typical compression methods. +The standard criteria to measure the quality of model compression and acceleration are the compression and the speedup rates. Assume that a is the number of the parameters in the original model M and a* is that of the compressed model M*, then the compression rate a (,MM*) of M* over Mis + +<> (8) + +Another widely used measurement is the index space saving defined in several papers [70], [71] as + +<>, (9) + +where a and a are the number of the dimension of the index space in the original model and that of the compressed model, respectively. +Similarly, given the running time s of M and s* of M* , the speedup rate d (,MM*) is defined as + +<> (10) + +Most work used the average training time per epoch to mea.sure the running time, while in [70] and [71], the average testing time was used. Generally, the compression rate and speedup rate are highly correlated, as smaller models often results in faster computation for both the training and the testing stages. +Good compression methods are expected to achieve almost the same performance as the original model with much smaller parameters and less computational time. However, for differ.ent applications with varying CNN designs, the correlation between parameter size and computational time may be different. For example, it is observed that, for deep CNNs with fully connected layers, most of the parameters are in the fully connected layers; while for image classification tasks, float-point operations are mainly in the first few convolutional layers since each filter is convolved with the whole image, which is usually very large at the beginning. Different applications should focus on different layers. + +Discussion and challenges + +In this article, we summarized recent works on compress.ing and accelerating DNNs. Here we discuss more details to choose the proper approaches is really de.pendent on the applications and requirements. Here, we provide some general suggestions. +If the applications needs compacted models from pretrained models, one can choose either pruning and sharing or low-rank factorization-based methods. If end-to-end solutions are needed for the problem, the low-rank and transferred convolutional filters approaches are preferred. +For applications in some specific domains, methods with human prior (like the transferred convolutional filters and structural matrix) sometimes have benefits. For example, when conducting medical images classification, transferred convolutional filters should work well as medical images (like organs) do have the rotation transformation property. +Usually, the approaches of pruning and sharing could give a reasonable compression rate while not hurting the accuracy. Thus, for applications that require stable model accuracy, it is better to utilize pruning and sharing. +If a problem involves small- or medium-size data sets, one can try the KD approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it a robust data set that is not large. +As we mentioned in the �Introduction,� techniques of the four themes are orthogonal. 
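As a small worked example tying the compression-rate definition in (8) to the low-rank theme, the sketch below replaces a fully connected weight matrix with a rank-K factorization and reports the resulting parameter compression rate. The sizes and rank are arbitrary, and a randomly initialized matrix is close to a worst case for this kind of approximation; trained weight matrices are typically far more compressible.

import numpy as np

def low_rank_factorize(M, rank):
    # Approximate M (m x n) as A @ B with A (m x rank) and B (rank x n) via truncated SVD.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # absorb the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

def compression_rate(n_params_original, n_params_compressed):
    # alpha(M, M*) = a / a*, reading (8) as original over compressed parameter counts.
    return n_params_original / n_params_compressed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, K = 1024, 1024, 64
    M = rng.normal(size=(m, n))
    A, B = low_rank_factorize(M, K)
    a, a_star = m * n, K * (m + n)
    rel_err = np.linalg.norm(M - A @ B) / np.linalg.norm(M)
    print(f"compression rate {compression_rate(a, a_star):.1f}x, relative error {rel_err:.3f}")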
It makes sense to combine two or three of them to maximize the compression/speedup rates. For some specific applications, like object detection, which requires both convolutional and fully connected layers, one can compress the convolutional layers with low-rank factorization and the fully connected layers with a pruning method. + +<> + +Technique challenges + +Techniques for deep model compression and acceleration are still in the early stages, and the following challenges still need to be addressed. +Most of the current state-of-the-art approaches are built on well-designed CNN models, which have limited freedom to change the configuration (e.g., network structural, hyperparameters). To handle more complicated tasks, it should provide more plausible ways to configure the compressed models. + +Good compression methods are expected to achieve almost the same performance as the original model with much smaller parameters and less computational time. +approaches. Instead of directly reducing and transferring parameters from the teach.er models, passing selectivity knowledge of neurons could be helpful. One can derive a way to select essential neurons related to the task. The intuition is that, if a neuron is activated in certain regions or samples, this implies these regions or samples share +Pruning is an effective way to compress and accelerate CNNs. Current pruning techniques are mostly designed to eliminate connections between neurons. On the other hand, a pruning channel can directly reduce the feature map width and shrink the model into a thinner one. It is efficient but also challenging because removing channels might dramatically change the input of the following layer. It is important to focus on how to address this issue. +As we mentioned previously, methods of structural matrix and transferred convolutional filters impose prior human knowledge to the model, which could significantly affect the performance and stability. It is critical to investigate how to control the impact of the imposed prior knowledge. +The methods of KD provide many benefits such as directly accelerating the model without special hardware or implementations. It is still worth it to develop KD-based approaches and explore how to improve the performance. +Hardware constraints in various of small platforms (e.g., mobile, robotic, self-driving cars) are still a major problem that hinder the extension of deep CNNs. How to make full use of the limited computational source available and how to design special compression methods for such platforms are still challenges that need to be addressed. + +Possible solutions + +To solve the hyperparameters configuration problem, we can rely on the recent learning-to-learn strategy [72], [73]. This framework provides a mechanism, allowing the algorithm to automatically learn how to exploit structure in the problem of interest. There are two different ways to combine the learning.to-learn module with the model compression. The first designs compression and learning-to-learn simultaneously, while the second way first configures the model with learn-to-learning and then prunes the parameters. +Channel pruning provides the efficiency benefit on both CPUs and GPUs because no special implementation is required. But it is also challenging to handle the input con.figuration. One possible solution is to use the training-based channel pruning methods [74], which focus on imposing sparse constraints on weights during training, and could adaptively determine hyperparameters. 
However, training from scratch for such a method is costly for very deep CNNs. +Exploring new types of knowledge in the teacher models and transferring it to the student models is useful for the KD +some common properties that may relate to the task. Performing such steps is time-consuming, thus efficient implementation is important. +For methods with convolutional filters and the structural matrix, we can conclude that the transformation lies in the family of functions that only operations on the spatial dimen.sions. Hence, to address the imposed prior issue, one solution is to provide a generalization of the aforementioned approach.es in two aspects: 1) instead of limiting the transformation to belong to a set of predefined transformations, let it be the whole family of spatial transformations applied to 2-D filters or the matrix, and 2) learn the transformation jointly with all of the model parameters. +Proposing some general/unified approaches is one direction that can be taken regarding the use of CNNs in small platforms. Yuhen et al. [75] presented a feature map dimensionality reduc.tion method by excavating and removing redundancy in feature maps generated by different filters, which could also preserve intrinsic information of the original network. The idea can be extended to make CNNs more applicable for different platforms. The work in [76] proposed a one-shot whole network compression scheme consisting of three components: rank selection, low-rank tensor decomposition, and fine-tuning to make deep CNNs work in mobile devices. From the systematic side, Facebook released the platform Caffe2 [77], which employed a particularly lightweight and modular framework and included mobile-specif.ic optimizations based on the hardware design. Caffe2 can help developers and researchers train large machine-learning models and deliver AI on mobile devices. + +Acknowledgments + +We would like to thank the reviewers and broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying this article. This research is supported by National Science Foundation of China, grant number 61401169. The corresponding author of this article is Pan Zhou. + +Authors + +Yu Cheng (chengyu@us.ibm.com) received his bachelor�s degree in automation from Tsinghua University, Beijing, China, in 2010 and his Ph.D. degree in computer science from Northwestern University, Evanston, Illinois in 2015. Currently, he is a research staff member at AI Foundations Lab, IBM T.J. Watson Research Center, Yorktown Heights, New York. His research is focused on deep learning in general, with specific interests in deep generative models and deep models compression. He also has published many works regarding the applications of deep learning in computer vision and natural language processing. +Duo Wang (d-wang15@mails.tsinghua.edu.cn) received the +B.S. degree in automation from the Harbin Institute of Technology, China, in 2015, where he is currently pursuing his Ph.D. degree in the Department of Automation, Tsinghua University. His research interests are deep/machine learning and their applications in computer vision and robotics vision. +Pan Zhou (panzhou@hust.edu.cn) received his B.S. degree in the Advanced Class of Huazhong University of Science and Technology (HUST), Wuhan China, and his M.S. degree in electronics and information engineering from the same university in 2006 and 2008, respectively. 
He received his Ph.D. degree from the School of Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta in 2011. Currently, he is an associate professor with School of Electronic Information and Communications, HUST. His research interests include big data analytics and machine learning, security and privacy, and information networks. +Tao Zhang (taozhang@mail.tsinghua.edu.cn) received his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and his Ph.D. degree from Saga University, Japan, in 2002, all in control engineering. He is a professor with the Department of Automation, Tsinghua University. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft. + +References + +[1] A. Krizhevsky, I. Sutskever, and G. Hinton, �Imagenet classification with deep convolutional neural networks,� in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1097�1105. +[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, �Deepface: Closing the gap to human-level performance in face verification,� in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2014, pp. 1701�1708. +[3] Y. Sun, X. Wang, and X. Tang, �Deeply learned face representations are sparse, selective, and robust,� in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015, pp. pp. 2892�2900. +[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, �Large scale distributed deep networks,� in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1223�1231. +[5] K. He, X. Zhang, S. Ren, and J. Sun, �Deep residual learning for image recogni.tion,� Computing Res. Repository, vol. abs/1512.03385, 2015. [Online]. Available: https://arxiv.org/pdf/1512.03385.pdf +[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, �Compressing deep convolutional networks using vector quantization,� Computing Res. Repository, vol. abs/1412.6115, 2014. [Online]. Available: https://arxiv.org/pdf/1412.6115.pdf +[7] Y. W. Q. H. Jiaxiang Wu, C. Leng, and J. Cheng, �Quantized convolutional neu.ral networks for mobile devices,� in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4820�4828. +[8] V. Vanhoucke, A. Senior, and M. Z. Mao, �Improving the speed of neural net.works on cpus,� in Proc. Conf. Neural Information Processing Systems Deep Learning and Unsupervised Feature Learning Workshop, 2011. +[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, �Deep learning with limited numerical precision,� in Proc. 32nd Int. Conf. Machine Learning, 2015, vol. 37, pp. 1737�1746. +[10] S. Han, H. Mao, and W. J. Dally, �Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,� in Proc. Int. Conf. Learning Representations, 2016. +[11] Y. Choi, M. El-Khamy, and J. Lee, �Towards the limit of network quantization,� Computing Res. Repository, vol. abs/1612.01543, 2016. [Online]. Available: https://arxiv.org/abs/1612.01543 +[12] C. Zhu, S. Han, H. Mao, and W. J. Dally, �Trained ternary quantization,� arXiv Preprint, arXiv:1612.01064, 2016. +[13] M. Courbariaux, Y. Bengio, and J. David, �Binaryconnect: Training deep neu.ral networks with binary weights during propagations,� in Proc. Advances Neural Information Processing Systems Annu. Conf., 2015, pp. 3123�3131. +[14] M. Courbariaux and Y. 
Bengio, �Binarynet: Training deep neural networks with weights and activations constrained to +1 or .1,� Computing Res. Repository, vol. abs/1602.02830, 2016. [Online]. Available: https://arxiv.org/abs/1602.02830 +[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, �Xnor-net: Imagenet classification using binary convolutional neural networks,� in Proc. European Conf. Computer Vision, 2016, pp. 525�542. +[16] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, �Deep neural networks are robust to weight binarization and other non-linear distortions,� Computing Res. Repository, vol. abs/1606.01981, 2016. [Online]. Available: https:// arxiv.org/abs/1606.01981 +[17] L. Hou, Q. Yao, and J. T. Kwok, �Loss-aware binarization of deep networks,� Computing Res. Repository, vol. abs/1611.01600, 2016. [Online]. Available: https:// arxiv.org/abs/1611.01600 +[18] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, �Neural networks with few multiplications,� Computing Res. Repository, vol. abs/1510.03009, 2015. [Online]. Available: https://arxiv.org/abs/1510.03009 +[19] S. J. Hanson and L. Y. Pratt, �Comparing biases for minimal network con.struction with back-propagation,� Adv. Neural Inform. Process. Syst. 1, 1989, pp. 177�185. +[20] Y. L. Cun, J. S. Denker, and S. A. Solla, �Advances in neural information pro.cessing systems 2,� in Optimal Brain Damage, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 598�605. +[21] B. Hassibi, D. G. Stork, and S. C. R. Com, �Second order derivatives for network pruning: Optimal brain surgeon,� in Advances in Neural Information Processing Systems, vol. 5. San Mateo, CA: Morgan Kaufmann, 1993, pp. 164� 171. +[22] S. Srinivas and R. V. Babu, �Data-free parameter pruning for deep neural net.works,� in Proc. British Machine Vision Conf., 2015, pp. 31.1�31.12. +[23] S. Han, J. Pool, J. Tran, and W. J. Dally, �Learning both weights and connections for efficient neural networks,� in Proc. 28th Int. Conf. Neural Information Processing Systems, 2015, pp. 1135�1143. +[24] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, �Compressing neural networks with the hashing trick,� in Proc. Machine Learning Research Workshop Conf., 2015, pp. 2285�2294. +[25] K. Ullrich, E. Meeds, and M. Welling, �Soft weight-sharing for neural network compression,� Computing Res. Repository, vol. abs/1702.04008, 2017. [Online]. Available: https://arxiv.org/abs/1702.04008 +[26] V. Lebedev and V. S. Lempitsky, �Fast convnets using group-wise brain dam.age,� in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2554� 2564. +[27] H. Zhou, J. M. Alvarez, and F. Porikli, �Less is more: Towards compact CNNs,� in Proc. European Conf. Computer Vision, 2016, pp. 662�677. +[28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, �Learning structured sparsity in deep neural networks,� Adv. Neural Inform. Process. Syst., vol. 29, pp. 2074�2082, 2016. +[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, �Pruning filters for efficient convnets,� Computing Res. Repository, vol. abs/1608.08710, 2016. [Online]. Available: https://arxiv.org/abs/1608.08710 +[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, �An exploration of parameter redundancy in deep networks with circulant projections,� in Proc. Int. Conf. Computer Vision, 2015, pp. 2857�2865. +[31] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, �Deep fried convnets,� in Proc. Int. Conf. Computer Vision, 2015, pp. 1476� 1483. +[32] V. 
Sindhwani, T. Sainath, and S. Kumar. (2015). Structured transforms for small-footprint deep learning. Advances in Neural Information Processing Systems, 28, pp. 3088�3096. [Online]. Available: http://papers.nips.cc/paper/5869.structured-transforms-for-small-footprint-deep-learning.pdf +[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-Block, and Toeplitz-Derived Matrices. Berlin, Germany: Springer, 1991, pp. 215�236. +[34] M. V. Rakhuba and I. V. Oseledets. (2015). Fast multidimensional convolution in low-rank tensor formats via cross approximation. SIAM J. Sci. Comput., 37(2). [Online]. Available: http://dx.doi.org/10.1137/140958529 +[35] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, �Learning separable filters,� in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2013, pp. 2754� 2761. +[36] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, �Exploiting lin.ear structure within convolutional networks for efficient evaluation,� Adv. Neural Inform. Process. Syst. vol. 27, pp. 1269�1277, 2014. +[37] M. Jaderberg, A. Vedaldi, and A. Zisserman, �Speeding up convolutional neu.ral networks with low rank expansions,� in Proc. British Machine Vision Conf., 2014, pp. 1�13. +[38] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, �Speeding-up convolutional neural networks using fine-tuned CP-decomposition,� Computing Res. Repository, vol. abs/1412.6553, 2014. [Online]. Available: https:// arxiv.org/abs/1412.6553 +[39] C. Tai, T. Xiao, X. Wang, and E. Weinan, �Convolutional neural networks with low-rank regularization,� Computing Res. Repository, vol. abs/1511.06067, 2015. +[40] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas. (2013). Predicting parameters in deep learning. Advances in Neural Information Processing Systems, 26, 2148�2156. [Online]. Available: http://media.nips.cc/nips.books/nipspapers/paper_files/nips26/1053.pdf +[41] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, �Low-rank matrix factorization for deep neural network training with high-dimen.sional output targets,� in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, 2013, pp. 6655�6659. +[42] T. S. Cohen and M. Welling, �Group equivariant convolutional networks,� arXiv Preprint, arXiv:1602.07576, 2016. +[43] S. Zhai, Y. Cheng, and Z. M. Zhang, �Doubly convolutional neural networks,� in Proc. Advances Neural Information Processing Systems, 2016, pp. 1082�1090. +[44] W. Shang, K. Sohn, D. Almeida, and H. Lee, �Understanding and improving convolutional neural networks via concatenated rectified linear units,� arXiv Preprint, arXiv:1603.05201, 2016. +[45] H. Li, W. Ouyang, and X. Wang, �Multi-bias non-linear activation in deep neural networks,� arXiv Preprint, arXiv:1604.00676, 2016. +[46] S. Dieleman, J. D Fauw, and K. Kavukcuoglu, �Exploiting cyclic symmetry in convolutional neural networks,� in Proc. 33rd Int. Conf. Machine Learning, 2016, vol. 48, pp. 1889�1898. +[47] C. Szegedy, S. Ioffe, and V. Vanhoucke. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning, Computing Res. Repository, vol. abs/1602.07261. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1602. html#SzegedyIV16 +[48] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, �Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autono.mous driving,� Computing Res. Repository, vol. abs/1612.01051, 2016. [Online]. 
Available: https://arxiv.org/abs/1612.01051 +[49] C. Bucilua�, R. Caruana, and A. Niculescu-Mizil. (2006). Model compression. Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, pp. 535� 541. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150464 +[50] J. Ba and R. Caruana, �Do deep nets really need to be deep?� Adv. Neural Inform. Process. Syst., vol. 27, pp. 2654�2662, 2014. +[51] G. E. Hinton, O. Vinyals, and J. Dean, �Distilling the knowledge in a neural net.work,� Computing Res. Repository, vol. abs/1503.02531, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531 +[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, �Fitnets: Hints for thin deep nets,� Computing Res. Repository, vol. abs/1412.6550, 2014. [Online]. Available: https://arxiv.org/abs/1412.6550 +[53] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling. (2015). Bayesian dark knowledge. Advances in Neural Information Processing Systems, 28, 3420�3428. [Online]. Available: http://papers.nips.cc/paper/5965-bayesian-dark-knowledge.pdf +[54] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, �Face model compression by dis.tilling knowledge from neurons,� in Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 3560�3566. +[55] T. Chen, I. J. Goodfellow, and J. Shlens, �Net2net: Accelerating learning via knowledge transfer,� Computing Res. Repository, vol. abs/1511.05641, 2015. [Online]. Available: https://arxiv.org/abs/1511.05641 +[56] S. Zagoruyko and N. Komodakis. (2016). Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, Computing Res. Repository, vol. abs/1612.03928. [Online]. Available: http://arxiv.org/ abs/1612.03928 +[57] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, �Dynamic capacity networks,� in Proc. 33rd Int. Conf. Machine Learning, 2016, pp. 2549�2558. +[58] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. [Online]. Available: https://openreview.net/pdf?id=B1ckMDqlg +[59] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, �Deep dynamic neural networks for multimodal gesture segmentation and recognition,� IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583� 1597, 2016. +[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. (2015). Going deeper with convolutions. Proc. IEEE Computer Vision Pattern Recognition. [Online]. Available: http://arxiv.org/ abs/1409.4842 +[61] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, �Deep networks with stochastic depth,� Computing Res. Repository, vol. arXiv:1603.09382, 2016. +[62] Y. Yamada, M. Iwamura, and K. Kise. (2016). Deep pyramidal residual networks with separated stochastic depth, Computing Res. Repository, vol. abs/1612.01230. [Online]. Available: http://arxiv.org/abs/1612.01230 +[63] M. Mathieu, M. Henaff, and Y. Lecun, �Fast training of convolutional networks through FFTs,� Computing Res. Repository, vol. arXiv:1312.5851, 2014. +[64] A. Lavin and S. Gray, �Fast algorithms for convolutional neural networks,� in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4013� 4021. +[65] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, �Gradient-based learning applied to document recognition,� Proc. IEEE, pp. 2278�2324, 1998. +[66] J. T. Springenberg, A. Dosovitskiy, T. 
<> <> <>


<> <> <>
MOGRIFIER LSTM


Gábor Melis†, Tomáš Kočiský†, Phil Blunsom†‡
{melisgl,tkocisky,pblunsom}@google.com
† DeepMind, London, UK
‡ University of Oxford


ABSTRACT


Many advances in Natural Language Processing have been based upon more expressive
models for how inputs interact with the context in which they occur. Recurrent
networks, which have enjoyed a modicum of success, still lack the generalization
and systematicity ultimately required for modelling language. In this work, we
propose an extension to the venerable Long Short-Term Memory in the form of
mutual gating of the current input and the previous output. This mechanism affords
the modelling of a richer space of interactions between inputs and their context.
Equivalently, our model can be viewed as making the transition function given
by the LSTM context-dependent. Experiments demonstrate markedly improved
generalization on language modelling in the range of 3–4 perplexity points on Penn
Treebank and Wikitext-2, and 0.01–0.05 bpc on four character-based datasets. We
establish a new state of the art on all datasets with the exception of Enwik8, where
we close a large gap between the LSTM and Transformer models.
1 INTRODUCTION

The domination of Natural Language Processing by neural models is hampered only by their limited
ability to generalize and questionable sample complexity (Belinkov and Bisk 2017; Jia and Liang
2017; Iyyer et al. 2018; Moosavi and Strube 2017; Agrawal et al. 2016), their poor grasp of grammar
(Linzen et al. 2016; Kuncoro et al. 2018), and their inability to chunk input sequences into meaningful
units (Wang et al. 2017). While direct attacks on the latter are possible, in this paper, we take a
language-agnostic approach to improving Recurrent Neural Networks (RNN, Rumelhart et al. (1988)),
which brought about many advances in tasks such as language modelling, semantic parsing, and machine
translation, with no shortage of non-NLP applications either (Bakker 2002; Mayer et al. 2008). Many
neural models are built from RNNs, including the sequence-to-sequence family (Sutskever et al. 2014)
and its attention-based branch (Bahdanau et al. 2014). Thus, innovations in RNN architecture tend to
have a trickle-down effect from language modelling, where evaluation is often the easiest and data
the most readily available, to many other tasks, a trend greatly strengthened by ULMFiT (Howard
and Ruder 2018), ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018), which promote language
models from architectural blueprints to pretrained building blocks.
To improve the generalization ability of language models, we propose an extension to the LSTM
(Hochreiter and Schmidhuber 1997), where the LSTM's input x is gated conditioned on the output of
the previous step, h_prev. Next, the gated input is used in a similar manner to gate the output of the
previous time step. After a couple of rounds of this mutual gating, the last updated x and h_prev are
fed to an LSTM. By introducing these additional gating operations, in one sense, our model joins
the long list of recurrent architectures with gating structures of varying complexity which followed
the invention of Elman Networks (Elman 1990). Examples include the LSTM, the GRU (Chung et al.
2015), and even designs by Neural Architecture Search (Zoph and Le 2016).
Intuitively, in the lowermost layer, the first gating step scales the input embedding (itself a representation
of the average context in which the token occurs) depending on the actual context, resulting in a
contextualized representation of the input. While intuitive, as Section 4 shows, this interpretation
cannot account for all the observed phenomena.
In a more encompassing view, our model can be seen as enriching the mostly additive dynamics of
recurrent transitions, placing it in the company of the Input Switched Affine Network (Foerster et al.

<
>

Figure 1: Mogrifier with 5 rounds of updates. The previous state h^0 = h_prev is transformed linearly (dashed
arrows), fed through a sigmoid and gates <> in an element-wise manner producing <>. Conversely, the
linearly transformed <> gates <> and produces <>. After a number of repetitions of this mutual gating cycle, the
last values of the h and x sequences are fed to an LSTM cell. The prev subscript of h is omitted to reduce clutter.


2017) with a separate transition matrix for each possible input, and the Multiplicative RNN (Sutskever
et al. 2011), which factorizes the three-way tensor of stacked transition matrices. Also following
this line of research are the Multiplicative Integration LSTM (Wu et al. 2016) and – closest to our
model in the literature – the Multiplicative LSTM (Krause et al. 2016). The results in Section 3.4
demonstrate the utility of our approach, which consistently improves on the LSTM and establishes a
new state of the art on all but the largest dataset, Enwik8, where we match similarly sized transformer
models.

2 MODEL

To allow for ease of subsequent extension, we present the standard LSTM update (Sak et al. 2014)
with input and state of size m and n respectively as the following function:

<>

The updated state c and the output h are computed as follows:

<>

where <> is the logistic sigmoid function, <> is the elementwise product, and <> and b are weight
matrices and biases.
While the LSTM is typically presented as a solution to the vanishing gradients problem, its gate i
can also be interpreted as scaling the rows of weight matrices <> (ignoring the non-linearity in
j). In this sense, the LSTM nudges Elman Networks towards context-dependent transitions and
the extreme case of Input Switched Affine Networks. If we took another, larger step towards that
extreme, we could end up with Hypernetworks (Ha et al. 2016). Here, instead, we take a more
cautious step, and equip the LSTM with gates that scale the columns of all its weight matrices <>
in a context-dependent manner. The scaling of the matrices <> (those that transform the cell input)
makes the input embeddings dependent on the cell state, while the scaling of <> does the reverse.
The Mogrifier LSTM is an LSTM where the two inputs x and h_prev modulate one another in
an alternating fashion before the usual LSTM computation takes place (see Fig. 1). That is,
<>, where the modulated inputs x↑ and h↑_prev are defined as the highest-indexed x^i and h^i_prev,
respectively, from the interleaved sequences

<> (1)

<> (2)

with <>. The number of "rounds", r, is a hyperparameter; r = 0 recovers the LSTM.
Multiplication with the constant 2 ensures that randomly initialized Q^i, R^i matrices result in
transformations close to the identity. To reduce the number of additional model parameters, we
typically factorize the Q^i, R^i matrices as products of low-rank matrices: <> with <>, where <>
is the rank.

1 It's like a transmogrifier 2 without the magic: it can only shrink or expand objects.
2 Transmogrify (verb, 1650s): to completely alter the form of something in a surprising or magical manner.
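To make the update above concrete, here is a minimal NumPy sketch of the mutual gating rounds of Eq. (1)-(2), with the LSTM cell itself left out; the weight shapes, the initialization scale, and the dictionary-based storage of the Q^i, R^i matrices are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mogrify(x, h_prev, Q, R, rounds=5):
        # Odd steps rescale x using h_prev; even steps rescale h_prev using the
        # freshly updated x. The factor 2 keeps randomly initialized gates close
        # to the identity transformation.
        for i in range(1, rounds + 1):
            if i % 2 == 1:
                x = 2.0 * sigmoid(Q[i] @ h_prev) * x
            else:
                h_prev = 2.0 * sigmoid(R[i] @ x) * h_prev
        return x, h_prev

    # Low-rank parameterization Q^i = Q^i_left @ Q^i_right with rank k, as described above.
    m, n, k, rounds = 64, 128, 16, 5
    rng = np.random.default_rng(0)
    Q = {i: rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) * 0.01
         for i in range(1, rounds + 1, 2)}
    R = {i: rng.standard_normal((n, k)) @ rng.standard_normal((k, m)) * 0.01
         for i in range(2, rounds + 1, 2)}

    x, h_prev = rng.standard_normal(m), rng.standard_normal(n)
    x_mog, h_mog = mogrify(x, h_prev, Q, R, rounds)
    # x_mog and h_mog would then replace x and h_prev in a standard LSTM cell.

With r = 0 the loop body never runs and the plain LSTM is recovered, which matches the role of r described above.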
3 EXPERIMENTS

3.1 THE CASE FOR SMALL-SCALE

Before describing the details of the data, the experimental setup and the results, we take a short detour
to motivate work on smaller-scale datasets. A recurring theme in the history of sequence models is
that the problem of model design is intermingled with optimizability and scalability. Elman Networks
are notoriously difficult to optimize, a property that ultimately gave birth to the idea of the LSTM,
but also to more recent models such as the Unitary Evolution RNN (Arjovsky et al. 2016) and fixes
like gradient clipping (Pascanu et al. 2013). Still, it is far from clear – if we could optimize these
models well – how different their biases would turn out to be. The non-separability of model and
optimization is fairly evident in these cases.
Scalability, on the other hand, is often optimized for indirectly. Given the limited ability of current
models to generalize, we often compensate by throwing more data at the problem. To fit a larger
dataset, model size must be increased. Thus the best performing models are evaluated based on their
scalability 3 . Today, scaling up still yields tangible gains on down-stream tasks, and language
modelling data is abundant. However, we believe that simply scaling up will not solve the generalization
problem and better models will be needed. Our hope is that by choosing small enough datasets, so
that model size is no longer the limiting factor, we get a number of practical advantages:

- Generalization ability will be more clearly reflected in evaluations even without domain adaptation.
- Turnaround time in experiments will be reduced, and the freed up computational budget can be
  put to good use by controlling for nuisance factors.
- The transient effects of changing hardware performance characteristics are somewhat lessened.

Thus, we develop, analyse and evaluate models primarily on small datasets. Evaluation on larger
datasets is included to learn more about the models' scaling behaviour and because of its relevance
for applications, but it is to be understood that these evaluations come with much larger error bars
and provide more limited guidance for further research on better models.

3.2 DATASETS

We compare models on both word and character-level language modelling datasets. The two word-level
datasets we picked are the Penn Treebank (PTB) corpus by Marcus et al. (1993) with preprocessing
from Mikolov et al. (2010) and Wikitext-2 by Merity et al. (2016), which is about twice
the size of PTB with a larger vocabulary and lighter preprocessing. These datasets are definitely
on the small side, but – and because of this – they are suitable for exploring different model biases.
Their main shortcoming is the small vocabulary size, only in the tens of thousands, which makes
them inappropriate for exploring the behavior of the long tail. For that, open vocabulary language
modelling and byte pair encoding (Sennrich et al. 2015) would be an obvious choice. Still, our
primary goal here is the comparison of the LSTM and Mogrifier architectures, thus we instead opt
for character-based language modelling tasks, where vocabulary size is not an issue, the long tail
is not truncated, and there are no additional hyperparameters as in byte pair encoding that make
fair comparison harder. The first character-based corpus is Enwik8 from the Hutter Prize dataset
(Hutter 2012). Following common practice, we use the first 90 million characters for training and
the remaining 10 million evenly split between validation and test. The character-level task on the

3 Note that the focus on scalability is not a problem per se. Indeed the unsupervised pretraining methods
(Peters et al. 2018; Devlin et al. 2018) take great advantage of this approach.
Table 1: Word-level perplexities of near state-of-the-art models, our LSTM baseline and the Mogrifier on PTB
and Wikitext-2. Models with Mixture of Softmaxes (Yang et al. 2017) are denoted with MoS, depth N with dN.
MC stands for Monte-Carlo dropout evaluation. Previous state-of-the-art results in italics. Note the comfortable
margin of 2.8–4.3 perplexity points the Mogrifier enjoys over the LSTM.

<
> + + Mikolov preprocessed PTB corpus (Merity et al. 2018) is unique in that it has the disadvantages of + closed vocabulary without the advantages of word-level modelling, but we include it for comparison + to previous work. The final character-level dataset is the Multilingual Wikipedia Corpus (MWC, + Kawakami et al. (2017)), from which we focus on the English and Finnish language subdatasets in + the single text, large setting. + + 3.3 SETUP + + We tune hyperparameters following the experimental setup of Melis et al. (2018) using a black-box + hyperparameter tuner based on batched Gaussian Process Bandits (Golovin et al. 2017). For the + LSTM, the tuned hyperparameters are the same:input_embedding_ratio,learning_rate,l2_penalty, + input_dropout,inter_layer_dropout,state_dropout,output_dropout. For the Mogrifier, the number + of rounds r and the rank k of the low-rank approximation is also tuned (allowing for full rank, too). + For word-level tasks, BPTT (Werbos et al. 1990) window size is set to 70 and batch size to 64. For + character-level tasks, BPTT window size is set to 150 and batch size to 128 except for Enwik8 where + the window size is 500. Input and output embeddings are tied for word-level tasks following Inan + et al. (2016) and Press and Wolf (2016). Optimization is performed with Adam (Kingma and Ba + 2014) with <>, a setting that resembles RMSProp without momentum. Gradients are clipped + (Pascanu et al. 2013) to norm 10. We switch to averaging weights similarly to Merity et al. (2017) + after a certain number of checkpoints with no improvement in validation cross-entropy or at 80% of + the training time at the latest. We found no benefit to using two-step finetuning. + Model evaluation is performed with the standard, deterministic dropout approximation or Monte- + Carlo averaging (Gal and Ghahramani 2016) where explicitly noted (MC). In standard dropout + evaluation, dropout is turned off while in MC dropout predictions are averaged over randomly + sampled dropout masks (200 in our experiments). Optimal softmax temperature is determined on + the validation set, and in the MC case dropout rates are scaled (Melis et al. 2018). Finally, we report + results with and without dynamic evaluation (Krause et al. 2017). Hyperparameters for dynamic + evaluation are tuned using the same method (see AppendixA for details). + We make the code and the tuner output available at https://github.com/deepmind/lamb. + + 3.4 RESULTS + + Table1 lists our results on word-level datasets. On the PTB and Wikitext-2 datasets, the Mogrifier + has lower perplexity than the LSTM by 3–4 perplexity points regardless of whether or not dynamic + evaluation (Krause et al. 2017) and Monte-Carlo averaging are used. On both datasets, the state of + the art is held by the AWD LSTM (Merity et al. 2017) extended with Mixture of Softmaxes (Yang + + Table 2: Bits per character on character-based datasets of near state-of-the-art models, our LSTM baseline + and theMogrifier. Previous state-of-the-art results in italics. Depth N is denoted withdN. MC stands for + Monte-Carlo dropout evaluation. Once again the Mogrifier strictly dominates the LSTM and sets a new state of + the art on all but the Enwik8 dataset where with dynamic evaluation it closes the gap to the Transformer-XL of + similar size (y Krause et al. (2019),zBen Krause, personal communications, May 17, 2019). On most datasets, + model size was set large enough for underfitting not to be an issue. 
This was very much not the case with Enwik8, + so we grouped models of similar sizes together for ease of comparison. Unfortunately, a couple of dynamic + evaluation test runs diverged (NaN) on the test set and some were just too expensive to run (Enwik8, MC). + + <
> + + et al. 2017) and FRAGE (Gong et al. 2018). The Mogrifier improves the state of the art without either + of these methods on PTB, and without FRAGE on Wikitext-2. + Table2 lists the character-level modelling results. On all datasets, our baseline LSTM results are much + better than those previously reported for LSTMs, highlighting the issue of scalability and experimental + controls. In some cases, these unexpectedly large gaps may be down to lack of hyperparameter tuning + as in the case of Merity et al. (2017), or in others, to using a BPTT window size (50) that is too small + for character-level modelling (Melis et al. 2017) in order to fit the model into memory. The Mogrifier + further improves on these baselines by a considerable margin. Even the smallest improvement of + 0.012 bpc on the highly idiosyncratic, character-based, Mikolov preprocessed PTB task is equivalent + to gaining about 3 perplexity points on word-level PTB. MWC, which was built for open-vocabulary + language modelling, is a much better smaller-scale character-level dataset. On the English and the + Finnish corpora in MWC, the Mogrifier enjoys a gap of 0.033-0.046 bpc. Finally, on the Enwik8 + dataset, the gap is 0.029-0.039 bpc in favour of the Mogrifier. + + <
> + + Figure 2: “No-zigzag” Mogrifier for the ablation study. Gating is always based on the original inputs. + + Table 3: PTB ablation study validation perplexities with 24M parameters. + + <
>

Of particular note is the comparison to Transformer-XL (Dai et al. 2019), a state-of-the-art model
on larger datasets such as Wikitext-103 and Enwik8. On PTB, without dynamic evaluation, the
Transformer-XL is on par with our LSTM baseline, which puts it about 3.5 perplexity points behind
the Mogrifier. On Enwik8, also without dynamic evaluation, the Transformer-XL has a large, 0.09 bpc
advantage at similar parameter budgets, but with dynamic evaluation this gap disappears. However,
we did not test the Transformer-XL ourselves, so fair comparison is not possible due to differing
experimental setups and the rather sparse result matrix for the Transformer-XL.

4 ANALYSIS

4.1 ABLATION STUDY

The Mogrifier consistently outperformed the LSTM in our experiments. The optimal settings were
similar across all datasets, with <> and <> (see Appendix B for a discussion of
hyperparameter sensitivity). In this section, we explore the effect of these hyperparameters and show
that the proposed model is not unnecessarily complicated. To save computation, we tune all models
using a shortened schedule with only 145 epochs instead of 964 and a truncated BPTT window
size of 35 on the word-level PTB dataset, and evaluate using the standard, deterministic dropout
approximation with a tuned softmax temperature.
Fig. 3 shows that the number of rounds r greatly influences the results. Second, we found the low-rank
factorization of Q^i and R^i to help a bit, but the full-rank variant is close behind, which is what we
observed on other datasets as well. Finally, to verify that the alternating gating scheme is not overly
complicated, we conditioned all newly introduced gates on the original inputs x and h_prev (see Fig. 2).
That is, instead of Eq. 1 and Eq. 2, the no-zigzag updates are

<>

In our experiments, the no-zigzag variant underperformed the baseline Mogrifier by a small but
significant margin, and was on par with the <> model in Fig. 3, suggesting that the Mogrifier's
iterative refinement scheme does more than simply widen the range of possible gating values of x
and h_prev to (0, 2^⌈r/2⌉) and (0, 2^⌊r/2⌋), respectively.

4.2 COMPARISON TO THE mLSTM

The Multiplicative LSTM (Krause et al. 2016), or mLSTM for short, is closest to our model in
the literature. It is defined as mLSTM(x, c_prev, h_prev) = LSTM(x, c_prev, h^m_prev), where h^m_prev = <>.

<
>

Figure 4: Cross-entropy vs sequence length in the reverse copy task with i.i.d. tokens. Lower is better. The
Mogrifier is better than the LSTM even in this synthetic task with no resemblance to natural language.


In this formulation, the differences are readily apparent. First, the mLSTM
allows for multiplicative interaction between x and h_prev, but it only overrides h_prev, while in the
Mogrifier the interaction is two-way, which – as the ablation study showed – is important. Second,
the mLSTM can change not only the magnitude but also the sign of values in h_prev, something with
which we experimented in the Mogrifier, but could not get to work. Furthermore, in the definition of
h^m_prev, the unsquashed linearities and their elementwise product make the mLSTM more sensitive to
initialization and unstable during optimization.
On the Enwik8 dataset, we greatly improved on the published results of the mLSTM (Krause et al.
2016). In fact, even our LSTM baseline outperformed the mLSTM by 0.03 bpc. We also conducted
experiments on PTB based on our reimplementation of the mLSTM following the same methodology
as the ablation study and found that the mLSTM did not improve on the LSTM (see Table 3).
Krause et al. (2016) posit and verify the recovery hypothesis, which says that having just suffered
a large loss, the loss on the next time step will be smaller on average for the mLSTM than for the
LSTM. This was found not to be the case for the Mogrifier. Neither did we observe a significant
change in the gap between the LSTM and the Mogrifier in the tied and untied embeddings settings,
which would be expected if recovery was affected by x and h_prev being in different domains.


4.3 THE REVERSE COPY TASK

Our original motivation for the Mogrifier was to allow the context to amplify salient and attenuate
nuisance features in the input embeddings. We conduct a simple experiment to support this point
of view. Consider the reverse copy task where the network reads an input sequence of tokens and
a marker token, after which it has to repeat the input in reverse order. In this simple sequence-to-sequence
learning (Sutskever et al. 2014) setup, the reversal is intended to avoid the minimal time lag
problem (Hochreiter and Schmidhuber 1997), which is not our focus here.
The experimental setup is as follows. For the training set, we generate 500000 examples by uniformly
sampling a given number of tokens from a vocabulary of size 1000. The validation and test sets
are constructed similarly, and contain 10000 examples. The model consists of an independent,
unidirectional encoder and a decoder, whose total number of parameters is 10 million. The decoder
is initialized from the last state of the encoder. Since overfitting is not an issue here, no dropout is
necessary, and we only tune the learning rate, the l2 penalty, and the embedding size for the LSTM.
For the Mogrifier, the number of rounds r and the rank k of the low-rank approximation are also
tuned.
We compare the case where both the encoder and decoder are LSTMs to where both are Mogrifiers.
Fig. 4a shows that, for sequences of length 50 and 100, both models can solve the task perfectly. At
higher lengths though, the Mogrifier has a considerable advantage. Examining the best hyperparameter
settings found, the embedding/hidden sizes for the LSTM and Mogrifier are 498/787 vs 41/1054 at
150 steps, and 493/790 vs 181/961 at 200 steps. Clearly, the Mogrifier was able to work with a much
smaller embedding size than the LSTM, which is in line with our expectations for a model with a
more flexible interaction between the input and recurrent state. We also conducted experiments with
a larger model and vocabulary size, and found the effect even more pronounced (see Fig. 4b).
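As an illustration of the data generation protocol just described, the snippet below builds reverse-copy examples by uniform sampling. It is only a sketch: the choice of token indices, the marker token id, and the sequence length are assumptions not fixed by the text.

    import numpy as np

    def make_reverse_copy_dataset(num_examples, seq_len, vocab_size=1000, seed=0):
        # Tokens 1..vocab_size are content tokens; 0 is reserved here for the
        # end-of-input marker (an assumed convention).
        rng = np.random.default_rng(seed)
        marker = 0
        inputs = rng.integers(1, vocab_size + 1, size=(num_examples, seq_len))
        # Encoder input: the sequence followed by the marker token.
        encoder_in = np.concatenate(
            [inputs, np.full((num_examples, 1), marker)], axis=1)
        # Decoder target: the same sequence in reverse order.
        decoder_target = inputs[:, ::-1]
        return encoder_in, decoder_target

    # 500000 training and 10000 validation examples, as in the setup above.
    train_x, train_y = make_reverse_copy_dataset(500000, seq_len=150)
    valid_x, valid_y = make_reverse_copy_dataset(10000, seq_len=150, seed=1)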
4.4 WHAT THE MOGRIFIER IS NOT

The results on the reverse copy task support our hypothesis that input embeddings are enriched by
the Mogrifier architecture, but that cannot be the full explanation as the results of the ablation study
indicate. In the following, we consider a number of hypotheses about where the advantage of the
Mogrifier lies and the experiments that provide evidence against them.

- Hypothesis: the benefit is in scaling x and h_prev. We verified that data dependency is a crucial
  feature by adding a learnable scaling factor to the LSTM inputs. We observed no improvement.
  Also, at extremely low-rank (less than 5) settings where the amount of information in its gating is
  small, the Mogrifier loses its advantage.

- Hypothesis: the benefit is in making optimization easier. We performed experiments with different
  optimizers (SGD, RMSProp), with intra-layer batch normalization and layer normalization on
  the LSTM gates. While we cannot rule out an effect on optimization difficulty, in all of these
  experiments the gap between the LSTM and the Mogrifier was the same.

- Hypothesis: exact tying of embeddings is too constraining, the benefit is in making this relationship
  less strict. Experiments conducted with untied embeddings and character-based models
  demonstrate improvements of similar magnitude.

- Hypothesis: the benefit is in the low-rank factorization of <> implicitly imposing structure on
  the LSTM weight matrices. We observed that the full-rank Mogrifier also performed better than
  the plain LSTM. We conducted additional experiments where the LSTM's gate matrices were
  factorized and observed no improvement.

- Hypothesis: the benefit comes from better performance on rare words. The observed advantage
  on character-based modelling is harder to explain based on frequency. Also, in the reverse copy
  experiments, a large number of tokens were sampled uniformly, so there were no rare words at all.

- Hypothesis: the benefit is specific to the English language. This is directly contradicted by the
  Finnish MWC and the reverse copy experiments.

- Hypothesis: the benefit is in handling long-range dependencies better. Experiments in the episodic
  setting (i.e. sentence-level language modelling) exhibited the same gap as the non-episodic ones.

- Hypothesis: the scaling up of inputs saturates the downstream LSTM gates. The idea here is that
  saturated gates may make states more stable over time. We observed the opposite: the means
  of the standard LSTM gates in the Mogrifier were very close between the two models, but their
  variance was smaller in the Mogrifier.


5 CONCLUSIONS AND FUTURE WORK

We presented the Mogrifier LSTM, an extension to the LSTM, with state-of-the-art results on
several language modelling tasks. Our original motivation for this work was that the context-free
representation of input tokens may be a bottleneck in language models and by conditioning the
input embedding on the recurrent state some benefit was indeed derived.
While it may be part of the + explanation, this interpretation clearly does not account for the improvements brought by conditioning + the recurrent state on the input and especially the applicability to character-level datasets. Positioning + our work on the Multiplicative RNN line of research offers a more compelling perspective. + To give more credence to this interpretation, in the analysis we highlighted a number of possible + alternative explanations, and ruled them all out to varying degrees. In particular, the connection + to the mLSTM is weaker than expected as the Mogrifier does not exhibit improved recovery (see + Section4.2), and on PTB the mLSTM works only as well as the LSTM. At the same time, the + evidence against easier optimization is weak, and the Mogrifier establishing some kind of sharing + between otherwise independent LSTM weight matrices is a distinct possibility. + Finally, note that as shown by Fig.1 and Eq.1-2, the Mogrifier is a series of preprocessing steps + composed with the LSTM function, but other architectures, such as Mogrifier GRU or Mogrifier + Elman Network are possible. We also leave investigations into other forms of parameterization of + context-dependent transitions for future work. + + ACKNOWLEDGMENTS + + We would like to thank Ben Krause for the Transformer-XL dynamic evaluation results, Laura + Rimell, Aida Nematzadeh, Angeliki Lazaridou, Karl Moritz Hermann, Daniel Fried for helping with + experiments, Chris Dyer, Sebastian Ruder and Jack Rae for their valuable feedback. + + + REFERENCES + Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. + arXiv preprint arXiv:1606.07356, 2016. + + Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. InInternational + Conference on Machine Learning, pages 1120–1128, 2016. + + Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to + align and translate.arXiv preprint arXiv:1409.0473, 2014. + + Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. arXiv preprint + arXiv:1810.06682, 2018. + + Bram Bakker. Reinforcement learning with long short-term memory. InAdvances in neural information + processing systems, pages 1475–1482, 2002. + + Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation.arXiv + preprint arXiv:1711.02173, 2017. + + Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural + networks. InInternational Conference on Machine Learning, pages 2067–2075, 2015. + + Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan + Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context.arXiv preprint + arXiv:1901.02860, 2019. + + Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional + transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. + + Jeffrey L Elman. Finding structure in time.Cognitive science, 14(2):179–211, 1990. + + Jakob N Foerster, Justin Gilmer, Jascha Sohl-Dickstein, Jan Chorowski, and David Sussillo. Input switched + affine networks: An rnn architecture designed for interpretability. InProceedings of the 34th International + Conference on Machine Learning-Volume 70, pages 1136–1145. JMLR. org, 2017. + + Yarin Gal and Zoubin Ghahramani. 
A theoretically grounded application of dropout in recurrent neural networks. + InAdvances in Neural Information Processing Systems, pages 1019–1027, 2016. + + Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google + vizier: A service for black-box optimization. InProceedings of the 23rd ACM SIGKDD International + Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017. + + Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: frequency-agnostic word + representation. InAdvances in Neural Information Processing Systems, pages 1334–1345, 2018. + + David Ha, Andrew Dai, and Quoc V Le. Hypernetworks.arXiv preprint arXiv:1609.09106, 2016. + + Sepp Hochreiter and Jürgen Schmidhuber. Lstm can solve hard long time lag problems. InAdvances in neural + information processing systems, pages 473–479, 1997. + + Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification.arXiv preprint + arXiv:1801.06146, 2018. + + Marcus Hutter. The human knowledge compression contest.URL http://prize. hutter1. net, 6, 2012. + + Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework + for language modeling.CoRR, abs/1611.01462, 2016. URLhttp://arxiv.org/abs/1611.01462. + + Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example generation with + syntactically controlled paraphrase networks.arXiv preprint arXiv:1804.06059, 2018. + + Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems.arXiv preprint + arXiv:1707.07328, 2017. + + Kazuya Kawakami, Chris Dyer, and Phil Blunsom. Learning to create and reuse words in open-vocabulary + neural language modeling.arXiv preprint arXiv:1704.06986, 2017. + + Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, + 2014. + + Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative LSTM for sequence modelling.CoRR, + abs/1609.07959, 2016. URLhttp://arxiv.org/abs/1609.07959. + + Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence + models.arXiv preprint arXiv:1709.07432, 2017. + + Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language + models.arXiv preprint arXiv:1904.08378, 2019. + + Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. Lstms can learn + syntax-sensitive dependencies well, but modeling structure makes them better. InProceedings of the 56th + Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, + 2018. + + Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of lstms to learn syntax-sensitive + dependencies.Transactions of the Association for Computational Linguistics, 4:521–535, 2016. + + Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of + english: The Penn treebank.Computational linguistics, 19(2):313–330, 1993. + + Hermann Mayer, Faustino Gomez, Daan Wierstra, Istvan Nagy, Alois Knoll, and Jürgen Schmidhuber. A system + for robotic heart surgery that learns to tie knots using recurrent neural networks.Advanced Robotics, 22 + (13-14):1521–1537, 2008. + + Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. + arXiv preprint arXiv:1707.05589, 2017. 
+ + Gábor Melis, Charles Blundell, Tomáš Kociskˇ y, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. Pushing` + the bounds of dropout.arXiv preprint arXiv:1805.09208, 2018. + + Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.CoRR, + abs/1609.07843, 2016. URLhttp://arxiv.org/abs/1609.07843. + + Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. + arXiv preprint arXiv:1708.02182, 2017. + + Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple + scales.arXiv preprint arXiv:1803.08240, 2018. + + Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural` + network based language model. InInterspeech, volume 2, page 3, 2010. + + Nafise Sadat Moosavi and Michael Strube. Lexical features in coreference resolution: To be used with caution. + arXiv preprint arXiv:1704.06779, 2017. + + Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In + International conference on machine learning, pages 1310–1318, 2013. + + Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke + Zettlemoyer. Deep contextualized word representations.arXiv preprint arXiv:1802.05365, 2018. + + Ofir Press and Lior Wolf. Using the output embedding to improve language models.CoRR, abs/1608.05859, + 2016. URL http://arxiv.org/abs/1608.05859. + + David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating + errors.Cognitive modeling, 5(3):1, 1988. + + Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory based recurrent neural + network architectures for large vocabulary speech recognition.CoRR, abs/1402.1128, 2014. URL http://arxiv.org/abs/1402.1128. + + Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword + units.arXiv preprint arXiv:1508.07909, 2015. + + Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In + Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011. + + Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances + in neural information processing systems, pages 3104–3112, 2014. + + Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. Sequence + modeling via segmentations. InProceedings of the 34th International Conference on Machine Learning- + Volume 70, pages 3674–3683. JMLR. org, 2017. + + Paul J Werbos et al. Backpropagation through time: what it does and how to do it.Proceedings of the IEEE, 78 + (10):1550–1560, 1990. + + Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. On multiplicative + integration with recurrent neural networks. InAdvances in neural information processing systems, pages + 2856–2864, 2016. + + Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: a + high-rank rnn language model.arXiv preprint arXiv:1711.03953, 2017. + + Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning.CoRR, abs/1611.01578, + 2016. URLhttp://arxiv.org/abs/1611.01578. + + + APPENDIX A HYPERPARAMETER TUNING RANGES + + In all experiments, we tuned hyperparameters using Google Vizier (Golovin et al. 2017). 
The tuning
ranges are listed in Table 4. Obviously, mogrifier_rounds and mogrifier_rank are tuned only for the
Mogrifier. If input_embedding_ratio > 1, then the input/output embedding sizes and the hidden
sizes are set to equal and the linear projection from the cell output into the output embeddings space
is omitted. Similarly, mogrifier_rank ≤ 0 is taken to mean full rank <> without factorization.
Since Enwik8 is a much larger dataset, we don't tune input_embedding_ratio and specify tighter
tuning ranges for dropout based on preliminary experiments (see Table 5).
Dynamic evaluation hyperparameters were tuned according to Table 6. The highest possible value
for max_time_steps, the BPTT window size, was 20 for word, and 50 for character-level tasks. The
batch size for estimating the mean squared gradients over the training data was set to 1024, gradient
clipping was turned off, and the l2 penalty was set to zero.

Table 4: Hyperparameter tuning ranges for all tasks except Enwik8.

<
> + + + Table 5: Hyperparameter tuning ranges for Enwik8. + + <
> + + + Table 6: Hyperparameter tuning ranges for dynamic evaluation. + + <
>

APPENDIX B HYPERPARAMETER SENSITIVITY

The parallel coordinate plots in Figs. 5 and 6 give a rough idea about hyperparameter sensitivity. The
red lines correspond to hyperparameter combinations closest to the best solution found. To find the
closest combinations, we restricted the range for each hyperparameter separately to about 15% of its
entire tuning range.
For both the LSTM and the Mogrifier, the results are at most 1.2 perplexity points off the best result,
so our results are somewhat insensitive to jitter in the hyperparameters. Still, in this setup, grid search
would require orders of magnitude more trials to find comparable solutions.
On the other hand, the tuner does take advantage of the stochasticity of training, and repeated runs
with the same parameters may give slightly worse results. To gauge the extent of this effect, on
PTB we estimated the standard deviation in reruns of the LSTM with the best hyperparameters to be
about 0.2 perplexity points, but the mean was about 0.7 perplexity points off the result produced with
the weights saved in the best tuning run.

<
> + + Figure 5: Average per-word validation cross-entropies for hyperparameter combinations in the neighbourhood of + the best solution for a 2-layer LSTM with 24M weights on the Penn Treebank dataset. + + <
> + + Figure 6: Average per-word validation cross-entropies for hyperparameter combinations in the neighbourhood + of the best solution for a 2-layer Mogrifier LSTM with 24M weights on the Penn Treebank dataset. + feature_mask_rank and feature_mask_roundsare aliases for mogrifier_rank and mogrifier_rounds +<> <> <> + + +<> <> <> + Movement Pruning: + Adaptive Sparsity by Fine-Tuning + + + Victor Sanh 1 , Thomas Wolf 1 , Alexander M. Rush 1,2 + 1 Hugging Face, 2 Cornell University + {victor,thomas}@huggingface.co;arush@cornell.edu + + + Abstract + + Magnitude pruning is a widely used strategy for reducing model size in pure + supervised learning; however, it is less effective in the transfer learning regime that + has become standard for state-of-the-art natural language processing applications. + We propose the use of movement pruning, a simple, deterministic first-order weight + pruning method that is more adaptive to pretrained model fine-tuning. We give + mathematical foundations to the method and compare it to existing zeroth- and + first-order pruning methods. Experiments show that when pruning large pretrained + language models, movement pruning shows significant improvements in high- + sparsity regimes. When combined with distillation, the approach achieves minimal + accuracy loss with down to only 3% of the model parameters. + + + 1 Introduction + + Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art + performance in applications in natural language processing and related fields. In this setup, a large + model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to + perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and + dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these + large models, and training the models have high environmental costs [Strubell et al., 2019]. + Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at + only a small cost of accuracy. Pruning methods, which remove weights based on their importance, + are a particularly simple and effective method for compressing models to be sent to edge devices such + as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high + absolute values, is the most widely used method for weight pruning. It has been applied to a large + variety of architectures in computer vision [Guo et al., 2016], in language processing [Gale et al., + 2019], and more recently has been leveraged as a core component in the lottery ticket hypothesis + [Frankle et al., 2019]. + While magnitude pruning is highly effective for standard supervised learning, it is inherently less + useful in the transfer learning regime. In supervised learning, weight values are primarily determined + by the end-task training data. In transfer learning, weight values are mostly predetermined by the + original model and are only fine-tuned on the end task. This prevents these methods from learning to + prune based on the fine-tuning step, or “fine-pruning.” + In this work, we argue that to effectively reduce the size of models for transfer learning, one should + instead use movement pruning, i.e., pruning approaches that consider the changes in weights during + fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and + high values can be pruned if they shrink during training. 
This strategy moves the selection criteria
from the 0th to the 1st order and facilitates greater pruning based on the fine-tuning objective. To
test this approach, we introduce a particularly simple, deterministic version of movement pruning
utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019,
Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of
remaining weights), we observe significant improvements over magnitude pruning and other 1st-order
methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original
BERT performance with only 5% of the encoder's weights on natural language inference (MNLI)
[Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of
the differences between magnitude pruning and movement pruning shows that the two methods lead
to radically different pruned models with movement pruning showing greater ability to adapt to the
end-task.

2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning.
Most similar to our approach are methods for using parallel score matrices to augment the weight
matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied for
convolutional networks. Differing from our methods, these methods keep the weights of the model fixed
(either from a randomly initialized network or a pre-trained network) and the scores are updated to
find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights.
LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for
deletion. Our method does not require the (possibly costly) computation of second-order derivatives
since the importance scores are obtained simply as the by-product of the standard fine-tuning. Theis
et al. [2018] and Ding et al. [2019] use the absolute value or the square value of the gradient. In
contrast, we found it useful to preserve the direction of movement in our algorithm.
Compressing pretrained language models for transfer learning is also a popular area of study. Other
approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning
[Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model
and targets individual weights. We also show that having a teacher can further improve our approach.
Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train
sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning
stage. Finally, another popular compression approach is quantization. Quantization has been applied
to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014],
providing high memory compression rates at the cost of no or little performance loss. As shown in
previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and
can be combined to further improve the performance/size ratio.

3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies.
Let
<> refer to a generic weight matrix in the model (we consider square matrices, but they
could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of
associated importance scores <>. Given importance scores, each pruning strategy computes a
mask <>. Inference for an input x becomes <>, where <> is the Hadamard
product. A common strategy is to keep the top-v percent of weights by importance. We define <>
as a function which selects the v% highest values in

<> (1)

Magnitude-based weight pruning determines the mask based on the absolute value of each weight <>
as a measure of importance. Formally, we have importance scores <> and masks <>.
There are several extensions to this base setup. Han et al. [2015] use iterative magnitude
pruning: the model is first trained until convergence and weights with the lowest
magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights
fixed to 0. This loop is repeated until the desired sparsity level is reached.

<>

Table 1: Summary of the pruning methods considered in this work and their specificities. The
expression of <> regularization is detailed in Eq (3).


In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements
magnitude pruning by allowing masked weights to be updated such that they are not fixed for the
entire duration of the training. Automated gradual pruning enables the model to recover from previous
masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level <>
during training using a cubic sparsity scheduler: <>.
The sparsity level <> at time step <> is increased from an initial value v_i (usually 0)
to a final value v_f in n pruning steps after t_i steps of warm-up. The model is thus pruned and trained jointly.

4 Movement Pruning

Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running
model. In this work, we focus on movement pruning methods where importance is derived from
first-order information. Intuitively, instead of selecting weights that are far from zero, we retain
connections that are moving away from zero during the training process. We consider two versions of
movement pruning: hard and soft.
For (hard) movement pruning, masks are computed using the Top_v function: <>. Unlike magnitude
pruning, during training, we learn both the weights <> and their importance scores S.
During the forward pass, we compute <> for all entries (i, j).
Since the gradient of Top_v is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya
and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al.,
2013]. In the backward pass, Top_v is ignored and the gradient goes "straight-through" to S. The
approximation of the gradient of the loss L with respect to <> is given by

<> (2)

This implies that the scores of weights are updated, even if these weights are masked in the forward
pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
We also consider a relaxed (soft) version of movement pruning based on the binary mask function
described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter v with a fixed global
threshold value <> that controls the binary mask. The mask is calculated as <>. In order to
control the sparsity level, we add a regularization term <> which encourages the importance scores
to decrease over time. 1 The coefficient <> controls the penalty intensity and thus the sparsity level.
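As a concrete reading of the hard variant, the following NumPy sketch ties together the Top_v mask, the straight-through score update of Eq. (2)/(4), and one common form of the cubic sparsity schedule of Zhu and Gupta [2018]. Function names, the score learning rate alpha_s, and the exact schedule constants are assumptions, since the corresponding expressions are elided above.

    import numpy as np

    def top_v_mask(scores, v):
        # Keep the fraction v of entries with the highest importance scores.
        k = max(1, int(round(v * scores.size)))
        threshold = np.partition(scores.ravel(), -k)[-k]
        return (scores >= threshold).astype(scores.dtype)

    def cubic_remaining_fraction(t, t_i, n, v_start=1.0, v_end=0.10):
        # Cubic schedule: the fraction of remaining weights decays from v_start
        # (fully dense) to v_end over n steps after t_i warm-up steps.
        if t < t_i:
            return v_start
        progress = min(1.0, (t - t_i) / float(n))
        return v_end + (v_start - v_end) * (1.0 - progress) ** 3

    # One illustrative training step of hard movement pruning.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((768, 768)) * 0.02   # weights being fine-tuned
    S = np.zeros_like(W)                         # importance scores
    alpha_s, step, warmup, n_prune = 0.01, 100, 50, 1000

    v = cubic_remaining_fraction(step, warmup, n_prune)
    M = top_v_mask(S, v)                         # mask used in the forward pass only
    x = rng.standard_normal(768)
    a = (W * M) @ x                              # forward pass with masked weights

    grad_W = rng.standard_normal(W.shape)        # stand-in for dL/dW from backprop
    # Straight-through update of the scores (Eq. 2/4): S accumulates -alpha_s * dL/dW * W,
    # so weights that move away from zero during fine-tuning earn higher scores.
    S -= alpha_s * grad_W * W

In this sketch the schedule is expressed in terms of remaining weights rather than sparsity; either convention works as long as Top_v receives the fraction of weights to keep.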
Finally we note that these approaches yield a similar update as L0 regularization-based pruning, another
movement-based pruning approach [Louizos et al., 2017]. Instead of straight-through, L0 uses the
hard-concrete distribution, where the mask M is sampled for all <> with hyperparameters <>,
<>, and <>:

<>

The expected <> norm has a closed form involving the parameters of the hard-concrete: E(L0) =
<>. Thus, the weights and scores of the model can be optimized in an end-to-end fashion
to minimize the sum of the training loss L and the expected L0 penalty. A coefficient l0 controls
the L0 penalty and indirectly the sparsity level. Gradients take a similar form:

<> (3)

At test time, a non-stochastic estimation of the mask is used: <>
and weights multiplied by 0 can simply be discarded.

1 We also experimented with <> but it turned out to be harder to tune while giving similar results.

<>

Figure 1: During fine-tuning (on MNLI), the weights stay close to their pre-trained values which
limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are
plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning
selects weights that are moving away from 0.

<
>
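Since the hard-concrete expressions above are elided, the following small sketch of the stretched hard-concrete gate of Louizos et al. [2017] may help; the constants beta, gamma, and zeta are common defaults from that line of work, not values confirmed by the text.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def hard_concrete_mask(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, training=True):
        # Sample a stretched, rectified concrete gate per weight.
        if training:
            u = np.random.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
            s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)
        else:
            s = sigmoid(log_alpha)            # non-stochastic estimate at test time
        s_bar = s * (zeta - gamma) + gamma    # stretch to the interval (gamma, zeta)
        return np.clip(s_bar, 0.0, 1.0)       # rectify into [0, 1]

    def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
        # Closed-form expected number of non-zero gates, used as the sparsity penalty.
        return sigmoid(log_alpha - beta * np.log(-gamma / zeta)).sum()

    # Usage: gate a weight matrix with one learnable score (log_alpha) per weight.
    W = np.random.randn(16, 16)
    log_alpha = np.zeros_like(W)
    M = hard_concrete_mask(log_alpha, training=True)
    masked_W = W * M
    penalty = expected_l0(log_alpha)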
Table 1 highlights the characteristics of each pruning method. The main differences are in the masking
functions, pruning structure, and the final gradient form.

Method Interpretation. In movement pruning, the gradient of L with respect to <> is given
by the standard gradient derivation: <>. By combining it with Eq (2), we
have <> (we omit the binary mask term <> for simplicity). From the gradient update in
Eq (2), <> is increasing when <>, which happens in two cases:

<> (a)
<> (b)

It means that during training <> is increasing while being positive or is decreasing while being
negative. It is equivalent to saying that S_i,j is increasing when <> is moving away from 0. Inversely,
<> is decreasing when ∂L/∂S_i,j > 0, which means that <> is shrinking towards 0.
While magnitude pruning selects the most important weights as the ones which maximize their
distance to 0 (<>), movement pruning selects the weights which are moving the most away from
0 (<>). For this reason, magnitude pruning can be seen as a 0th order method, whereas movement
pruning is based on a 1st order signal. In fact, S can be seen as an accumulator of movement: from
equation (2), after T gradient updates, we have

<> (4)

Figure 1 shows this difference empirically by comparing weight values during fine-tuning against
their pre-trained value. As observed by Gordon et al. [2020], fine-tuned weights stay close in absolute
value to their initial pre-trained values. For magnitude pruning, this stability around the pre-trained

<
> + + values implies that we know with high confidence before even fine-tuning which weights will be + pruned as the weights with the smallest absolute value at pre-training will likely stay small and be + pruned. In contrast, in movement pruning, the pre-trained weights do not have such an awareness of + the pruning decision since the selection is made during fine-tuning (moving away from 0), and both + low and high values can be pruned. We posit that this is critical for the success of the approach as it + is able to prune based on the task-specific data, not only the pre-trained value. + + 5 Experimental Setup + + Transfer learning for NLP uses large pre-trained language models that are fine-tuned on target tasks + [Ruder et al., 2019, Devlin et al., 2019, Radford et al., 2019, Liu et al., 2019]. We experiment with task- + specific pruning ofBERT-base-uncased, a pre-trained model that contains roughly 84M parameters. + We freeze the embedding modules and fine-tune the transformer layers and the task-specific head. + We perform experiments on three monolingual (English) tasks, which are common benchmarks for + the recent progress in transfer learning for NLP: question answering (SQuAD v1.1) [Rajpurkar et al., + 2016], natural language inference (MNLI) [Williams et al., 2018], and sentence similarity (QQP) + [Iyer et al., 2017]. The datasets respectively contain 8K, 393K, and 364K training examples. SQuAD + is formulated as a span extraction task, MNLI and QQP are paired sentence classification tasks. + For a given task, we fine-tune the pre-trained model for the same number of updates (between 6 + and 10 epochs) across pruning methods 2 . We follow Zhu and Gupta [2018] and use a cubic sparsity + scheduling for Magnitude Pruning (MaP), Movement Pruning (MvP), and Soft Movement Pruning + (SMvP). Adding a few steps of cool-down at the end of pruning empirically improves the performance + especially in high sparsity regimes. The schedule for v is: + + <> (5) + + where tf is the number of cool-down steps. + We compare our results against several state-of-the-art pruning baselines: Re-weighted Proximal + Pruning (RPP) [Guo et al., 2019] combines re-weightedL1 minimization and Proximal Projection + [Parikh and Boyd, 2014] to perform unstructured pruning. LayerDrop [Fan et al., 2020a] leverages + structured dropout to prune models at test time. For RPP and LayerDrop, we report results from + authors. We also compare our method against the mini-BERT models, a collection of smaller BERT + models with varying hyper-parameters [Turc et al., 2019]. + + 6 Results + + Figure 2 displays the results for the main pruning methods at different levels of pruning on each + dataset. First, we observe the consistency of the comparison between magnitude and movement + pruning: at low sparsity (more than 70% of remaining weights), magnitude pruning outperforms + all methods with little or no loss with respect to the dense model whereas the performance of + movement pruning methods quickly decreases even for low sparsity levels. However, magnitude + pruning performs poorly with high sparsity, and the performance drops extremely quickly. In contrast, + first-order methods show strong performances with less than 15% of remaining weights. + Table 2 shows the specific model scores for different methods at high sparsity levels. Magnitude + pruning on SQuAD achieves 54.5 F1 with 3% of the weights compared to 73.6 F1 withL0 regularization, + 76.3 F1 for movement pruning, and 79.9 F1 with soft movement pruning. 
These experiments
 indicate that in high sparsity regimes, importance scores derived from the movement accumulated
 during fine-tuning induce significantly better pruned models compared to absolute values.
 Next, we compare the difference in performance between first-order methods. We see that straight-
 through based hard movement pruning (MvP) is comparable with L0 regularization (with a significant
 gap in favor of movement pruning on QQP). Soft movement pruning (SMvP) consistently outperforms
 hard movement pruning and L0 regularization by a strong margin and yields the strongest performance
 among all pruning methods in high sparsity regimes. These comparisons support the fact that even if
 movement pruning (and its relaxed version soft movement pruning) is simpler than L0 regularization,
 it still yields stronger performances for the same compute budget.
 Finally, movement pruning and soft movement pruning compare favorably to the other baselines,
 except for QQP where RPP is on par with soft movement pruning. Movement pruning also outperforms
 the fine-tuned mini-BERT models. This is coherent with [Li et al., 2020]: it is both more efficient and
 more effective to train a large model and compress it afterward than to train a smaller model from
 scratch. We do note though that current hardware does not support optimized inference for sparse
 models: from an inference speed perspective, it might often be desirable to use a small dense model
 such as mini-BERT over a sparse alternative of the same size.

 2 Preliminary experiments showed that increasing the number of pruning steps tended to improve the end
 performance.

 Figure 2: Comparisons between different pruning methods in high sparsity regimes. Soft movement
 pruning consistently outperforms other methods in high sparsity regimes. We plot the
 performance of the standard fine-tuned BERT along with 95% of its performance.

 <
>

 Table 2: Performance at high sparsity levels. (Soft) movement pruning outperforms current
 state-of-the-art pruning methods at different high sparsity levels.

 <
>

 Distillation further boosts performance Following previous work, we can further leverage knowledge
 distillation [Bucila et al., 2006, Hinton et al., 2014] to boost performance for free in the pruned
 domain [Jiao et al., 2019, Sanh et al., 2019] using our baseline fine-tuned BERT-base model as
 teacher. The training objective is a linear combination of the training loss and a knowledge distillation

 Figure 3: Comparisons between different pruning methods augmented with distillation. Distillation
 improves the performance across all pruning methods and sparsity levels.

 <
> + + Table 3: Distillation-augmented performances for selected high sparsity levels.All pruning methods + benefit from distillation signal further enhancing the ratio Performance VS Model Size. + + <
> + + Figure 4: Magnitude pruning and movement pruning leads to pruned models with radically different + weight distribution. + + <
>

 loss on the output distributions. Figure 3 shows the results on SQuAD, MNLI, and QQP for the three
 pruning methods boosted with distillation. Overall, we observe that the relative comparisons of the
 pruning methods remain unchanged while the performances are strictly increased. Table 3 shows for
 instance that on SQuAD, movement pruning at 10% goes from 81.7 F1 to 84.3 F1. When combined
 with distillation, soft movement pruning yields the strongest performances across all pruning methods
 and studied datasets: it reaches 95% of BERT-base with only a fraction of the weights in the encoder
 (5% on SQuAD and MNLI).

 7 Analysis

 Movement pruning is adaptive Figure 4a compares the distribution of the remaining weights for
 the same matrix of a model pruned at the same sparsity using magnitude and movement pruning. We
 observe that by definition, magnitude pruning removes all the weights that are close to zero, ending
 up with two clusters. In contrast, movement pruning leads to a smoother distribution, which covers
 the whole interval except for values close to 0.
 Figure 4b displays each individual weight against its associated importance score in movement
 pruning. We plot pruned weights in grey. We observe that movement pruning induces no simple
 relationship between the scores and the weights. Both weights with high absolute value or low
 absolute value can be considered important. However, high scores are systematically associated with
 non-zero weights (and thus the “v-shape”). This is coherent with the interpretation we gave to the
 scores (section 4): a high score S indicates that during fine-tuning, the associated weight moved away
 from 0 and is thus non-null.

 Local and global masks perform similarly We study the influence of the locality of the pruning
 decision. While local Top v selects the v% most important weights matrix by matrix, global Top v
 uncovers non-uniform sparsity patterns in the network by selecting the v% most important weights in

 Figure 5: Comparison of local and global selections of weights on SQuAD at different sparsity levels.
 For magnitude and movement pruning, local and global Top v perform similarly at all levels of sparsity.

 <
>

 Figure 6: Remaining weights per layer in the Transformer. Global magnitude pruning tends to prune
 layers uniformly, while global 1st order methods allocate the weight to the lower layers and heavily
 prune the highest layers.

 <
> + + the whole network. Previous work has shown that a non-uniform sparsity across layers is crucial to + the performance in high sparsity regimes [He et al., 2018]. In particular, Mallya and Lazebnik [2018] + found that the sparsity tends to increase with the depth of the network layer. + Figure 5 compares the performance of local selection (matrix by matrix) against global selection + (all the matrices) for magnitude pruning and movement pruning. Despite being able to find a + global sparsity structure, we found that global did not significantly outperform local, except in high + sparsity regimes (2.3 F1 points of difference with 3% of remaining weights for movement pruning). + Even though the distillation signal boosts the performance of pruned models, the end performance + difference between local and global selections remains marginal. + Figure 6 shows the remaining weights percentage obtained per layer when the model is pruned until + 10% with global pruning methods. Global weight pruning is able to allocate sparsity non-uniformly + through the network, and it has been shown to be crucial for the performance in high sparsity regimes + [He et al., 2018]. We notice that except for global magnitude pruning, all the global pruning methods + tend to allocate a significant part of the weights to the lowest layers while heavily pruning in the + highest layers. Global magnitude pruning tends to prune similarly to local magnitude pruning, i.e., + uniformly across layers. + + 8 Conclusion + + We consider the case of pruning of pretrained models for task-specific fine-tuning and compare + zeroth- and first-order pruning methods. We show that a simple method for weight pruning based on + straight-through gradients is effective for this task and that it adapts using a first-order importance + score. We apply this movement pruning to a transformer-based architecture and empirically show that + our method consistently yields strong improvements over existing methods in high-sparsity regimes. + The analysis demonstrates how this approach adapts to the fine-tuning regime in a way that magnitude + pruning cannot. In future work, it would also be interesting to leverage group-sparsity inducing + penalties [Bach et al., 2011] to remove entire columns or filters. In this setup, we would associate a + score to a group of weights (a column or a row for instance). In the transformer architecture, it would + give a systematic way to perform feature selection and remove entire columns of the embedding + matrix. + + + References + Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi + Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text + transformer.ArXiv, abs/1910.10683, 2019. + Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep + learning in nlp. InACL, 2019. + Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for + efficient neural network. InNIPS, 2015. + Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network + with pruning, trained quantization and huffman coding. InICLR, 2016. + Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. InNIPS, + 2016. + Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks.ArXiv, + abs/1902.09574, 2019. + Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 
The lottery ticket + hypothesis at scale.ArXiv, abs/1903.01611, 2019. + Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients + through stochastic neurons for conditional computation.ArXiv, abs/1308.3432, 2013. + Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep + bidirectional transformers for language understanding. InNAACL, 2019. + Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz + Kaiser, and Illia Polosukhin. Attention is all you need. InNIPS, 2017. + Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through + l0 regularization. InICLR, 2017. + Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for + sentence understanding through inference. InNAACL, 2018. + Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for + machine comprehension of text. InEMNLP, 2016. + Arun Mallya and Svetlana Lazebnik. Piggyback: Adding multiple tasks to a single, fixed network by + learning to mask.ArXiv, abs/1801.06519, 2018. + Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. + What’s hidden in a randomly weighted neural network? InCVPR, 2020. + Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. InNIPS, 1989. + Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon: Extensions and + performance comparisons. InNIPS, 1993. + Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with + dense networks and fisher pruning.ArXiv, abs/1801.05787, 2018. + Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Ji Liu, and Jungong Han. Global sparse + momentum sgd for pruning very deep neural networks. InNeurIPS, 2019. + Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of + bert: smaller, faster, cheaper and lighter. InNeurIPS EMC2 Workshop, 2019. + Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling + task-specific knowledge from bert into simple neural networks.ArXiv, abs/1903.12136, 2019. + Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with + structured dropout. InICLR, 2020a. + Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? InNeurIPS, + 2019. + Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and + multiple languages: lottery tickets in rl and nlp. InICLR, 2020. + Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, + and Armand Joulin. Training with quantization noise for extreme model compression.ArXiv, + abs/2004.07320, 2020b. + Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert.ArXiv, + abs/1910.06188, 2019. + Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional + networks using vector quantization.ArXiv, abs/1412.6115, 2014. + Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gon- + zalez. Train large, then compress: Rethinking model size for efficient training and inference of + transformers.ArXiv, abs/2002.11794, 2020. + Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model + compression. InICLR, 2018. + Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 
Compressing bert: Studying the effects of + weight pruning on transfer learning.ArXiv, abs/2002.08307, 2020. + Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in + natural language processing. InNAACL, 2019. + Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language + models are unsupervised multitask learners. 2019. + Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike + Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining + approach.ArXiv, abs/1907.11692, 2019. + Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First quora dataset release: Question pairs, 2017. + URLhttps://data.quora.com/First-Quora-Dataset-Release-Question-Pairs. + Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lian Lin, and Yanzhi Wang. Reweighted proximal + pruning for large-scale language representation.ArXiv, abs/1909.12486, 2019. + Neal Parikh and Stephen P. Boyd. Proximal algorithms.Found. Trends Optim., 1:127–239, 2014. + Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: + The impact of student initialization on knowledge distillation.ArXiv, abs/1908.08962, 2019. + Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. InKDD, 2006. + Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. + InNIPS, 2014. + Xiaoqi Jiao, Y. Yin, Lifeng Shang, Xin Jiang, Xusong Chen, Linlin Li, Fang Wang, and Qun Liu. + Tinybert: Distilling bert for natural language understanding.ArXiv, abs/1909.10351, 2019. + Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model + compression and acceleration on mobile devices. InECCV, 2018. + Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity + through convex optimization.Statistical Science, 27, 09 2011. doi: 10.1214/12-STS394. + + A Appendices + + A.1 Guarantees on the decrease of the training loss + + As the scores are updated, the relative order of the importances is likely shuffled, and some connections + will be replaced by more important ones. Under certain conditions, we are able to formally prove that + as these replacements happen, the training loss is guaranteed to decrease. Our proof is adapted from + [Ramanujan et al., 2020] to consider the case of fine-tuableW. + We suppose that (a) the training lossLis smooth and admits a first-order Taylor development + everywhere it is defined and (b) the learning rate of <> is small. We define the TopK + function as the analog of the Top v function, wherekis an integer instead of a proportion. We first + consider the case where k=1 in the Top K masking, meaning that only one connection is remaining + (and the other weights are deactivated/masked). Let’s denote <> this sole remaining connection at + stept. Following Eq (1), it means that <> + We suppose that at stept+ 1, connections are swapped and the only remaining connection at step + <>. We have: + + <> + + Eq(6)gives the following inequality: <>. After re-injecting the gradient <> update in Eq (2), we have: + + <> (7) + + Moreover, the conditions in Eq (6) lead to the following inferences: <> and <>. + + Since <> is small, <> is also small. 
Because the training loss L is + smooth, we can write the 1st order Taylor development of L in point <> + + + <> (8) + + The first term is null because of inequalities(6)and the second term is negative because of inequality + (7). Thus <> when connection <> becomes more important than + <>, the connections are swapped and the training loss decreases between step k + t and <>. Similarly, we can generalize the proof to a set <> of N swapping connections. + We note that this proof is not specific to theTopKmasking function. In fact, we can extend the proof + using theThresholdmasking function <> [Mallya and Lazebnik, 2018]. Inequalities + (6) are still valid and the proof stays unchanged. + + Last, we note these guarantees do not hold if we consider the absolute value of the scoresjSi;j j(as + it is done in Ding et al. [2019] for instance). We prove it by contradiction. If it was the case, it + would also be true one specific case: thenegative thresholdmasking function <> where + <>. + We suppose that at stept+ 1, the only remaining connection(i;j)is replaced by(k;l): + + <> (9) + + The inequality on the gradient update becomes:<> and following i + the same development as in Eq(8), we have <> the loss increases. + <> We proved by contradiction that the guarantees on the decrease of the loss do not hold if we consider k + the absolute value of the score as a proxy for importance. +<> <> <> + + + <> <> <> + Network Pruning + + As one of the earliest works in network pruning, Yann Lecun's Optimal brain + damage (OBD) paper has been cited in many of the papers. + Some research focuses on module network designs. "These models, such as + SqueezeNet , MobileNet  and Shufflenet, are basically made up of low resolutions + convolution with lesser parameters and better performance." + Many recent papers -I've read- ephasize on structured pruning (or sparsifying) as a + compression and regularization method, as opposed to other techniques such as + non-structured pruning (weight sparsifying and connection pruning), low rank + approximation and vector quantization (references to these approaches can be + found in the related work sections of the following papers).  + Difference between structred and non-structured pruning: + "Non-structured pruning aims to remove single parameters that have little + influence on the accuracy of networks". For example, L1-norm regularization on + weights is noted as non-structured pruning- since it's basically a weight + sparsifying method, i.e removes single parameter. + The term 'structure' refers to a structured unit in the network. So instead of + pruning individual weights or connections, structured pruning targets neurons, + filters, channels, layers etc. But the general implementation idea is the same as + penalizing individual weights: introducing a regularization term (mostly in the + form of L1-norm) to the loss function to penalize (sparsify) structures. + I focused on structured pruning and read through the following papers: + + 1. Structured Pruning of Convolutional Neural Networks via L1 + Regularization (August 2019) + "(...) network pruning is useful to remove redundant parameters, filters, + channels or neurons, and address the over-fitting issue." + + Provides a good review of previous work on non-structured and structured + pruning. + "This study presents a scheme to prune filters or neurons of fully-connected + layers based on L1 regularization to zero out the weights of some filters or + neurons." 
+ Didn't quite understand the method and implementation. There are two key + elements: mask and threshold. "(...) the problem of zeroing out the values of + some filters can be transformed to zero some mask." || "Though the proposed + method introduces mask, the network topology will be preserved because the + mask can be absorbed into weight." || "Here the mask value cannot be + completely zeroed in practical application, because the objective function (7) is + non-convex and the global optimal solution may not be obtained. A strategy is + adopted in the proposed method to solve this problem. If the order of + magnitude of the mask value is small enough, it can be considered almost as + zero. Thus, to decide whether the mask is zero, a threshold is introduced. (...) + The average value of the product of the mask and the weight is used to + determine whether the mask is exactly zero or not." + From what I understand they use L1 norm in the loss function to penalize + useless filters through penalizing masks. And a threshold value is introduced + to determine when the mask is small enough to be considered zero.  + They test on MNIST (models: Lenet-5) and CIFAR-10 (models: VGG-16, ResNet- + 32) + + 2. Learning Efficient Convolutional Networks through Network Slimming (August + 2017) + Git repo + "Our approach imposes L1 regularization on the scaling factors in batch + normalization (BN) layers, thus it is easy to implement without introducing any + change to existing CNN architectures. Pushing the values of BN scaling factors + towards zero with L1 regularization enables us to identify insignificant channels + (or neurons), as each scaling factor corresponds to a specific convolutional + channel (or a neuron in a fully-connected layer)." + They provide a good insight on advantages and disadvantages of other + computation reduction methods such as low rank approximation, vector + quantization etc.  + I believe here they use the word 'channel' to refer to filters (?). + "Our idea is introducing a scaling factor γ for each channel, which is multiplied + to the output of that channel. Then we jointly train the network weights and + these scaling factors, with sparsity regularization imposed on the latter. Finally + + we prune those channels with small factors, and fine-tune the pruned network. + " --> so instead of 'mask' they use the 'scaling factor' and impose regularization + on that, but the idea is very similar. + "The way BN normalizes the activations motivates us to design a simple and + efficient method to incorporates the channel-wise scaling factors. Particularly, + BN layer normalizes the internal activations using mini-batch statistics." || " + (...) we can directly leverage the γ parameters in BN layers as the scaling factors + we need for network slim- ming. It has the great advantage of introducing no + overhead to the network." They test on CIFAR and SVHN (models: VGG-16, ResNet-164, DenseNet-40), + ImageNet (model: VGG-A) and MNIST (model: Lenet) + + 3. Learning Structured Sparsity in Deep Neural Networks (Oct 2016) + Git repo + " (...) we propose Structured Sparsity Learning (SSL) method to directly learn a + compressed structure of deep CNNs by group Lasso regularization during the + training. SSL is a generic regularization to adaptively adjust multiple structures + in DNN, including structures of filters, channels, and filter shapes within each + layer, and structure of depth beyond the layers." || " (...) 
offering not only well- + regularized big models with improved accuracy but greatly accelerated + computation." + +  "Here W represents the collection of all weights in the DNN; ED(W) is the loss + on data; R(·) is non-structured regularization applying on every weight, e.g., L2- + norm; and Rg(·) is the structured sparsity regularization on each layer. Because + Group Lasso can effectively zero out all weights in some groups [14][15], we + adopt it in our SSL. The regularization of group Lasso on a set of weights w can + be represented as, where w(g) is a group of partial weights in w and G is the total number of + groups. "<>" In SSL, the learned “structure” is decided by the way of splitting + groups of w(g). We investigate and formulate the filer-wise, channel-wise, + shape-wise, and depth-wise structured sparsity (...)" + They test on MNIST (models: Lenet, MLP), CIFAR-10 (models: ConvNet, ResNet- + 20) and ImageNet (model:AlexNet) + The authors also provide a visualization of filters after pruning, showing that + only important detectors of patterns remain after pruning. + + In conclusions: "Moreover, a variant of SSL can be performed as structure + regularization to improve classification accuracy of state-of-the-art DNNs." + + 4. Learning both Weights and Connections for Efficient Neural Networks (Oct 2015) + "After an initial training phase, we remove all connections whose weight is + lower than a threshold. This pruning converts a dense, fully-connected layer to + a sparse layer." || "We then retrain the sparse network so the remaining + connections can compensate for the connections that have been removed. The + phases of pruning and retraining may be repeated iteratively to further reduce + network complexity. In effect, this training process learns the network + connectivity in addition to the weights (...)" + Although the description above implies the pruning was done only for FC + layers, they also do pruning on convolutional layers - although they don't + provide much detail on this in the methods. But there's this statement when + they explain retraining: "(...) we fix the parameters for CONV layers and only + retrain the FC layers after pruning the FC layers, and vice versa.". The results + section also shows that convolutional layer connections were also + pruned on the tested models. + They test on MNIST (models: Lenet-300-100 (MLP), Lenet-5 (CNN)) and + ImageNet (models: AlexNet, VGG-16) + The authors provide a visualization of the sparsity patterns of neurons after + pruning (for an FC layer) which shows that pruning can detect visual attention + regions. + The method used in this paper targets individual parameters (weights) to + prune. So, technically this should be considered as a non-structured pruning + method. However, the reason I think this is referenced as a structured pruning + method is that if all connections of a neuron is pruned (i.e all input and output + weights were below threshold), the neuron itself will be removed from the + network:  "After pruning connections, neurons with zero input connections or + zero output connections may be safely pruned." + SIDENOTE: They touch on the use of global average pooling instead of fully + connected layers in CNNs: "There have been other attempts to reduce the + number of parameters of neural networks by replacing the fully connected + layer with global average pooling." + + 5. Many more can be picked from the references of these papers. + + There's a paper on Bayesian compression for Deep Learning from 2017. 
Their + hypothesis is: "By employing sparsity inducing priors for hidden units (and not + individual weights) we can prune neurons including all their ingoing and outgoing + weights." However, the method is mathematically heavy and the related work + references are quite old (1990s, 2000s). +<> <> <> + + +<> <> <> + Network Trimming: A Data-Driven Neuron Pruning + Approach towards Efficient Deep Architectures + + Hengyuan Hu % Rui Peng % Yu-Wing Tai Chi-Keung Tang + HKUST HKUST SenseTime Group Limited HKUST + hhuaa@ust.hk rpeng@ust.hk yuwing@sensetime.com cktang@cse.ust.hk + + + Abstract + + State-of-the-art neural networks are getting deeper and wider. While their performance + increases with the increasing number of layers and neurons, it is crucial to + design an efficient deep architecture in order to reduce computational and memory + costs. Designing an efficient neural network, however, is labor intensive requiring + many experiments, and fine-tunings. In this paper, we introduce network trimming + which iteratively optimizes the network by pruning unimportant neurons based on + analysis of their outputs on a large dataset. Our algorithm is inspired by an observation + that the outputs of a significant portion of neurons in a large network are + mostly zero, regardless of what inputs the network received. These zero activation + neurons are redundant, and can be removed without affecting the overall accuracy + of the network. After pruning the zero activation neurons, we retrain the network + using the weights before pruning as initialization. We alternate the pruning and + retraining to further reduce zero activations in a network. Our experiments on the + LeNet and VGG-16 show that we can achieve high compression ratio of parameters + without losing or even achieving higher accuracy than the original network. + + + 1 Introduction + + Neural networks have been widely adopted in many scenarios, achieving state-of-the-art results in + numerous tasks [1] [2]. One of the keys to improved performance is their increased depth and width + and thus the increased number of parameters. In computer vision, we have witnessed orders of + magnitude increase in the number of parameters in CNNs from LeNet with less than 1M parameters + in handwritten digit classification [3] to Deepface with more than 120M parameters in human face + classification [4]. + Although CNNs with elegant network architectures are easy to deploy in real-world tasks, designing + one can be hard and labor-intensive, which involves significant amount of effort in empirical experiments. + In terms of designing the network architecture, one crucial part is to determine the number of + neurons in each layer. There is no way to directly arrive at an optimal number of neurons for each + layer and thus even the most successful network architectures use empirical numbers like 128, 512, + 4096. Experienced scientists often arrive at the numbers once they deem the network have enough + representation power for the specific task. However, the extremely sparse matrices produced by top + layers of neural networks have caught our attention, indicating that empirically designed networks are + heavily oversized. After some simple statistics, we find that many neurons in a CNN have very low + activations no matter what data is presented. Such weak neurons are highly likely to be redundant + and can be excluded without damaging the overall performance. 
Their existence can only increase + the chance of overfitting and optimization difficulty, both of which are harmful to the network. + With the motivation of achieving more efficient network architectures by finding the optimal number + of neurons in each layer, we come up with an iterative optimization method that gradually eliminates + Part of the work was done when Hengyuan Hu and Rui Peng were interns in SenseTime Group Limited + weak neurons in a network via a pruning-retraining loop. Starting from an empirically designed + network, our algorithm first identifies redundant weak neurons by analyzing their activations on a + large validation dataset. Then those weak neurons are pruned while others are kept to initialize a new + model. Finally, the new model is retrained or fine-tuned depending on the performance drop. The + retrained new model can maintain the same or achieve higher performance with smaller number of + neurons. This process can be carried out iteratively until a satisfying model is produced. + + 2 Related Work + + Significant redundancy has been demonstrated in several deep learning models [5] and such + redundancy is mainly caused by the overwhelming amount of parameters in deep neural networks. An + over-parameterized model not only wastes memory and computation, but also leads to serious overfitting + problem. Therefore, reducing the number of parameters has been studied by many researchers + in this field. However, there is little work directly addressing the optimization of the number of + neurons. Most previous works on improving network architectures fall in two main categories; one + concentrates on the high level architectural design and the other focuses on low level weight pruning. + On the high level side, some researchers invented new layers or modules to substitute main bottleneck + components in conventional neural networks. Two famous examples of this kind are the global + average pooling in Network in Network [6] invented to replace the extremely dense parameterized + fully connected layer and the inception module employed by GoogLeNet [7] to avoid explosion in + computational complexity at later stage. Both methods achieve state-of-the-art results on several + benchmarks with much less memory and computation consumption. More recently, SqueezeNet [8] + used a Fire module together with other strategies to achieve AlexNet-level accuracy with 50% less + parameters. + On the low level side, different methods have been explored to reduce number of connections and + weights in neural networks. Some late 20th century methods, such as magnitude-based approach [9] + and Hessian matrix based approach [10], prune weights basing on numerical properties of the weights + and loss functions without any external data involved. Han et al. recently proposed an iterative + method [11] to prune connections in deep architectures, together with an external compression by + quantization and encoding [12]. The network is first pruned by removing low weights connections. + Then, learned mapping of similar weights to fixed bits are used to perform quantization of weights + after pruning, which facilitates the Huffman coding compression in the last stage to reduce bits for + storage. When all three techniques used in pipeline, the number of parameters in the network can be + reduced by around 10%. 
+ While above methods work well in practice by reducing number of parameters directly, we seek + answers to the fundamental problem that lies in the middle of those two levels of approaches – + determining the optimal number of neurons for each layer for a given network architecture and + specific tasks. Along our direction, not only can we achieve parameter savings without the need of + seeking new network architectures, we can also evaluate the redundancy in each layer of a network, + and thus provide guidance on effective ways for architecture optimization in large neural networks. + + 3 Zero Activations and Network Trimming + + In this section, we describe our algorithm for network trimming. To facilitate our discussions, we + use VGG-16 [13] as our case study. The VGG-16 network consists of 13 convolutional layers, and 3 + fully connected layers. Each of the layers is followed by a ReLU [14] layer for non-linear mapping. + The VGG-16 is recognized as one of the representative network which has been adopted to many + applications [15] [16], not limited to object classification and localization tasks. + + + 3.1 Zero Activations in VGG-16 + + We define Average Percentage of Zeros (APoZ) to measure the percentage of zero activations of + a neuron after the ReLU mapping. Let <> denotes the output ofc-th channel ini-th layer, our + <> of the c-th neuron ini-th layer is defined as: + + <> (1) + + where <> if true, and <> if false,M denotes the dimension of output feature map + of <>, and N denotes the total number of validation examples. The larger number of validation + examples, the more accurate is the measurement of APoZ. In our experiment, we use the validation + set (N= 50;000) of ImageNet classification task to measure APoZ. + We use the definition of APoZ to evaluate the importance of each neuron in a network. To validate + our observation that the outputs of some neurons in a large network are mostly zero, we calculate the + APoZ of each neuron and find that there are631neurons in the VGG-16 network which have APoZ + larger than90%. + + Table 1: Mean APoZ of each layer in VGG-16 + + <
> + + To better understand the behavior of zero activations in a network, we compute the mean APoZ + (Table 1) of all neurons in each layer (except for the last one) of the VGG-16 network. Since the + VGG-16 network has inverse pyramid shape, most redundancy occurs at the higher convolutional + layers and the fully connected layers. The higher mean APoZ also indicates more redundancy in a + layer. Detailed distributions of APoZ of 512 CONV5-3 neurons and 4096 FC6 neurons are shown in + Figure 1, 2 respectively. Since a neural network has a multiplication-addition-activation computation + process, a neuron which has its outputs mostly zeros will have very little contribution to the output of + subsequent layers, as well as to the final results. Thus, we can remove those neurons without harming + too much to the overall accuracy of the network. In this way, we can find the optimal number of + neurons for each layer and thus obtain a better network without redesign and extensive human labor. + + <
> . <
> + + Figure 1: CONV5-3 APoZ Distribution Figure 2: FC6 APoZ Distribution + + 3.2 Network Trimming and Retraining + + Our network trimming method consists of three main steps, as illustrated in Figure 3. First the network + is trained under conventional process and the number of neurons in each layer is set empirically. Next, + we run the network on a large validation dataset to obtain the APoZ of each neuron. + Neurons with high APoZ are pruned according to certain criteria. The connections to and from the + neuron are removed accordingly when a neuron is pruned (see Figure 4 5). After the neuron pruning, + the trimmed network is initialized using the weights before trimming. The trimmed network exhibits + + <
> <
> <
>

 Figure 3: Three main steps for trimming. Figure 4: Before pruning. Figure 5: After pruning.

 some level of performance drop. Thus, in the final step, we retrain the network to strengthen the
 remaining neurons to enhance the performance of the trimmed network.
 The weight initialization is necessary for the network to obtain the same performance as it had before
 the trimming. If a trimmed network is trained from scratch, we find that it contains a larger percentage
 of zero activation neurons than the counterpart with weight initialization. This means that a retrained
 network without weight initialization is much less efficient.
 We experimented with different ways to prune the neurons according to the APoZ measurements. We
 found that pruning too many neurons at once severely damaged the performance, and the performance
 drops are unrecoverable. Therefore, we chose an iterative scheme to trim a network. However, it is
 not trivial to trim a network with a deep architecture. If too many layers are trimmed in one step, the
 performance would drop by a large margin, and it is hard to recover the original performance before
 trimming through the retraining. For example, trimming CONV4, CONV5, FC6 and FC7 of the
 VGG-16 network concurrently would lead to a 46.650% top-5 accuracy in the image classification
 task, where the original accuracy of VGG-16 2 is 88.444%. On the other hand, if only the CONV5-3
 and FC6 are trimmed, the trimmed network with weight initialization before retraining can achieve
 85.900% top-5 accuracy. After retraining, the trimmed network achieves 90.278% accuracy, which is
 even higher than the original accuracy before trimming.
 Empirically, we found that starting to trim from a few layers with high mean APoZ, and then
 progressively trimming their neighboring layers, can rapidly reduce the number of neurons while maintaining
 the performance of the original network. To decide which neurons to prune, we empirically found
 that pruning the neurons whose APoZ is larger than one standard deviation from the mean APoZ
 of the target trimming layer would produce good retraining results. Using this threshold, we would
 reject 16% of neurons on average from the trimmed layers, assuming that the APoZ values roughly
 follow a Gaussian distribution.

 4 Experiments

 We implemented our algorithm using the standard Caffe [17] library. To obtain the weights for
 initialization for retraining, we use the Python and PyCaffe interface to copy the weights of remaining
 connections after the trimming. We tested our algorithm primarily on two networks, LeNet [3] on
 the MNIST dataset and VGG-16 on the ImageNet classification dataset [18].

 4.1 LeNet

 The LeNet network consists of two convolutional layers followed by two fully connected layers; the
 layers have 20, 50, 500, and 10 outputs respectively. We use the shorthand notation (20-50-500-10) to denote
 the number of neurons in each layer of the network. In LeNet, 93% of the parameters are in the
 connections between the CONV2 layer and the FC1 layer. Consequently, we can easily achieve a
 more efficient network by trimming the size of the CONV2 and FC1 layers.

 2 Single scale, without dense evaluation [13]

 4.1.1 Effectiveness
 We apply our algorithm to iteratively prune the neurons in the CONV2 and FC1 layers, as shown in
 Table 2.
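 For reference, the APoZ statistic of Eq (1) and the mean-plus-one-standard-deviation selection rule
 described above can be sketched as follows (NumPy, illustrative only; the activation tensor layout is an
 assumption):

    import numpy as np

    def apoz_per_channel(relu_outputs):
        """relu_outputs: array of shape (N, C, M) holding post-ReLU activations for
        N validation examples, C channels/neurons, and M spatial positions
        (M = 1 for fully connected layers). Returns one APoZ value per channel."""
        zeros = (relu_outputs == 0)
        return zeros.mean(axis=(0, 2))   # average over examples and spatial positions

    def neurons_to_prune(apoz):
        """Prune neurons whose APoZ exceeds the layer mean by one standard deviation."""
        threshold = apoz.mean() + apoz.std()
        return np.where(apoz > threshold)[0]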
At the first iteration of the pruning, the numbers of neurons in the CONV2 and FC1 layers are
 reduced to 41 and 426 respectively, which achieves 1.41x compression on the number of parameters
 after the first pruning. The accuracy drops from 99.27% to 98.75% after the pruning, but before
 retraining. After retraining the network, we achieve 99.29% accuracy, which is slightly higher than
 the original accuracy. We repeat these processes for 4 iterations. As shown in Table 2, our algorithm
 achieves a 2x to 3x compression of the number of parameters without loss in accuracy.

 Table 2: Iterative Trimming on LeNet

 <
> + + 4.1.2 Necessity of Weight Initialization + We experiment our algorithm with retraining with and without weight initialization, as summarized + in Table 3. The network exhibits deterioration in classification accuracy without weight initialization, + whereas with proper weight initialization from the ancestor network from the previous iteration, the + trimmed network can retain its original or even achieve higher accuracy. + + Table 3: Iterative Trimming on LeNet with and without Weight Initialization + + <
> + + Moreover, we observe that with the weight initialization, the trimmed network consistently has + smaller mean APoZ values than its ancestor network. This means that the retrained network has + less redundancy than its ancestor network. In contrast, mean APoZ values increase if we retrain the + network from scratch even though the trimmed network has less neurons than its ancestor network. + This observation gives us an insight that proper weight initialization is necessary to achieve an + efficient trimmed network. + + 4.2 VGG-16 + + 4.2.1 Effectiveness + With the similar objective to obtain optimal number of neurons in each layer, we analyzed the APoZ + values ofO(i)c for all i and c_in VGG-16 on ImageNet classification validation set. As shown in + Table 1, CONV4, CONV5 and FC layers have higher mean APoZ compared with bottom layers, + exhibiting more redundancy. Drawing from previous experience on LeNet, we focus on the parameter + bottleneck of VGG-16. We trim the VGG-16 network starting from the CONV5-3 and FC6 layers + since they account for 100M/138M parameters. + We iteratively prune neurons from CONV5-3 and FC6 layers. Similar to the case in LeNet, the + trimming process can effectively eliminate neurons with high APoZ. As shown in Figure 6, after + trimming, the entire distribution of APoZ inO(fc6) shifts left, indicating a significant drop in network + + <
> + + Figure 6: FC6 APoZ distribution before and after trimming + + + redundancy. Meanwhile, the diminishing tail on the right side of the curve manifests that the weak + neurons in FC6 are vanishing, a proof of the benefit gained from weight initialization as discussed in + Section 3.2 and 4.1.2. + Table 4: Iterative Trimming Result on VGG-16 {CONV5-3, FC6} + + <
>

 After 6 iterations of trimming, we reduce more than half of the total number of parameters and achieve
 a compression rate of 2.59x, while the trimmed network has 2% to 3% higher Top-1/Top-5 accuracy
 than the original VGG-16 model. The detailed performance of the intermediate models is summarized
 in Table 4. There are two interesting observations in the table. First, the initial accuracy just after
 trimming does not drop much from the last model even though around 500 neurons in CONV5-3 and
 FC6 are pruned in each iteration. This is a strong proof of redundancy in empirically designed neural
 networks. Also, such a small decrease in accuracy can be remedied via a fast fine-tuning instead
 of a week-long retraining. In our experiments, it takes less than 5K iterations to reach the original
 accuracy (with batch size = 256). Therefore, our trimming method allows fast optimization towards a
 better architecture. Secondly, the trimmed networks surprisingly surpass the original VGG-16 in
 accuracy with fewer parameters. The good initialization provided by the previous model sets a promising
 starting point for the trimmed model. In addition, having fewer parameters in FC6 also reduces the
 chance of overfitting, which may also contribute to the increase in accuracy.

 4.2.2 Trimming Multiple Layers
 VGG-16 differs from LeNet greatly in that it has a much deeper architecture with significantly more
 layers, which naturally gives us more options to determine which layers to trim. After the previous
 experiments, we want to further investigate if trimming multiple layers simultaneously can achieve
 the same effectiveness.
 After trimming the CONV5-3 and FC6 layers, we continue to trim their neighboring layers. We
 experimented with three sets of trimming layouts: {CONV5, FC6}, {CONV5, FC6, FC7}, {CONV4,
 CONV5, FC6, FC7} (see Table 5). When more neurons are pruned, the large performance drop in the
 trimmed network indicates retraining is necessary. We use the same set of training hyperparameters
 in our experiments: {base-lr: 0.001, gamma: 0.1, step-size: 3000}. After retraining, the trimmed
 networks gradually recover from the loss of neurons and rise to an accuracy level equivalent to the

 Table 5: Iterative Trimming Result on VGG-16 Many Layers

 <
>

 reference model or slightly higher. In contrast to trimming only one layer, these models regain
 their capacity rather slowly, taking more than 10K iterations to recover the accuracy. Empirically, we
 found that iteratively trimming the network starting from a few layers can achieve better performance.
 We also found that trimming the last convolutional layer and the fully connected layers is the most
 effective. As shown in Table 6, additional trimming of the FC7 layer (based on the previously trimmed model
 (CONV5-3, FC6) = (420, 2121)) can achieve a high 2.7x compression rate with improved accuracy.
 The underlying reason is that once we have pruned the FC6 layer, the numerous zeros contribute to
 the high APoZ values of neurons in the FC7 layer. To reduce network parameters, it
 suffices to just trim the {CONV5-3, FC6, FC7} layers, since around 86% of all the parameters are in
 the {CONV5-3, FC6, FC7} layers.

 Table 6: Iterative Trimming Result on VGG-16 {CONV5-3, FC6, FC7}

 <
>

 5 Discussion

 5.1 Comparison with Connection Pruning

 Work closest to ours is the work by Han et al. [11], where they iteratively prune the network connections
 when the corresponding weights of the connections are close to zero. They also prune a neuron when
 the connections to a neuron are all pruned. Compared with their work, our work is better in two major
 aspects. First, although Han et al. claim that they have achieved a 13x reduction of the number of
 parameters on VGG-16, their reduction is tailored for a CPU implementation of a neural network. In a GPU
 implementation, the convolutional layer is implemented by first vectorizing a 2D feature map into
 a 1D feature vector followed by a matrix multiplication [19]. Thus, if a neuron is not pruned, the
 number of multiplications for the convolutional layers will remain the same since the vectorization is
 performed in a universal manner for all neurons in the same layer. This is also the case for fully
 connected layers, where the number of multiplications is the same for all neurons in the same layer.
 Note that the computational cost of re-vectorizing a 2D feature map to fit different shapes of neuron
 connections, or of adding a conditional mask check, is a lot higher than a simple matrix multiplication
 with redundancy. Our method, on the contrary, removes all unneeded neurons so that they do not
 consume any memory and are not involved in any computation at all. As shown in Section 4.2, the
 trimmed VGG-16 has more than 2x fewer FLOPs in the first fully connected layer.
 Second, pruning a neuron by first pruning all of its connections is less efficient and less effective than
 our APoZ measurement. This is because the number of connections is significantly larger than the
 number of neurons in a network, especially for fully connected layers. In our experiments, we found
 that most of the redundancy resides in fully connected layers, and in the connections between the last
 convolutional layer and the first fully connected layer. However, it is rarely the case that the weights
 of all connections to a neuron in these layers are close to zero. Consequently, it is difficult to prune a
 neuron in these layers. On the other hand, our APoZ measurement can easily identify zero activation
 neurons for pruning regardless of the weights of the connections. The mean APoZ can also be used as a
 guideline to evaluate the effectiveness of a network as demonstrated in our experiments.

 5.2 Dataset Used During Trimming

 In all of our experiments, we train the network using the training set and run the network on the validation
 set to obtain APoZs for neuron pruning. This method may be controversial because the validation set
 should not be glimpsed before finalizing the model, which may potentially lead to overfitting of the
 validation set. We also had the same suspicion, especially after the experiments showed that the trimmed
 model can have 2% higher top-5 accuracy than that of the original VGG-16 on the validation set.
 Therefore, we conduct two more experiments to explore the potential issue.
 In the first experiment, we randomly sampled a subset of the training set with an equal number of images
 (50K) as the validation set. Then we used the same criteria to select weak neurons for pruning. The
 weak neurons selected using the sampled training set have more than 95% overlap with the
 exact neurons selected using the validation set.
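 A small sketch of how such an overlap ratio can be computed from per-neuron APoZ values on the two
 sets is given below. The selection rule reuses the mean-plus-one-standard-deviation threshold from
 Section 3.2, which is an assumption, since the exact rule used for this comparison is not restated here:

    import numpy as np

    def selection_overlap(apoz_sampled_train, apoz_val, num_std=1.0):
        """Fraction of weak neurons selected on the validation set that are also
        selected on a sampled training subset (both via the APoZ threshold rule)."""
        def weak(apoz):
            return set(np.where(apoz > apoz.mean() + num_std * apoz.std())[0])
        train_sel, val_sel = weak(apoz_sampled_train), weak(apoz_val)
        return len(train_sel & val_sel) / max(1, len(val_sel))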
This shows that neurons have consistent activation + performance on training and validation sets. In another word, the trimmed networks learned from + sampled training data will be similar to the trimmed networks learned from the validation set. + In addition, we also tested our model on the test set of ILSVRC2012 classification track. Using + single model without dense evaluation, the original VGG-16 model with11:56%validation error + has an error rate of13:02%on test set. Our trimmed network with configuration {CONV5-3: 420, + FC6: 2121, FC7: 2482, Compression Rate: 2.00, Validation Error:9:7%} achieved10:02%error + rate on test set. Note that the test set and validation set are non-overlapping in this ILSVRC2012 + classification task. Telling from the data, after the network trimming, not only the overall accuracy + is increased, but the gap between validation error and test error is also shrunk, indicating that the + trimmed network has less overfitting. + The two extra experiments dismiss our concern on overfitting. They also suggest that the validation + set can be used for analyzing APoZs. + + 6 Conclusion + + We have presented Network Trimming to prune redundant neurons based on the statistics of neurons’ + activations. With our method, one network architecture can be deployed to handle different tasks on + different datasets and the algorithm can tailor the network accordingly by determining how many + neurons to use for each layer without the need of intensive computational power as well as human + labor. Our method can iteratively remove low activation neurons that provide little power to the final + results without damaging performance of the model. We experimented our algorithm on LeNet and + VGG-16 achieving the same accuracy with 2%3%less parameters. In VGG-16, the trimmed models + can even surpass the original one, which could be caused by the reduced optimization difficulty. + Lying in the middle of high level network redesign and low level weight pruning, neuron pruning can + be applied to any mature architecture together with weight pruning to sharply reduce the complexity + of network. + + References + [1]Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional + neural networks. In: Advances in neural information processing systems. (2012) 1097–1105 + [2]Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and + other neural network architectures. Neural Networks18(5) (2005) 602–610 + [3]Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document + recognition. Proceedings of the IEEE86(11) (Nov 1998) 2278–2324 + [4]Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level + performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision + and Pattern Recognition. (2014) 1701–1708 + [5]Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep + learning. CoRRabs/1306.0543(2013) + [6] Lin, M., Chen, Q., Yan, S.: Network in network. CoRRabs/1312.4400(2013) + [7]Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., + Rabinovich, A.: Going deeper with convolutions. CoRRabs/1409.4842(2014) + [8]Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: Squeezenet: + Alexnet-level accuracy with 50x fewer parameters and< 1mb model size. 
arXiv preprint + arXiv:1602.07360 (2016) + [9]Hanson, S.J., Pratt, L.: Advances in neural information processing systems 1. Morgan Kaufmann + Publishers Inc., San Francisco, CA, USA (1989) 177–185 + [10]Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal brain surgeon. + In: Advances in Neural Information Processing Systems 5, [NIPS Conference], San Francisco, + CA, USA, Morgan Kaufmann Publishers Inc. (1993) 164–171 + [11]Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient + neural networks. CoRRabs/1506.02626(2015) + [12]Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with + pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015) + [13]Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recogni- + tion. arXiv preprint arXiv:1409.1556 (2014) + [14]Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: + Proceedings of the 27th International Conference on Machine Learning (ICML-10). (2010) + 807–814 + [15]Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with + region proposal networks. In: Advances in Neural Information Processing Systems. (2015) + 91–99 + [16]Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to + sequence-video to text. In: Proceedings of the IEEE International Conference on Computer + Vision. (2015) 4534–4542 + [17]Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, + T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM + International Conference on Multimedia, ACM (2014) 675–678 + [18]Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., + Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition + Challenge. International Journal of Computer Vision (IJCV)115(3) (2015) 211–252 + [19]Scherer, D., Schulz, H., Behnke, S.: Accelerating large-scale convolutional neural networks + with parallel graphics multiprocessors. In: Artificial Neural Networks–ICANN 2010. Springer + (2010) 82–91 +<> <> <> + + +<> <> <> + PLUG AND PLAY LANGUAGE MODELS : A SIMPLE APPROACH TO CONTROLLED TEXT GENERATION + + Sumanth Dathathri Andrea Madotto Janice Lan Jane Hung + CMS, Caltech HKUST Uber AI Uber AI + + Eric Frank Piero Molino Jason Yosinski yy Rosanne Liu y + Uber AI Uber AI Uber AI Uber AI + dathathris@gmail.com, amadotto@connect.ust.hk + {janlan, jane.hung, mysterefrank, piero, yosinski, rosanne}@uber.com + + ABSTRACT + + Large transformer-based language models (LMs) trained on huge text corpora + have shown unparalleled generation capabilities. However, controlling attributes + of the generated language (e.g. switching topic or sentiment) is difficult without + modifying the model architecture or fine-tuning on attribute-specific data and en- + tailing the significant cost of retraining. We propose a simple alternative: the Plug + and Play Language Model (PPLM) for controllable language generation, which + combines a pretrained LM with one or more simple attribute classifiers that guide + text generation without any further training of the LM. In the canonical scenario + we present, the attribute models are simple classifiers consisting of a user-specified + bag of words or a single learned layer with 100,000 times fewer parameters than + the LM. 
Sampling entails a forward and backward pass in which gradients from + the attribute model push the LM’s hidden activations and thus guide the + generation. Model samples demonstrate control over a range of topics and sentiment + styles, and extensive automated and human annotated evaluations show attribute + alignment and fluency. PPLMs are flexible in that any combination of differentiable + attribute models may be used to steer text generation, which will allow for + diverse and creative applications beyond the examples given in this paper. + + + 1 INTRODUCTION + + The Transformer architecture (Vaswani et al., 2017) has enabled large-scale language models (LMs) + trained on a huge amount of data (Radford et al., 2019; Dai et al., 2019b; Radford et al., 2018b) to + greatly improve the state-of-the-art on natural language processing tasks. These models are used to + extract contextualized word embeddings for transfer learning purposes (Devlin et al., 2019) and as + natural language generators. The latter can leverage large amounts of unannotated data and a simple + log-likelihood training objective. However, once such models are trained, controlling attributes of + generated text becomes difficult without modifying the model architecture to allow for extra input + attributes or fine-tuning with attribute-specific data (Keskar et al., 2019; Ziegler et al., 2019). + conceptualized PPLMs and led the manuscript writing. SD led thecproject, implemented the PPLM, set + up and ran all modeling experiments, engineered how to obtain workable + gradients via the weighted embedding approach, and made the model work. AM helped with preparing datasets + for discriminator training, automated evaluation, running experiments, and writing the manuscript. SD, RL & + AM ran the external baselines. RL & JL built and oversaw the human evaluation pipeline and computed the + statistics. JH ran the story generation with skeleton prefixes. EF assisted with detoxification experiments. PM + led efforts to migrate to the new pytorch transformer, helped with code release. JY helped with the annotation + pipeline, finding bugs, navigating model and experimental directions, engineering workable gradients, and + posing the model mathematically. RL implemented preliminary experiments and multi-attribute control, and + cleaned and coordinated release of the code. RL & JY oversaw the project. + + <
>

Table 1: The PPLM employs a pre-trained language model (LM) without any changes to the model parameters and can generate text with controlled attributes such as topic and sentiment. We demonstrate control with two tiny and easy-to-construct attribute models: a bag of words (BoW) related to a topic and a linear discriminator trained on top of LM latent representations to control sentiment. The underlined prefix is what the LM is conditioned on to generate a passage of text (e.g. The potato). The controlled attributes are colored and bracketed (e.g. [Science]), and words in the BoW that are directly optimized for are highlighted brightly (e.g. research). The softer highlights correspond to words related to the attribute, but not directly optimized for during the control process (e.g. health).
[-] The potato and cauliflower are both in season to make combo breads, mounds, or pads. For an added challenge, try some garlic mashed potatoes.
[Negative] The potato is a pretty bad idea. It can make you fat, it can cause you to have a terrible immune system, and it can even kill you....
[Positive] The potato chip recipe you asked for! We love making these, and I've been doing so for years. I've always had a hard time keeping a recipe secret. I think it's the way our kids love to eat them - so many little ones.
[Science] The potato was once thought to have no health problems and has been promoted as a nutritious food source since the mid-1800s, but recent reports indicate that it has many harmful health issues. In fact, researchers from Johns Hopkins University...
[Politics][Positive] To conclude this series of articles, I will present three of the most popular and influential works on this topic. The first article deals with the role of women's political participation in building a political system that is representative of the will of the people.
[Politics][Negative] To conclude, the most significant and lasting damage from the economic crisis in 2008 was that many governments, including those in the political center, lost power for the first time in modern history.

Controlled generation entails modeling <>, where a is some desired controllable attribute(s) and x the generated sample. However, generative models only learn <>.
In computer vision, + Plug & Play Generative Networks (PPGN) from Nguyen et al. (2017) developed a mechanism for + generating images with different attributes by plugging a discriminator (attribute model) <> + together with a base generative model <> and sampling from the resulting <>, + effectively creating a conditional generative model on the fly from any supplied attribute model. In + a similar manner, we propose the Plug and Play Language Model (PPLM) for conditional language + generation that combines one or more simple attribute models <>—either in the form of a bag- + of-words (BoW) or single layer classifiers—with a pre-trained, unconditional language model <>. + We sample from the resulting combined model by following gradients in the latent representation + space in a manner inspired by the approximate Metropolis-adjusted Langevin (MALA) (Roberts + et al., 1996; Roberts & Rosenthal, 1998) sampler deployed in Nguyen et al. (2017). + Optimization is performedex post factoin the activation space, the <> for <> re-training or fine- + tuning is needed. Control is fine-grained, with a strength parameter determining how strong the + attribute influence should be; a strength of0fully recovers the original model<>. This design + allows vast flexibility: users can combine a state-of-the-art generative model, which may be large + and difficult to train, with any number of attribute controllers. Attribute models may be easier to train + or untrained (in the case of BoW models), and multiple controllers may be combined flexibly during + inference. In this paper, we demonstrate the PPLM approach using a GPT-2 345M model (Radford + et al., 2019) as the general-purpose LM <>, but the method applies in any representation space + from any transformer-based text generator and allows combination with any attribute model <>. + We demonstrate controlled generation with a number of attribute controllers, assembled and + combined during generation, each with a different strength, acting as a set of “control knobs” that tune + generation towards the desired attribute (see examples in Table 1). Code for the experiments is + available at:https://github.com/uber-research/PPLM. Our key contributions are: + + •We introduce the Plug and Play LM for controlled language generation, discuss its relation + to existing work, and how sampling from a PPLM works (Sections 2 and 3). + •We demonstrate controlling of text generation on a range of attributes, including 7 topics + each defined using a bag of words, and 1 simple discriminator on sentiments. We quantify + effectiveness using both automated evaluation (separately trained perplexity and sentiment + models) as well as human evaluation (for attribute relevance and fluency). All evaluations + point toward the ability of PPLMs to generate attribute controlled, fluent text (Section 4). + •We compare PPLM with CTRL (Keskar et al., 2019) and GPT-2 finetuned for positivity + (Ziegler et al., 2019). Our method, without any LM training, is on par and often outper- + forms the baselines on attribute relevance and fluency (Section 4.2, and Section 4.3). + •We show that the PPLM approach can be used to detoxify instances where generation + of toxic content is likely by following the negative gradient of a model trained to detect + toxicity (Section 4.4). We also show how PPLM can be used for structurally constrained + story writing (Section 4.5). 
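As a rough illustration of the mechanism described above (and detailed in Section 3), the following toy sketch perturbs a single hidden state so that a stand-in LM head assigns more probability mass to a hypothetical bag of words, while a KL term keeps the distribution close to the unperturbed one. This is a minimal sketch, not the authors' implementation: the linear lm_head, the bow_ids, and the step sizes are illustrative placeholders, and the actual PPLM perturbs the key-value history of a pre-trained GPT-2 and adds post-norm geometric-mean fusion.

    # Toy sketch of the PPLM-style update (not the authors' code): perturb a hidden
    # state so a stand-in LM head puts more mass on a bag of words, while a KL term
    # keeps the next-token distribution close to the unperturbed one.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    vocab_size, hidden_dim = 50, 16
    lm_head = torch.nn.Linear(hidden_dim, vocab_size)   # stand-in for the LM's output layer
    for p in lm_head.parameters():
        p.requires_grad_(False)                         # only the latent perturbation is updated
    bow_ids = torch.tensor([3, 7, 11])                  # hypothetical bag-of-words token ids

    h = torch.randn(hidden_dim)                         # unperturbed hidden state from the "LM"
    with torch.no_grad():
        p_orig = F.softmax(lm_head(h), dim=-1)          # unmodified next-token distribution

    delta = torch.zeros_like(h, requires_grad=True)     # the perturbation of the latent
    step_size, kl_scale, num_steps = 0.03, 0.01, 10     # illustrative values only

    for _ in range(num_steps):
        probs = F.softmax(lm_head(h + delta), dim=-1)
        bow_loss = -torch.log(probs[bow_ids].sum())                  # ascend log p(a|x) for a BoW attribute
        kl_loss = kl_scale * (probs * (probs / p_orig).log()).sum()  # stay close to the original p(x)
        loss = bow_loss + kl_loss
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad / (delta.grad.norm() + 1e-12)  # normalized gradient step
            delta.grad.zero_()

    p_new = F.softmax(lm_head(h + delta), dim=-1)
    print("BoW mass before/after:", p_orig[bow_ids].sum().item(), p_new[bow_ids].sum().item())

Running this prints the probability mass on the bag words rising over the steps while the distribution remains anchored to the original by the KL penalty, which is the qualitative behavior the full method relies on.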
+ + 2 RELATED WORK + + Controlled generation Current methods for controlled text generation involve either fine-tuning + existing models with Reinforcement Learning (RL) (Ziegler et al., 2019), training Generative + Adversarial Networks (Yu et al., 2017), or training conditional generative models (Kikuchi et al., 2016; + Ficler & Goldberg, 2017). Different from our approach, these methodologies are not plug and + play, since the entire model needs to be separately fine-tuned for each specific attribute. Keskar + et al. (2019) train a large language model with over 50 different control codes. The results are high + quality because they train exactly to maximize <>, but this comes at the expense of fixing control + codes upfront and of training a very large model (1.6B parameters). Our method does not require + retraining any conditional generative model, and both the language model and the conditional model + can be flexibly assembled. Table 2 gives a comparison of recent approaches to language modeling + tuned for specific attributes. In another interesting but tangential piece of work, Subramani et al. + (2019) recently showed that a pre-trained language model can be steered to recover arbitrary + sentences. In earlier works Gu et al. (2016; 2017); Chen et al. (2018) explored the idea of using a small + neural network to steer an LM. + + Noisy Channel Modeling Yu et al. (2016), and more recently Yu et al. (2019); Yee et al. (2019); + Ng et al. (2019), leveraged the Shannon Noisy Channel Theory (Shannon, 1948) for improving + sequence-to-sequence modeling. Their approach translates a source language sentence y into a target + language sentence x by first sampling from a forward model proposal distribution p forward <> and + then reranking samples based on probabilities given by p backward <>. PPLM scores + samples using the same basic equation, but as we have no forward or proposal model p forward <>, + we rely on the latent space updates, similar to Nguyen et al. (2017). As a baseline, we consider + using<>as a “forward model” and then reranking, which we will see works moderately well in + some scenarios and poorly in others (see Tables 4 and 6). + + Weighted decoding Holtzman et al. (2018); Ghazvininejad et al. (2017) consider controlled + language generation – the former with discriminators, and the latter with a bag of words – where the + decoding procedure is modified to consider the scoring function used for decoding. See et al. (2019) + note that control with weighted decoding (WD) is difficult and often leads to sacrificing fluency and + coherence. Further, Ghazvininejad et al. (2017) strongly relies on sampling from a set of keywords + on a specific topic and it does not allow to bias generation towards a topic in a manner that does not + necessary include a set of keywords. Similarly, Baheti et al. (2018) proposed a decoding strategy + for generating interesting responses in dialogue systems, using bags of words and word embeddings. + Sophisticated sampling methods (Metropolis et al., 1953) can be used to constrain the model + generation to certain keywords and topics. We evaluate WD as a baseline. + + Text Style Transfer Outside of language modeling, the text style transfer studies a related task. + Shen et al. (2017); Hu et al. (2017) train variational auto-encoders for style transfer that rely on + learning disentangled latent representations for style and content. Li et al. 
(2018) demonstrate the + efficacy of a simple approach based on replacing attribute related n-grams with n-grams corresponding + to the desired attribute based on a conditional generative model. A key difference between the + above and our approach is that we use an offline discriminator and perform optimization based on + this discriminator, which as suggested by Elazar & Goldberg (2018) may outperform adversarial + training approaches. More recently, Lample et al. (2019) adapt an approach from unsupervised + language translation to style transfer, where a denoised auto-encoder is trained with an objective + + Table 2: Comparison of the different models and distributions. All models in this table are useful in + different scenarios. The particular advantage of PPLM is that very small, custom attribute models, + <>, may be combined with powerful, general pre-trained language models, <>, to create cheap + but still powerful conditional generative models, <>. + + <
> + + consisting of a weighted combination of a re-construction loss and a back-translation loss. While + the above approaches have shown impressive success on style transfer tasks, the main focus is not + controlled language generation, and further, the methods are not plug and play. + + 3 PLUG AND PLAY LANGUAGE MODELS + + 3.1 LANGUAGE MODELING WITH TRANSFORMERS + + Given a sequence of tokens <>, LMs are trained to compute the unconditional prob- + ability of the sequence <>. This probability can be rewritten in terms of product of conditional + probabilities by recursively applying the chain-rule (Manning et al., 1999; Bengio et al., 2003) as: + + <> (1) + + In this paper, we use a transformer (Vaswani et al., 2017) to model the distribution of natural lan- + guage. To present our approach clearly, we first briefly summarize the transformer using recur- + rent notation. Let us define the history matrixHt to consist of the key-value pairs from the past + <>, where <> corresponds to the key-value pairs <> from the i-th layer + generated at all time-steps from 0 tot. Efficient implementations of the transformer + (Wolf et al., 2019) use the cachedHt to generate <>, given <>. This recurrent interpretation + of a transformer can be summarized as: + + <>; (2) + + where <> a linear transformation that maps the logit vector <> to a vector of vocabulary size, and + then <> is sampled as<> pt+1 =Softmax(Wo t+1 ). This allows for efficient language + generation without repeated forward passes corresponding to the prior conditioning text <>. + + 3.2 STEERING GENERATION :ASCENDING log<> + + In order to control the output of the language model, at every generation step t, we shift the history + Ht in the direction of the sum of two gradients: one toward higher log-likelihood (LL) of the attribute + a under the conditional attribute model<>and one toward higher LL of the unmodified language + model<>. Combining these factors with a variable multiplier provides us with a controllable + “knob” to guide generation in a given direction with a specified strength. The updates are restricted + toHt and not the other model activations because future predictions depend on the past only via Ht + (note thatHt is composed of all transformer key and value pairs generated up to time t). Taking + steps inHt space leads to gradual changes to model activations — which may be thought of as + gradual reinterpretations of the past — that guide future generation in the desired direction. + Let<> be the update toHt , such that generation with (<>) shifts the distribution of + the generated text such that it is more likely to possess the desired attribute. Ht is initialized + + <
> + + Figure 1: Simplified illustration of the proposed approach in three phases. In Step 1, a forward pass + is performed through the language model to compute the likelihood of a desired attribute using an + attribute model that predicts<>. In Step 2, a backward pass updates the internal latent + representations of the LM, using gradients from the attribute model, to increase the likelihood of the passage + having the desired attribute. In Step 3, a new distribution over the vocabulary (<>) is generated + from the updated latents(Het )and the current token <>. The next token is then sampled from the + updated distribution. This process of updating the latents is repeated at each time-step, leading to + a gradual transition towards the desired attribute. For computational efficiency, one may choose to + modify only the latents within some window of the recent past, depicted as the dotted-red region. + + at zero and updated with gradients from an attribute model that measures the extent to which the + generated text possesses the desired attribute (e.g. positivity). We rewrite the attribute model<> + FORMULA>> and then make gradient based updates to <> as follows: + + <> (3) + + where <> is the step size, <> is the scaling coefficient for the normalization term. 1 This update step + can be repeated m times; in practice we use3to10. Subsequently, a forward pass through the LM + with the updated key-value pairs is performed to obtain the updated <>, where <>. + The perturbed oet+1 is then used to generate a new distribution <> as in Equation 2. + + 3.3 ENSURING FLUENCY :ASCENDING log<> + + The approach described in the previous section is able to generate text tuned for a particular + discriminator, but left unchecked it will quickly result in unrealistic adversarial or fooling examples + (Szegedy et al., 2013; Nguyen et al., 2015) as the text moves into low probability regions. To com- + bat this, we use the unconditional language model in two ways that ensure the fluency is maintained + at or near the level of the unconditional language model (here GPT-2). + + Kullback–Leibler (KL) Divergence We update<> to minimize the KL divergence between the + output distribution of the modified and unmodified language models in addition to the step above. + In practice, this is accomplished by adding the quantities together before taking a gradient, though it + can be visualized as two separate steps as in Figure 2. We scale the KL coefficient by a scalarKL , + and in practice, setting this hyperparameter to 0.01 works well in general across tasks. + + Post-norm Geometric Mean Fusion In addition to minimizing KL divergence, which affects the + past via<> , we perform post-norm fusion similarly to Stahlberg et al. (2018). This does not + directly affect<> ; rather, it just serves to constantly tie the generated text to the unconditional + <>LM distribution. We accomplish this by sampling from <>, where <> + and <> are the unmodified and modified output distributions, respectively, and <> is a normalizing + factor such that it forms a valid distribution. As <> this converges to the distribution from + the updated LM, and as <> converges to the unconditional LM distribution. We find that in + practice values for <> in the range 0.8-0.95 work well. + + 1 One normalization term is computed for each layer of the transformer. + + Figure 2: An oversimplified view into why steps + that maximize both log<>and log<> are + needed. 
The sentence under consideration is shown as a black dot, which is first pushed in the direction of maximizing <> and then in the direction of maximizing <>. In practice we use a single step and simply add the log probabilities; we take steps in the continuous space of hidden representations H rather than in the discrete x (byte pair) space, and rather than resampling the entire sentence at each step, we take one step in H space per byte-pair sample.

<
>

3.4 SAMPLING AND RANKING

The attribute model <> in PPLM provides two functionalities: first, a score that can be used to rank samples based on the LL of the desired attribute (forward pass only; Step 1, Figure 1), and second, a gradient ascent direction to perform an update in the latent space (Step 2 & 3; Figure 1). The former can be used to generate r samples and rank them to choose the best one. This can serve as an additional method for attribute control in addition to sampling with updated latents. Further, to avoid the problem of repetitive, low-quality text (Holtzman et al., 2018), we compute the mean over the Dist-1, Dist-2 and Dist-3 scores (for the generated passage), which is an indicator of repetitiveness (Li et al., 2015), and then discard samples with a mean score below a threshold.


4 EXPERIMENTS, RESULTS, AND EVALUATION

In this section, we describe our evaluation methodology and then show controlled generation results under various attribute models. We also show use cases of PPLM in language detoxification and in controlled story writing. For all results reported in this section, we use top-k sampling (Fan et al., 2018) with k=10 to draw from the softmax distribution over the vocabulary.

4.1 EVALUATION METHODS AND ABLATION STUDY

We evaluate to assess two properties: whether PPLM generates text that satisfies the desired attribute (topic or sentiment) and whether the quality of its text deteriorates as we intensify control of the attribute. Note that we can always turn the control knob down to zero to disable control of attributes and reach the fluency of the original model. If desired, a user can tune the knobs at inference until a chosen tradeoff between attribute strength and fluency is reached. We evaluate using both automated methods and human annotators:
Automated Eval. Perplexity is an automated measure of fluency, though its effectiveness has been questioned in open-domain text generation (Liu et al., 2016). We measure perplexity using a different pre-trained language model, GPT (Radford et al., 2018b). The diversity of text in the passages is measured using the number of distinct n-grams (normalized by the length of text) as in Li et al. (2015). We report Dist-1, Dist-2, and Dist-3 scores for the distinct 1-, 2-, and 3-grams (measured across all samples generated for a given attribute control task, e.g. a specific topic for topic control). Such scores are an indicator of the diversity of the samples generated (Li et al., 2015). We also use external sentiment classifiers for sentiment evaluation.
Human Eval. We consider two types of human annotation: fluency and A/B testing on attribute relevance. Annotators are asked to evaluate the fluency of each individual sample on a scale of 1-5, with 1 being "not fluent at all" and 5 being "very fluent," as done in Lample et al. (2019). In the A/B testing for attribute relevance, we consider all combinatorial pairs of all four variants: B, BR, BC, and BCR (6 combinations). We then ask annotators to rank the pair on the desired attribute (e.g. topic relevance, sentiment strength), while allowing "neither" and "both" options to account for equally good/bad generations (Lample et al., 2019).
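The Dist-1, Dist-2 and Dist-3 scores mentioned above, used both for filtering repetitive samples (Section 3.4) and as automated diversity metrics, are straightforward to compute: the number of distinct n-grams divided by the total number of generated tokens. A minimal sketch follows, assuming plain whitespace tokenization (only an approximation of the byte-pair tokens used by GPT-2) and invented example strings:

    # Minimal sketch of the Dist-n diversity metric: distinct n-grams divided by the
    # total number of tokens, computed over all samples for a given control task.
    def dist_n(samples, n):
        ngrams, total_tokens = set(), 0
        for text in samples:
            tokens = text.split()                      # whitespace tokenization (approximation)
            total_tokens += len(tokens)
            ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(ngrams) / max(total_tokens, 1)

    samples = ["the potato is a root vegetable", "the potato chip recipe you asked for"]
    print([round(dist_n(samples, n), 3) for n in (1, 2, 3)])   # Dist-1, Dist-2, Dist-3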
We obtain annotations from nine external occupational annotators. Each pair of samples is evaluated by three individuals and we use majority-voting to

Table 3: Comparison of different samples generated by (top row) baseline GPT-2 and (other rows) PPLM with different BoW corresponding to different topics (e.g. [Military]), all conditioned on a single prefix: "The issue focused". Both directly optimized (in red) and related words (in soft red) are highlighted, showing how the optimization takes effect.

<
> + + compute attribute relevance. For fluency, we use average of the three annotations. The method of + generation is completely hidden and the order of samples in A/B testing is randomized. + Ablation study and baselines.We conduct an ablation study with four variants:B: the baseline, + unchanged GPT-2 LM, sampled once;BR: B but sampled r times, with best sample chosen based + on the LL ranking and filtering based on Dist score;BC: update the latent representations(Het )and + then sample once; and lastlyBCR: update the latent representations(Het )and generate r samples, + choose the best sample based on the LL score (after filtering out samples with low Dist scores). As + baseline approaches we considerCTRL: (Keskar et al., 2019), a recent language model;GPT2-FT- + RL: a GPT-2 LM fine-tuned for human evaluated positivity with RL (Ziegler et al., 2019); andWD: + a weighted decoding baseline in which the B LM’s outputs are weighted directly toward maximizing + <> (Ghazvininejad et al., 2017); see Section S7 for details, and Section S11 for hyperparameters. + + 4.2 BOW ATTRIBUTE MODELS + + The simplest attribute model we use gives the log of the sum of likelihoods of each word in some + predefined Bag of Words (BoW). Given a set of keywords <> that specify a topic of + interest and the output distribution of the language model <>, the log likelihood is: + + <> (4) + + We construct BoWs that represent seven distinct topics: SCIENCE, MILITARY, LEGAL, COMPUTERS, + SPACE, POLITICS and RELIGION (see Section S17 for complete word lists). Samples are + shown in Table 3, generated from a single prefix, while being controlled towards each topic. + Interestingly, we find that increasing the probability of generating the words in the bag also increases + the probability of generating related topical words not in the BoW (e.g. in the[Science] sample + shown in Table 3, note that question and philosophers are sampled before the first BoW word,laws). + Table S17 shows the gradual change of topic intensity under fine-grained control. We found that + the optimization procedure works better with updating representations from the past over a finite + window and using an adaptive normalization scheme (see Section S11.3). + For automatic and human evaluation, we generate 420 samples evenly distributed among seven BoW + attribute models and 20 prefixes (see the full list in Section S15), for each of the four variants de- + scribed in the ablation study. See Section S8 for further details on evaluation and results. Table 4 + shows that human annotators find text from BCR (51.7%) and BC (46.9%) to be significantly more + + Table 4: For each treatment in the ablation study, we report mean std-dev across (human and + automated) fluency metrics. The topic (%) reports the fraction of samples matching the target topic, + as evaluated by human annotators. Table S8 provides per-topic results. Approaches BC and BCR + demonstrate significant control over the topic of the generated text, while retaining similar diversity + (Dist-1, Dist-2, Dist-3) scores and minimal degradation in Perplexity and Fluency evaluations vs the + baseline LM (B). The gain from ranking and choosing from multiple samples BR over B is limited + (4.7%). The gain in topic-accuracy from latent (Het ) manipulation (from B to BC) is significantly + higher (35.8%). Perplexity is computed using the GPT LM (Radford et al., 2018a), which differs + from the LM generating text (GPT-2). 
For CTRL and WD, since human evaluation is performed in comparison with BCR via A/B testing, we also report the BCR numbers from those comparisons for the human-evaluated metrics. Further, we consider one sample per prefix for CTRL, resulting in fewer samples and consequently higher Dist-1, Dist-2, and Dist-3 scores. PPLM outperforms CTRL and WD on topic relevance, while being comparable on fluency scores.

<
>

on topic than B (15.8%) and BR (11.1%). With only a slight degradation in fluency scores, passages generated with manipulated latents (BC and BCR) are significantly on topic, demonstrating the desired attribute control on this task. The Dist-1, Dist-2 and Dist-3 scores, which account for the diversity of text across the generated passages, are similar across all four ablation approaches. Further, BCR slightly outperforms CTRL (51.7% vs. 50.0%), and significantly outperforms WD (36%). BC itself outperforms WD (36%). BCR, CTRL and WD all score similarly on the fluency metric.
We note that gradient-based latent updates have significantly greater influence on topic relevance (C with or without R) than reranking based on the score (R with or without C), showing that shifting meaning in latent space is more effective than shifting the output distribution directly through reweighting. The effectiveness of shifting latents is further corroborated by WD's relatively worse performance. WD directly controls the output distribution, which will not lead to an increased probability of sampling words from outside the bag that are related to the topic.
Finally, there is a large variance in the extent of controllability across topics (Table S8). We find that some topics (religion, science, politics) are easier to control for than others (computers, space). Section S9 considers unusual or nonsensical combinations of prefixes and attributes (e.g. prefix "potato" and topic "religion"), and we find that even for these settings PPLM is able to successfully control for the desired attribute, often with hilarious twists!

4.3 DISCRIMINATOR ATTRIBUTE MODELS

While BoW models have been demonstrated to be able to control text attributes such as sentiment (e.g., Li et al. (2018) rely on extracting a set of attribute-based phrases to control the sentiment during style transfer), being able to control attributes using more sophisticated discriminators is desirable when it is difficult to express the attribute with a simple bag of words.
We train a discriminator on a dataset with input sentences x and corresponding labels y_x. For an input x of length t, we compute <> and train f on the mean (<>) of the embeddings across time. All discriminators in this work consist of a single-layer classifier that predicts the target label from <>. The number of parameters in this layer is (embedding-dimension (e) x number of attributes (a) + number of attributes (a)), which is negligible compared to the number of parameters in the LM itself (Table 2). Although the loss is a function of the entire sequence, here we adopt a greedy approach, similar to Ebrahimi et al. (2018); Wallace et al. (2019), in which we optimize for

Table 5: Sentence samples in triplets, generated by {baseline GPT-2, PPLM-Discrim POSITIVE, PPLM-Discrim NEGATIVE}, conditioned on the prefixes "The chicken" & "The country". Words related to the sentiment are highlighted (in soft red). Each triplet is generated from the same random seed.

<
>

a higher probability of the sequence having a specific attribute by considering changes only to the next token to be generated. This objective can be described as follows, where f is the discriminator:

<> (5)

Note that <> is a function of <>. Further, <>, which depends on <>. In the limit, minimizing the objective in Equation 5 corresponds to choosing <> that produces the optimal <> that maximizes <>. However, this limits the diversity of the generated text and could potentially lead to language degeneration (Holtzman et al., 2019). Alternatively, we focus on a softer optimization approach where we aim to shift the distribution <> towards one that in expectation has a higher likelihood of having the desired attribute a. Possible approaches to accomplishing this are using REINFORCE (Williams, 1992) and the Gumbel-Softmax trick (Jang et al., 2016). However, both of these would slow down convergence. Instead, as in Dai et al. (2019a), we use the distribution <> (instead of a hard sample <>), and feed it forward to obtain (a biased) estimate of the next token's embedding and then update <>.
The sentiment discriminator here distinguishes sentiment between POSITIVE and NEGATIVE and is trained on the SST-5 dataset (Socher et al., 2013). Table 5 shows PPLM-Discrim generated samples in triplets: uncontrolled, controlled for POSITIVE sentiment, controlled for NEGATIVE sentiment. For automatic and human evaluation, we use 15 prefixes (see the full list in Section S15) to generate 45 samples for each of two sentiment classes: very positive and very negative. Note that even though the sentiment discriminator is trained with movie review data, the prefixes (e.g. "The painting", "The potato", "The country") we used are not necessarily associated with movie reviews. This supports the generality of our approach: an attribute model trained with data from a different domain can still provide meaningful gradients.
Table 6 shows evaluation results. For human evaluation, we obtain 1620 annotations for the ablation study and 495 for baseline comparisons from the annotators, distributed across the samples and sentiments. Unlike the topic control setting, sampling and ranking results in a considerable increase in attribute accuracy (19.3% -> 41.5%), because the prior probability of sampling, say, a negative sentence, is relatively high. BC results in a decrease in fluency when compared to B, while being significantly more consistent with the desired attribute (19.3% -> 39.6%). With latent manipulation and ranking (BCR), we see a significant increase in attribute control accuracy (73.7%) while retaining fluency similar to B and BR. Further, the gain in sentiment accuracy from re-sampling is larger in the case of manipulated latents vs. non-manipulated (a 34.1% increase from BC to BCR > a 22.2% increase from B to BR), indicating that these two approaches may be profitably combined. We also evaluate attribute control with an external sentiment classifier trained on IMDB movie reviews (Maas et al., 2011), which is a different dataset from the one used to train the attribute model (Socher et al., 2013), and the same rough story holds, albeit with smaller gaps between approaches. We compare to baselines CTRL, GPT2-FT-RL, and WD. BCR performs comparably to CTRL (73.7% and 80.0%), and BR, BC and BCR all outperform GPT2-FT-RL, the GPT-2 LM fine-tuned for positivity, and WD.
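Because the discriminator attribute model of Section 4.3 is just a single linear layer over the time-averaged LM embeddings, it is cheap to train. The sketch below is only meant to make the parameter count (e x a + a) and the mean-pooling concrete: the random tensors stand in for transformer embeddings, and the optimizer, learning rate, and epoch count are arbitrary assumptions rather than the authors' training setup.

    # Minimal sketch of a single-layer discriminator attribute model: a linear layer
    # trained on the mean of the LM's hidden embeddings across time. The embeddings
    # here are random placeholders; in PPLM they come from the pre-trained transformer.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    embed_dim, num_attributes = 32, 2                 # e and a; parameter count is e*a + a
    discriminator = torch.nn.Linear(embed_dim, num_attributes)
    optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-2)

    # Toy batch: 8 "sentences", each a (seq_len, embed_dim) matrix of LM embeddings.
    batch = [torch.randn(torch.randint(5, 12, (1,)).item(), embed_dim) for _ in range(8)]
    labels = torch.randint(0, num_attributes, (8,))

    for epoch in range(50):
        pooled = torch.stack([x.mean(dim=0) for x in batch])   # mean over time: (8, embed_dim)
        loss = F.cross_entropy(discriminator(pooled), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print("final training loss:", loss.item())

The same pooled-embedding classifier provides both the ranking score and the gradient direction used for the latent updates, which is why keeping it this small keeps the whole scheme "plug and play".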
Table 6: Evaluation of models/variants on the sentiment control task, with mean ± std-dev reported across fluency metrics. Sentiment accuracy reports the fraction of samples with an accurate target sentiment. Approach BCR provides significant control over sentiment while showing minimal degradation in fluency. See Table S9 for full results on individual sentiments. *GPT2-FT-RL is only evaluated for the positivity half of the task, as it is fine-tuned only for positivity (Ziegler et al., 2019). For human evaluation metrics, we compare the baselines CTRL, GPT2-FT-RL and WD with BCR and perform A/B style testing. We include both numbers for comparison.

<
> + + 4.4 LANGUAGE DETOXIFICATION + + Language models trained with large corpora of Internet data reflect biases and discrimination + existing in the data. A recent paper by Wallace et al. (2019) conducted adversarial attacks that make + GPT-2 produce racist output when given a carefully optimized trigger string as prefix. They also + find that when simply using “Blacks” as prefix, 2% of GPT-2 samples contain explicit racism. Other + prefixes (e.g., “Asians” or “Jews”) are mentioned but no percentage is reported. We conduct + experiments and report the baseline toxicity percentages to be 10% (“Asians”), 12% (“Jews”) and 8% + (“Blacks”). With adversarial triggers generated from the released codebase by Wallace et al. (2019) + the average toxicity percentage is 63.6%. Further details can be found in Section S13. + PPLMs can be easily adapted for language detoxification by plugging in a toxicity classifier as the + attribute control model and update latents with the negative gradient. We train a single layer classifier + on the toxicity data from the Toxic Comment Classification Challenge (Jigsaw) and show that with + a similar hyper-parameter setting as other PPLM-Discrim methods, it works well on both natural + prompts and adversarial triggers. For natural prompts percentages of toxicity are 6%, 4% and 10%, + respectively, and for adversarial triggers it drastically dropped to 4.6% on average, with statistical + significance. Details on the annotation procedure and full table of percentage and p-values can be + found in Table S23 and Section S13. Note that a model for detoxifying language can also potentially + be maliciously used for generating toxic language, a topic we briefly discuss in Section S6. + + 4.5 CONTROLLED STORY WRITING + + We explore controlled generation for assistive story writing (Peng et al., 2018; Luo et al., 2019; Yao + et al., 2019; Fan et al., 2018). Using uncontrolled LMs for assistive art creation can be difficult. To + help with the structure, we use predefined story skeletons often used in improvisation (Adams). We + fill in the blank between these prefixes with a PPLM. See examples in Table S20 and Table S21. + + + 5 CONCLUSION + + We have presented PPLM, a plug and play method for controlled language generation that flexibly + combines a large, pre-trained LM and a BoW or a small, easy-to-train discriminator. In Section S6 + we discuss the ethics of controlled LMs. PPLM achieves fine-grained control of attributes via a + simple gradient-based sampling mechanism. Because PPLMs can flexibly control generation while + maintaining fluency, they hold great promise for enabling the next generation of language models. + + ACKNOWLEDGEMENTS + + The authors are grateful to Bryan McCann for providing samples for the CTRL baseline, Joel + Lehman for discussion regarding the ethical implications for this work, Jiale Zhi for help with the + computational framework, Colan Chen for creating associated artwork for the blog, Avishek Joey + Bose for helpful discussions, Julien Chaumond, Lysandre Debut, Thomas Wolf, and the Hugging + Face team for co-producing the PPLM demo and helping integrate the code into their transformers + repository, all the annotators at Uber, HKUST and Caltech for their labeling, and members of the + Deep Collective research group for helpful discussion, ideas, and feedback on experiments. + + REFERENCES + Kenn Adams. Improv encyclopedia story spine. http://improvencyclopedia.org/ + games/Story_Spine.html. (accessed September 20, 2019). 
+ Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. Generating more interesting responses in + neural conversation models with distributional constraints. InProceedings of the 2018 Conference + on Empirical Methods in Natural Language Processing, pp. 3970–3980, 2018. + Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic + language model.Journal of machine learning research, 3(Feb):1137–1155, 2003. + Yun Chen, Victor OK Li, Kyunghyun Cho, and Samuel R Bowman. A stable and effective learning + strategy for trainable greedy decoding.arXiv preprint arXiv:1804.07915, 2018. + Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. Style transformer: Unpaired text style + transfer without disentangled latent representation.arXiv preprint arXiv:1905.05621, 2019a. + Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan + Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context.arXiv + preprint arXiv:1901.02860, 2019b. + Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep + bidirectional transformers for language understanding. InProceedings of the 2019 Conference of + the North American Chapter of the Association for Computational Linguistics: Human Language + Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019. + Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial ex- + amples for text classification. InProceedings of the 56th Annual Meeting of the Associa- + tion for Computational Linguistics (Volume 2: Short Papers), pp. 31–36, Melbourne, Aus- + tralia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2006. URL + https://www.aclweb.org/anthology/P18-2006. + Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data. + InProceedings of the 2018 Conference on Empirical Methods in Natural Language Process- + ing, pp. 11–21, Brussels, Belgium, October-November 2018. Association for Computational Lin- + guistics. doi: 10.18653/v1/D18-1002. URLhttps://www.aclweb.org/anthology/ + D18-1002. + Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation.arXiv preprint + arXiv:1805.04833, 2018. + Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. + InProceedings of the Workshop on Stylistic Variation, pp. 94–104, 2017. + Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an interactive poetry + generation system. InProceedings of ACL 2017, System Demonstrations, pp. 43–48, Vancouver, + Canada, July 2017. Association for Computational Linguistics. URLhttps://www.aclweb. + org/anthology/P17-4008. + Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor OK Li. Learning to translate in real-time + with neural machine translation.arXiv preprint arXiv:1610.00388, 2016. + Jiatao Gu, Kyunghyun Cho, and Victor OK Li. Trainable greedy decoding for neural machine + translation.arXiv preprint arXiv:1702.02429, 2017. + Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning + to write with cooperative discriminators.CoRR, abs/1805.06087, 2018. URLhttp://arxiv. + org/abs/1805.06087. + Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degener- + ation.arXiv preprint arXiv:1904.09751, 2019. + Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 
Controllable text + generation.CoRR, abs/1703.00955, 2017. URLhttp://arxiv.org/abs/1703.00955. + Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. 2016. + Jigsaw. Toxic comment classification challenge. https://www.kaggle.com/c/ + jigsaw-toxic-comment-classification-challenge/. Accessed: 2019-11-13. + Nitish Shirish Keskar, Bryan McCann, Lav Varshney, Caiming Xiong, and Richard Socher. CTRL + - A Conditional Transformer Language Model for Controllable Generation. arXiv preprint + arXiv:1909, 2019. + Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. Con- + trolling output length in neural encoder-decoders. InProceedings of the 2016 Conference on + Empirical Methods in Natural Language Processing, pp. 1328–1338, Austin, Texas, Novem- + ber 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1140. URL + https://www.aclweb.org/anthology/D16-1140. + Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, + and Y-Lan Boureau. Multiple-attribute text rewriting. InInternational Conference on Learning + Representations, 2019. URLhttps://openreview.net/forum?id=H1g2NhC5KQ. + Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A Diversity-Promoting + Objective Function for Neural Conversation Models.arXiv e-prints, art. arXiv:1510.03055, Oct + 2015. + Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to + sentiment and style transfer.CoRR, abs/1804.06437, 2018. URLhttp://arxiv.org/abs/ + 1804.06437. + Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. + How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics + for dialogue response generation. InProceedings of the 2016 Conference on Empirical Methods + in Natural Language Processing, pp. 2122–2132, 2016. + Fuli Luo, Damai Dai, Pengcheng Yang, Tianyu Liu, Baobao Chang, Zhifang Sui, and Xu Sun. + Learning to control the fine-grained sentiment for story ending generation. InProceedings of the + 57th Annual Meeting of the Association for Computational Linguistics, pp. 6020–6026, 2019. + Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher + Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting + of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, + Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URLhttp: + //www.aclweb.org/anthology/P11-1015. + Christopher D Manning, Christopher D Manning, and Hinrich Schütze.Foundations of statistical + natural language processing. MIT press, 1999. + Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward + Teller. Equation of state calculations by fast computing machines. The journal of chemical + physics, 21(6):1087–1092, 1953. + Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. Facebook fair’s + wmt19 news translation task submission.arXiv preprint arXiv:1907.06616, 2019. + Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High con- + fidence predictions for unrecognizable images.The IEEE Conference on Computer Vision and + Pattern Recognition (CVPR), June 2015. + Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. 
Plug & Play + Generative Networks: Conditional Iterative Generation of Images in Latent Space. InThe IEEE + Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. + Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. Towards controllable story + generation. InProceedings of the First Workshop on Storytelling, pp. 43–49, 2018. + Martin Potthast, Tim Gollub, Kristof Komlossy, Sebastian Schuster, Matti Wiegmann, Erika Pa- + tricia Garces Fernandez, Matthias Hagen, and Benno Stein. Crowdsourcing a large corpus of + clickbait on twitter. InProceedings of the 27th International Conference on Computational Lin- + guistics, pp. 1498–1507, 2018. + Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language un- + derstanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai- + assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018a. + Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language un- + derstanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai- + assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018b. + Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language + models are unsupervised multitask learners.OpenAI Blog, 1(8), 2019. + Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin + diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1): + 255–268, 1998. + Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of langevin distributions and + their discrete approximations.Bernoulli, 2(4):341–363, 1996. + Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. What makes a good conversation? + How controllable attributes affect human judgments.arXiv e-prints, art. arXiv:1902.08654, Feb + 2019. + Claude Elwood Shannon. A mathematical theory of communication.Bell system technical journal, + 27(3):379–423, 1948. + Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. Style transfer from non-parallel + text by cross-alignment. CoRR, abs/1705.09655, 2017. URLhttp://arxiv.org/abs/ + 1705.09655. + Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, + and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment + treebank. InProceedings of the 2013 Conference on Empirical Methods in Natural Language + Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computa- + tional Linguistics. URLhttps://www.aclweb.org/anthology/D13-1170. + Felix Stahlberg, James Cross, and Veselin Stoyanov. Simple fusion: Return of the language model. + arXiv preprint arXiv:1809.00125, 2018. + Nishant Subramani, Sam Bowman, and Kyunghyun Cho. Can unconditional language models re- + cover arbitrary sentences?arXiv preprint arXiv:1907.04944, 2019. + Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfel- + low, and Rob Fergus. Intriguing properties of neural networks.CoRR, abs/1312.6199, 2013. + Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, + Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Infor- + mation Processing Systems, pp. 6000–6010, 2017. + Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 
Universal adversarial + triggers for nlp.arXiv preprint arXiv:1908.07125, 2019. + Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement + learning.Machine learning, 8(3-4):229–256, 1992. + Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, + Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Transformers: State- + of-the-art natural language processing, 2019. + Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. Plan-and- + write: Towards better automatic storytelling. InProceedings of the AAAI Conference on Artificial + Intelligence, volume 33, pp. 7378–7385, 2019. + Kyra Yee, Nathan Ng, Yann N Dauphin, and Michael Auli. Simple and effective noisy channel + modeling for neural machine translation.arXiv preprint arXiv:1908.05731, 2019. + Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets + with policy gradient. InThirty-First AAAI Conference on Artificial Intelligence, 2017. + Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. The neural noisy + channel.arXiv preprint arXiv:1611.02554, 2016. + Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and + Chris Dyer. Putting machine translation in context with the noisy channel model.arXiv preprint + arXiv:1910.00553, 2019. + Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul + Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv + preprint arXiv:1909.08593, 2019. URLhttps://arxiv.org/abs/1909.08593. + + + SUPPLEMENTARY INFORMATION FOR : + PLUG AND PLAY LANGUAGE MODELS : A SIMPLE + APPROACH TO CONTROLLED TEXT GENERATION + + S6 ETHICS OF CONTROLLED LANGUAGE MODELS + + There has recently been a substantial discussion around the ethics of capable language models (Rad- + ford et al., 2019; Keskar et al., 2019), both in their potential to recapitulate problematic social biases + and for them to be directly abused for societal harm (e.g. to generate disinformation). While one aim + of this paper is to suggest a mechanism to detoxify language models (Section 4.4), we also acknowl- + edge that nearly the same mechanism could be exploited to instead create more toxic language. Such + possibilities are inherent to general-purpose technologies such as machine learning, and we believe + that on balance this work creates more value than risks. + + S7 DETAILS ON BASELINE METHODS + + We consider three baselines: CTRL, GPT2-FT-RL, and WD. The first two are strong baselines where + large language models are trained (or fine-tuned) specifically to generate texts conditioned on certain + attributes, while WD is considered a weak baseline based on a direct integration of the conditioning + into the decoding. + For each baseline, we generate data from their method, and conduct the same human and automated + evaluations. For human evaluation of attribute relevance, we match baseline data with our method + (BCR in the ablation study), and pass to human annotators for an A/B testing style annotation. As + in the ablation study, human annotators are given a pair of texts, one from baseline, one from ours, + with orders randomized and source hidden, and asked to rank which one is more topic or sentiment + relevant, while having the options of “both” and “neither”. 
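To make the A/B annotation protocol above concrete, here is one way per-pair annotations could be aggregated into the relevance percentages reported earlier. The majority-vote and tie-handling conventions below are illustrative assumptions, not a procedure taken from the paper.

    # Illustrative aggregation of A/B attribute-relevance annotations: each pair of
    # samples gets three votes from {"A", "B", "both", "neither"}; we take the
    # majority vote per pair and report the fraction of pairs each side wins.
    # The tie-handling convention here is an assumption.
    from collections import Counter

    def ab_relevance(pair_votes):
        wins = Counter()
        for votes in pair_votes:                       # votes: list of 3 annotator choices
            choice, count = Counter(votes).most_common(1)[0]
            if count >= 2 and choice in ("A", "B"):    # clear majority for one side
                wins[choice] += 1
            else:                                      # "both", "neither", or no majority
                wins["tie"] += 1
        total = sum(wins.values())
        return {k: v / total for k, v in wins.items()}

    votes = [["A", "A", "B"], ["B", "B", "both"], ["neither", "A", "A"], ["both", "both", "B"]]
    print(ab_relevance(votes))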
+ On top of that, we have human annotators to give the fluency score of each text sample under + each method individually. And automated evaluations of perplexity, sentiment, etc. are also done + individually. + + S7.1 CTRL + + The recent conditional language model, CTRL, from Keskar et al. (2019), trains a 1.6B LM condi- + tioned on around 50 control codes. We use the official released codebase 2 and their open-sourced + model to generate samples for the CTRL baseline. Out of the 7 topics considered in PPLM-BoW, + we found that 5 can be matched with a specific control code in CTRL. We append a secondary + code "Text:" to each primary control code, per the author’s suggestion, to encourage more fluent and + longer passages. The 2 topics missing a match with CTRL are: Military, Space. For positive and + negative sentiments in PPLM-Discrim, we match with the Reviews control code and append a high + and low rating score. + The matched attributes and control codes are listed in Table S7. + Under this setting, for each control code we generate texts prompted by the same prefixes used for + corresponding PPLM attribute model (20 for PPLM-BoW, 15 for PPLM-Discrim). For example, “In + summary” and “To review,” for PPLM-BoW, and “The chicken”, “The lake” for PPLM-Discrim. + Due to the near-greedy sampling method CTRL uses, for each prefix it generates one sample. Hence + we have 20 samples for each matching topic with PPLM-BoW, and 15 samples for positive and 15 + for negative. + + S7.2 GPT2-FT-RL + + A recently released GPT-2 model fine-tuned using human feedback, from Ziegler et al. (2019), + showed success in summarization and text continuation in desired styles. To compare with PPLM, + 2 CTRL codebase:https://github.com/salesforce/ctrl + + Table S7: Control codes used for the model from Keskar et al. (2019) for experiments in Section 4. + + <
> + + we run GPT2-FT-RL 3 to generate positive texts on the same prefixes used in our PPLM-Discrim + experiment. For each prefix, we generate three GPT2-FT-RL samples, and pair them with those + generated from PPLM (BCR in the ablation study) randomly. + + S7.3 WEIGHTED DECODING (WD) + + We consider a simple baseline based on a direct integration of the conditioning into the decoding + procedure, similar to the approach from Ghazvininejad et al. (2017). + + Topic Control with Bag of Words In Ghazvininejad et al. (2017), the authors consider increasing + the likelihood of sampling from a bag of key-words by performing beam-search with a modified + scoring function. + <>; + + where 1 BoW (<>) is an indicator function indicating if the tokenwi is present in the bag BoW. Since, + it has been shown that beam-search results in degradation of language for GPT-2 (Holtzman et al., + 2019), we consider top-5 sampling from a distribution <> defined such that: + + <> + + where <> and FORMULA is the distribution over the vocabulary as predicted by the GPT-2 LM . For + the experiments in Section 4, we set <>. + + Sentiment Control with Discriminator Here, we implemented weighted decoding similarly for + sentiment control. Here we wish to incorporate the score from the attribute model into decoding. To + control for stylea^, instead of sampling from the distributionpt+1 , we sample from <> defined as: + + <> + + <> is the probabilty of the sequence <> possessing attribute <> assigned by the + attribute model. By Bayes’ rule, <>, and we do top-5 + sampling from this distribution. Recall that <> under the language model. + + S8 FURTHER DETAILS ON HUMAN AND AUTOMATED EVALUATION + + We conduct evaluations on attribute relevance and language fluency, both including human and + automated evaluation. + For topic relevance (a.k.a attribute relevance where the attribute is a topic, in our case represented + by a BoW), we rely entirely on human annotation. For sentiment relevance, we rely on human + annotation as well as a separately trained sentiment classifier. We also performed a “clickbait” style + control, for which the effectiveness relies on human annotation. + + 3 GPT2-FT-RL codebase:https://github.com/openai/lm-human-preferences + + For fluency, we use human annotations (between 1 to 5) and automated methods: perplexity, Dist-1, + Dist-2, and Dist-3 scores. + The number of human evaluations are as below: + + <
S8 FURTHER DETAILS ON HUMAN AND AUTOMATED EVALUATION

We conduct evaluations on attribute relevance and language fluency, both including human and automated evaluation.
For topic relevance (a.k.a. attribute relevance where the attribute is a topic, in our case represented by a BoW), we rely entirely on human annotation. For sentiment relevance, we rely on human annotation as well as a separately trained sentiment classifier. We also performed a "clickbait" style control, for which the effectiveness evaluation relies on human annotation.

3 GPT2-FT-RL codebase: https://github.com/openai/lm-human-preferences

For fluency, we use human annotations (scores between 1 and 5) and automated methods: perplexity, Dist-1, Dist-2, and Dist-3 scores.
The number of human evaluations is given below:

<
>

In ablation studies, the generation procedure for BCR, BR and BC is always initiated from the same random seeds. The same set of random seeds that leads to the samples chosen with BCR is stored and used to generate the samples with B.
The full table of all these measures, human and automated, on PPLM-BoW, separated by topic, is in Table S8. Included also are strong baselines (CTRL and WD) for each topic. The human-annotated topic relevance is further visualized in Figure S3. The fluency scores, though listed per method ({B, BC, BR, BCR}) in the table, have very similar distributions, as seen in Figure S5.
The full table of all these measures, human and automated, on PPLM-Discrim sentiments and styles, is in Table S9. Included also are strong baselines (CTRL, WD and GPT2-FT-RL) for each sentiment. The human-annotated sentiment and style (e.g. "Clickbait") relevance is further visualized in Figure S4, along with aggregated measures: all sentiments, all discriminators, all topics. The fluency scores again have similar distributions across the {B, BC, BR, BCR} methods, as seen in Figure S6.

<
>

Figure S3: Topic relevance by human evaluation. We can see that taking a PPLM gradient step (B→BC) makes a big difference. Reranking is mostly helpful (B→BR; BC→BCR). We can also see a rough distribution of various topics in unperturbed GPT-2 generation (B), which possibly mirrors the distribution of topics in its training data. Some topics, like science, naturally appear rather frequently.

S9 ODD COMBINATION OF TOPICS AND PREFIXES

It is interesting to see how PPLM can steer the text generation when the topic and prefix combination appears odd or illogical. For example, will "The potato" still prompt sensible text generation under the topic RELIGION? In this study we design a set of odd combinations, as below.

Table S8: Full result of human and automated evaluation of PPLM-BoW, attribute relevance and language fluency. This is a detailed version of Table 4, where results were averaged over all topics. Results here correspond to the average over all samples in each topic, for each method in the ablation study (B, BC, BR, BCR), and in the baselines (CTRL, WD). Perplexity is computed based on an external LM (Radford et al., 2018a) that is different from the LM generating the text.

<
>

Figure S4: Bar charts of discriminator relevance by human evaluation, together with different versions of combined results.

<
>

Table S9: Full result of human and automated evaluation of PPLM-Discrim, attribute relevance and language fluency. The top two rows are a detailed version of Table 6, where results were averaged over both sentiments (except for GPT2-FT-RL, where there is only positive sentiment). The last row is the additional CLICKBAIT style control, where there is only an ablation study and no baseline comparison. Results here correspond to the average over all samples in each sentiment and style, for each method in the ablation study (B, BC, BR, BCR), and in the baselines (CTRL, GPT2-FT-RL, WD). Perplexity is computed based on an external LM (Radford et al., 2018a) that is different from the LM generating the text.

<
>

We found that PPLM control is easy even under these scenarios. We had to increase the strength two- or three-fold (to 0.02 or 0.03, as opposed to 0.01 in most studies) to allow for a stronger influence of the attribute, but this is as expected: the strength parameter is a knob that the user can tune to reach fine-grained control. The resulting generations are included in Table S10 – Table S16.

Table S10: Examples generated from a designed odd combination of topic and prefix pairs. The topic here is [Military]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the topic and the prefix.

<
>

Table S11: Examples generated from a designed odd combination of topic and prefix pairs. The topic here is [Legal]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the topic and the prefix.

<
>

Table S12: Examples generated from a designed odd combination of topic and prefix pairs. The topic here is [Computers]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the topic and the prefix.

<
>

Table S13: Examples generated from a designed odd combination of topic and prefix pairs. The topic here is [Politics]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the topic and the prefix.

<
>

Table S14: Examples generated from a designed odd combination of topic and prefix pairs. The topic here is [Religion]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the topic and the prefix.

<
>

Table S15: Examples generated from a designed odd combination of topic and prefix pairs. The topic here is [Space]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the topic and the prefix.

<
>

Table S16: Examples generated from a designed odd combination of topic and prefix pairs. The sentiment here is [Positive] and [Negative]. We show that PPLM is still able to generate fluent, sensible and interesting samples, respecting both the sentiment and the prefix.

<
>

S10 FINE-GRAINED CONTROL WITH PPLM-BOW

Table S17 shows the subtle effect of turning the step size up, while keeping everything else (hyperparameters, text prefix) the same.

S11 HYPERPARAMETERS

We list, in Table S18, the full set of hyperparameters used in each task in the experiments section, corresponding to results in Table 4 and Table 6, as well as in Section 4.4. In addition, we explain three hyperparameters and their effects in detail below.

S11.1 EARLY STOPPING OF LATENT UPDATES

Degeneration (the occurrence of repetitive words) is a known issue with language generation (Holtzman et al., 2019), and we found it to be the case in PPLM-BoW when the update step size is too large. The model tends to degenerate towards repeating certain keywords targeted in the optimization (e.g. words in the BoW). In this case, we can either reduce the step size, or use the trick of early stopping the latent updates.
Examples are shown in Table S19. With the exact same setting, but stopping latent updates after 20 time steps, the samples show much less degeneration.

S11.2 FINITE HORIZON UPDATE

As opposed to updating the entire vector H_t, which consists of key-value pairs corresponding to every token in the prefix, we consider modifying only the key-value pairs corresponding to the most recent w tokens. At each time-step t, we only modify <>. This means that we modify H_i at most w times, and it requires less computation than updating the whole past. We find that w = 5 produces more fluent passages for control with the bag of words. For control with the neural attribute model, we update the entire latent history.

S11.3 ADAPTIVE GRADIENT NORMALIZATION

For the bag-of-words based attribute model, what we wish to enforce is that a word from the bag appears at least once in the generated passage, not at every time-step. To account for this, instead of normalizing directly by the gradient norm as in Equation 3, we normalize by the maximum gradient norm over time. This implies that we make smaller updates when it is less likely for a word from the bag of words to appear. Formally, the normalization constant at time-step t is:

<>
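As a minimal sketch of this normalization scheme, the helper below scales each latent update by the running maximum of gradient norms rather than the current norm; the variable names, the step size, and the small epsilon are illustrative assumptions, not the reference implementation.

```python
import torch

def max_normalized_update(grad: torch.Tensor, running_max: float, step_size: float = 0.01):
    """Scale a latent update by the largest gradient norm seen so far.

    Dividing by the running maximum (instead of the current norm) makes the
    update smaller at steps where BoW words are unlikely, as described above.
    """
    running_max = max(running_max, grad.norm().item())
    update = step_size * grad / (running_max + 1e-10)
    return update, running_max

# usage inside the latent-update loop (hypothetical names):
# max_norm = 0.0
# for t in range(num_steps):
#     grad_H = compute_attribute_gradient(H)          # placeholder for the PPLM gradient
#     delta_H, max_norm = max_normalized_update(grad_H, max_norm)
#     H = H + delta_H
```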
3 We choose the top 3 samples from a single batch of 10 here.

Table S17: Comparison of different sentences generated by (top row) unconditioned GPT-2 and

<
>

Table S18: The full set of hyperparameters used in each task in the experiments section. Note that for PPLM-BoW, we select three of the highest scoring samples from a single batch of r = 10. For PPLM-Discrim, we get 1 sample per batch, across 3 batches of r = 10.

<
>

Table S19: The effect of using early stopping of latent updates to prevent sample degeneration. Before: latent updates at every generation step. After: latent updates for only the first 20 generation steps.

<
> + + Table S20: Skeleton story generation with different attribute models. Each story is generated within + + <
>

S12 ADDITIONAL DETAILS AND SAMPLES OF CONTROLLED STORY TELLING

Table S20 and Table S21 show examples of controlled story telling with a skeleton. Table S22 shows samples of combining multiple (three to four) attribute models, across different types (PPLM-BoW and PPLM-Discrim).

S13 FURTHER DETAILS ON LANGUAGE DETOXIFICATION

We ask a group of four volunteers to assign binary toxicity labels to texts generated from both the original GPT-2 and toxicity-controlled GPT-2, and to texts prompted with natural words as well as adversarial triggers. Natural prompts include "Blacks", "Asians", "Jews", as suggested in Wallace et al. (2019). Adversarial triggers are generated using the released codebase from the same work.
The percentages of toxicity, before (with uncontrolled GPT-2) and after (with toxicity-controlled GPT-2), are reported in Table S23.

S14 MORE EXAMPLES

We include more PPLM controlled generation examples in Table S24 – Table S27.

S15 PREFIXES USED IN PPLM EVALUATION

We consider 20 prefixes as sentence starters for evaluating PPLM-BoW generation, chosen randomly from www2.eit.ac.nz/library/ls_guides_sentencestarters.html. For PPLM-Discrim, we use 15 prefixes. The entire set is below.

PPLM-BoW: "In summary", "This essay discusses", "Views on", "The connection", "Foundational to this is", "To review,", "In brief,", "An illustration of", "Furthermore,", "The central theme", "To conclude,", "The key aspect", "Prior to this", "Emphasised are", "To summarise", "The relationship", "More importantly,", "It has been shown", "The issue focused on", "In this essay".

PPLM-Discrim: "Once upon a time", "The book", "The chicken", "The city", "The country", "The horse", "The lake", "The last time",

Table S21: More examples of skeleton story generation with different attribute models. Each story

<
>

S16 COMBINING MULTIPLE CONTROLLERS FOR INSPIRATION

Earlier we demonstrated attribute control using a single attribute model or two attribute models of the same type (e.g. BoW from two separate topics). Here we mix different types of attribute models (BoW and discriminator). For example, we can control the generation toward a mixed topic about WINTER, POLITICS, and KITCHEN, while turning POSITIVE. See examples in Table S22.

<
>

Figure S5: Histogram illustrating the distribution of fluency scores for controlled generation with PPLM-BoW from the four methods considered in the ablation study. We find that fluency scores from all four approaches are similarly distributed.

<
>

Figure S6: Histogram illustrating the distribution of fluency scores for controlled generation with PPLM-Discrim from the four methods considered in the ablation study. We find that fluency scores from all four approaches are similarly distributed.

S17 WORD LISTS FOR BAG OF WORDS APPROACHES

We curate word lists from www.enchantedlearning.com/wordlist.

<
>

Table S22: Examples of attribute controlled text generation with multiple knobs. We train a clickbait discriminator using the dataset from Potthast et al. (2018).

<
>

Table S23: Language detoxification applied to natural prompts and adversarial triggers. Shown are the number of toxic passages / number of samples annotated, and the percentage of toxicity. The column p-value shows the statistical significance of "After" being lower than "Before".

<
>

Table S24: Comparison of different samples generated with different prefixes using the same PPLM-BoW control under the [Military] topic. All samples are generated using the exact same hyperparameters.

<
>

Table S25: Comparison of different samples generated with different prefixes using the same PPLM-BoW control under the [Space] topic. All samples are generated using the exact same hyperparameters.

<
>

Table S26: Comparison of different samples generated with different prefixes using the same PPLM-BoW control under the [Science] topic. All samples are generated using the exact same hyperparameters.

<
>

Table S27: Comparison of different samples generated with different prefixes using the same PPLM-BoW control under the [Politics] topic. All samples are generated using the exact same hyperparameters.

<
>

<> <> <>


<> <> <>
Predicting Performance for Natural Language Processing Tasks

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig
Language Technologies Institute, Carnegie Mellon University
{mengzhox,aanastas,yiming,gneubig}@cs.cmu.edu ruochenx@gmail.com

Abstract

Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.

1 Introduction

Natural language processing (NLP) is an extraordinarily vast field, with a wide variety of models being applied to a multitude of tasks across a plenitude of domains and languages. In order to measure progress in all these scenarios, it is necessary to compare performance on test datasets representing each scenario. However, the cross-product of tasks, languages, and domains creates an explosion of potential application scenarios, and it is infeasible to collect high-quality test sets for each. In addition, even for tasks where we do have a wide variety of test data, e.g. for well-resourced tasks such as machine translation (MT), it is still computationally prohibitive, as well as not environmentally friendly (Strubell et al., 2019), to build and test systems for all languages or domains we are interested in. Because of this, the common practice is to test new methods on a small number of languages or domains, often semi-arbitrarily chosen based on previous work or the experimenters' intuition.
As a result, this practice impedes the NLP community from gaining a comprehensive understanding of newly-proposed models. Table 1 illustrates this fact with an example from bilingual lexicon induction, a task that aims to find word translation pairs from cross-lingual word embeddings. As vividly displayed in Table 1, almost all the works report evaluation results on a different subset of language pairs. Evaluating only on a small subset raises concerns about making inferences when comparing the merits of these methods: there is no guarantee that performance on English–Spanish (EN–ES, the only common evaluation dataset) is representative of the expected performance of the models over all other language pairs (Anastasopoulos and Neubig, 2020). Such phenomena lead us to consider whether it is possible to make a reasonably accurate estimate of the performance on an untested language pair without actually running the NLP model, thereby bypassing the computational restriction.
Toward that end, drawing on the idea of characterizing an experiment from Lin et al. (2019), we propose a framework, which we call NLPERF, to provide an exploratory solution.
We build regression models to predict the performance of a particular experimental setting given past experimental records of the same task, with each record consisting of a characterization of its training dataset and a performance score on the corresponding metric. Concretely, in §2, we start with a partly populated table (such as the one in

<
>

Table 1: An illustration of the comparability issues across methods and multiple evaluation datasets from the Bilingual Lexicon Induction task. Our prediction model can reasonably fill in the blanks, as illustrated in Section 4.

Table 1) and attempt to infer the missing values with the predictor. We begin by introducing the process of characterizing an NLP experiment for each task in §3. We evaluate the effectiveness and robustness of NLPERF by comparing to multiple baselines and human experts, and by perturbing a single feature to simulate a grid search over that feature (§4). Evaluations on multiple tasks show that NLPERF is able to outperform all baselines. Notably, on a machine translation (MT) task, the predictions made by the predictor turn out to be more accurate than those of human experts.
An effective predictor can be very useful for multiple applications associated with practical scenarios. In §5, we show how it is possible to adopt the predictor as a scoring function to find a small subset of experiments that are most representative of a bigger set of experiments. We argue that this will allow researchers to make informed decisions on what datasets to use for training and evaluation, in the case where they cannot experiment on all experimental settings. Last, in §6, we show that we can adequately predict the performance of new models even with a minimal number of experimental records.

2 Problem Formulation

In this section we formalize the problem of predicting performance on supervised NLP tasks. Given an NLP model of architecture M trained over dataset(s) D of a specific task involving language(s) L with a training procedure (optimization algorithms, learning rate scheduling, etc.) P, we can test the model on a test dataset D' and get a score S of a specific evaluation metric. The resulting score will surely vary depending on all the above-mentioned factors, and we denote this relation as g:

<>. (1)

In the ideal scenario, for each test dataset D' of a specific task, one could enumerate all different settings and find the one that leads to the best performance. As mentioned in Section 1, however, such a brute-force method is computationally infeasible. Thus, we turn to modeling the process and formulating our problem as a regression task, using a parametric function f_θ to approximate the true function g as follows:

<>

where <> denotes a set of features for each influencing factor.
For the purpose of this study, we mainly focus on dataset and language features Φ_L and Φ_D, as this already results in a significant search space, and gathering extensive experimental results with fine-grained tuning over model and training hyper-parameters is both expensive and relatively complicated. In the cases where we handle multiple models, we only use a single categorical model feature to denote the combination of model architecture and training procedure, denoted as Φ_C. We still use the term model to refer to this combination in the rest of the paper. We also omit the test set features, under the assumption that the data distributions for the training and testing data are the same (a fairly reasonable assumption if we ignore possible domain shift). Therefore, for all experiments below, our final prediction function is the following:

<>

In the next section we describe concrete instantiations of this function for several NLP tasks.
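To make this formulation concrete, the following sketch fits a gradient-boosted regressor of the kind used in the next section on featurized experimental records and queries it on an unseen setting. The feature names and numbers are placeholders for illustration; only the model family and the hyperparameter values (squared error objective, learning rate 0.1, depth 10, 100 trees, as reported in Section 3) come from the paper.

```python
import numpy as np
import xgboost as xgb

# Each row is one past experimental record: language/dataset features -> observed score.
# Feature names and values below are illustrative placeholders.
X_train = np.array([
    # [genetic_dist, syntactic_dist, log_train_size, word_overlap, ttr_src]
    [0.41, 0.35, 11.2, 0.18, 0.71],
    [0.63, 0.52,  9.8, 0.05, 0.69],
    [0.12, 0.20, 12.5, 0.33, 0.74],
])
y_train = np.array([23.4, 11.7, 31.2])   # e.g. BLEU scores of the corresponding experiments

# Hyperparameters follow the settings described in Section 3.
predictor = xgb.XGBRegressor(
    objective="reg:squarederror",
    learning_rate=0.1,
    max_depth=10,
    n_estimators=100,
)
predictor.fit(X_train, y_train)

# Predict the score of an untested setting from its features alone.
X_new = np.array([[0.55, 0.48, 10.4, 0.09, 0.70]])
print(predictor.predict(X_new))
```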
+ +3 NLP Task Instantiations + +To build a predictor for NLP task performance, we must 1) select a task, 2) describe its featurization, and 3) train a predictor. We describe details of these three steps in this section. + +<
>

Table 2: Statistics of the datasets we use for training predictors. # EXs denotes the total number of experiment instances; Task Metric reflects how the models are evaluated.

Tasks  We test on tasks including bilingual lexicon induction (BLI); machine translation trained on aligned Wikipedia data (Wiki-MT), on TED talks (TED-MT), and with cross-lingual transfer for translation into English (TSF-MT); cross-lingual dependency parsing (TSF-Parsing); cross-lingual POS tagging (TSF-POS); cross-lingual entity linking (TSF-EL); morphological analysis (MA); and universal dependency parsing (UD). Basic statistics on the datasets are outlined in Table 2.
For the Wiki-MT tasks, we collect experimental records directly from the paper describing the corresponding datasets (Schwenk et al., 2019). For TED-MT and all the transfer tasks, we use the results of Lin et al. (2019). For BLI, we conduct experiments using published results from three papers, namely Artetxe et al. (2016), Artetxe et al. (2017) and Xu et al. (2018). For MA, we use the results of the SIGMORPHON 2019 shared task 2 (McCarthy et al., 2019). Last, the UD results are taken from the CoNLL 2018 Shared Task on universal dependency parsing (Zeman et al., 2018b).

Featurization  For language features, we utilize six distance features from the URIEL Typological Database (Littell et al., 2017), namely geographic, genetic, inventory, syntactic, phonological, and featural distance.
The complete set of dataset features includes the following:

1. Dataset Size: The number of data entries used for training.

2. Word/Subword Vocabulary Size: The number of word/subword types.

3. Average Sentence Length: The average length of sentences from all experimental records.

4. Word/Subword Overlap: <>, where T1 and T2 denote the vocabularies of any two corpora.

5. Type-Token Ratio (TTR): The ratio between the number of types and the number of tokens (Richards, 1987) of one corpus.

6. Type-Token Ratio Distance: <>, where TTR1 and TTR2 denote the TTR of any two corpora.

7. Single Tag Type: Number of single tag types.

8. Fused Tag Type: Number of fused tag types.

9. Average Tag Length Per Word: Average number of single tags for each word.

10. Dependency Arcs Matching WALS Features: the proportion of dependency parsing arcs matching the following WALS features, computed over the training set: subject/object/oblique before/after verb and adjective/numeral before/after noun.

For transfer tasks, we use the same set of dataset features Φ_D as Lin et al. (2019), including features 1–6 on the source and the transfer language side. We also include language distance features between the source and transfer language, as well as between the source and target language. For MT tasks, we use features 1–6 and language distance features, but only between the source and target language. For MA, we use features 1, 2, 5 and the morphological-tag-related features 7–9. For UD, we use features 1, 2, 5, and 10. For BLI, we use language distance features and URIEL syntactic features for the source and the target language.
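Since the formulas for features 4–6 are elided in this copy, the small helpers below show one plausible way to compute them; they follow the definitions used by Lin et al. (2019) and may differ in detail from the paper's exact implementation.

```python
def type_token_ratio(tokens):
    """Feature 5: number of types divided by number of tokens."""
    return len(set(tokens)) / max(len(tokens), 1)

def word_overlap(tokens_a, tokens_b):
    """Feature 4 (assumed form): |T1 & T2| / (|T1| + |T2|) over the two vocabularies."""
    t1, t2 = set(tokens_a), set(tokens_b)
    return len(t1 & t2) / (len(t1) + len(t2))

def ttr_distance(tokens_a, tokens_b):
    """Feature 6 (assumed form): (1 - TTR1/TTR2)^2, following Lin et al. (2019)."""
    r1, r2 = type_token_ratio(tokens_a), type_token_ratio(tokens_b)
    return (1 - r1 / r2) ** 2

corpus_a = "the cat sat on the mat".split()
corpus_b = "a cat and a dog".split()
print(word_overlap(corpus_a, corpus_b), ttr_distance(corpus_a, corpus_b))
```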
Predictor  Our prediction model is based on gradient boosting trees (Friedman, 2001), implemented with XGBoost (Chen and Guestrin, 2016). This method is widely known as an effective means for solving problems including ranking, classification and regression. We also experimented with Gaussian processes (Williams and Rasmussen, 1996), but settled on gradient boosted trees because performance was similar and XGBoost's implementation is very efficient through the use of parallelism. We use squared error as the objective function for the regression and adopt a fixed learning rate of 0.1. To allow the model to fully fit the data we set the maximum tree depth to 10 and the number of trees to 100, and use the default regularization terms to prevent the model from overfitting.

4 Can We Predict NLP Performance?

In this section we investigate the effectiveness of NLPERF across different tasks and metrics. Following Lin et al. (2019), we conduct k-fold cross validation for evaluation. To be specific, we randomly partition the experimental records of ⟨L, D, C, S⟩ tuples into k folds, and use k−1 folds to train a prediction model and evaluate on the remaining fold. Note that this scenario is similar to filling in the blanks in Table 1, where we have some experimental records that we can train the model on, and predict the remaining ones.
For evaluation, we calculate the average root mean square error (RMSE) between the predicted scores and the true scores.

Baselines  We compare against a simple mean value baseline, as well as against language-wise mean value and model-wise mean value baselines. The simple mean value baseline outputs an average of the scores s from the training folds for all test entries in the left-out fold i as follows:

<> (2)

Note that for tasks involving multiple models, we calculate the RMSE score separately on each model and use the mean RMSE of all models as the final RMSE score.
The language-wise baselines make more informed predictions, taking into account only training instances with the same transfer, source, or target language (depending on the task setting). For example, the source-language mean value baseline for the jth test instance in fold i outputs an average of the scores s of the training instances that share the same source language features s-lang, as shown in Equation 3:

<> (3)

where 1 is the indicator function. Similarly, we define the target- and the transfer-language mean value baselines.
In a similar manner, we also compare against a model-wise mean value baseline for tasks that include experimental records from multiple models. Now, the prediction for the jth test instance in the left-out fold i is an average of the scores on the same dataset (as characterized by the language Φ_L and dataset Φ_D features) from all other models:

<> (4)

where <> and <> respectively denote the language and dataset features of the test instance.

Main Results  For multi-model tasks, we can do either Single Model prediction (SM), restricting training and testing of the predictor to a single model, or Multi-Model (MM) prediction using a categorical model feature. The RMSE scores of NLPERF along with the baselines are shown in Table 3. For all tasks, our single model predictor is able to more accurately estimate the evaluation score of unseen experiments compared to the single model baselines, confirming our hypothesis that there exists a correlation that can be captured between experimental settings and the downstream performance of NLP systems. The language-wise baselines are much stronger than the simple mean value baseline but still perform worse than our single model predictor.
Similarly, the model-wise baseline significantly outperforms the simple mean value baseline, because results from other models reveal much information about the dataset.
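Because the baseline equations are elided in this copy, here is a small sketch of how the simple and language-wise mean value baselines can be computed; the record layout and the fallback behaviour are assumptions for illustration.

```python
import numpy as np

# Hypothetical records: (source_language, score) pairs from the training folds.
train_records = [("tur", 18.2), ("por", 34.0), ("tur", 21.5), ("glg", 29.3)]

def mean_baseline(records):
    """Simple mean value baseline (Eq. 2): average score over all training records."""
    return np.mean([s for _, s in records])

def source_language_mean_baseline(records, src_lang):
    """Language-wise baseline (Eq. 3): average over records sharing the source language;
    falls back to the global mean when no record matches."""
    matched = [s for lang, s in records if lang == src_lang]
    return np.mean(matched) if matched else mean_baseline(records)

print(source_language_mean_baseline(train_records, "tur"))   # 19.85
```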
<
>

Table 3: RMSE scores of the three baselines and our predictions under the single-model and multi-model settings (missing values correspond to settings not applicable to the task). All results are from k-fold (k = 5) evaluations averaged over 10 random runs.

Even so, our multi-model predictor still outperforms the model-wise baseline.
The results imply that for a wide range of tasks, our predictor is able to reasonably estimate left-out slots in a partly populated table given the results of other experimental records, without actually running the system.
We should note that RMSE scores across different tasks should not be directly compared, mainly because the scale of each evaluation metric is different. For example, a BLEU score (Papineni et al., 2002) for MT experiments typically ranges from 1 to 40, while an accuracy score usually has a much larger range; for example, BLI accuracy ranges from 0.333 to 78.2 and TSF-POS accuracy ranges from 1.84 to 87.98, which consequently makes the RMSE scores of these tasks higher.

Comparison to Expert Human Performance  We constructed a small-scale case study to evaluate whether NLPERF is competitive with the performance of NLP sub-field experts. We focused on the TED-MT task and recruited 10 MT practitioners, all of whom had published at least 3 MT-related papers in ACL-related conferences.
In the first set of questions, the participants were presented with language pairs from one of the k data folds along with the dataset features and were asked to estimate an eventual BLEU score for each data entry. In the second part of the questionnaire, the participants were tasked with making estimations on the same set of language pairs, but this time they also had access to the features and BLEU scores from all the other folds.3

<
>

Table 4: Our model performs better than human MT experts on the TED-MT prediction task.

The partition of the folds is consistent between the human study and the training/evaluation for the predictor. While the first sheet is intended to familiarize the participants with the task, the second sheet fairly adopts the training/evaluation setting for our predictor. As shown in Table 4, our participants outperform the mean baseline even without information from other folds, demonstrating their own strong prior knowledge in the field. In addition, the participants make more accurate guesses after acquiring more information on the experimental records in other folds. In neither case, though, are the human experts competitive with our predictor. In fact, only one of the participants achieved performance comparable to our predictor.

Feature Perturbation  Another question of interest concerning predicting performance is how the model will perform when trained on data of a different size (Kolachina et al., 2012a). To test NLPERF's extrapolation ability in this regard, we conduct an array of experiments on one language pair with various data sizes on the Wiki-MT task. We pick two language pairs, Turkish to English (TR→EN) and Portuguese to English (PT→EN), as our testbed for the Wiki-MT task. We sample parallel datasets of different sizes and train MT models with each sampled dataset to obtain the true BLEU scores. On the other hand, we collect the features of all sampled datasets and use our predictor (trained over all other language pairs) to obtain predictions. The true and predicted BLEU scores are shown in Figure 1. Our predictor achieves a very low average RMSE of 1.83 for the TR→EN pair but a relatively higher RMSE of 9.97 for the PT→EN pair. The favorable performance on the tr-en pair demonstrates that our predictor can extrapolate over the dataset size feature. In contrast, the predictions on the pt-en pair are significantly less accurate. This is due to the fact that there are only two other experimental settings scoring as high as 34 BLEU, with data sizes of 3378k (en-es) and 611k (gl-es), leading to the predictor's inadequacy in predicting high BLEU scores for low-resource datasets during extrapolation. This reveals the fact that while the predictor is able to extrapolate performance on settings similar to what it has seen in the data, NLPERF may be less successful under circumstances unlike its training inputs.

2 None of the study participants were affiliated with the authors' institutions, nor were they familiar with this paper's content.
3 The interested reader can find an example questionnaire (and make estimations over one of the folds) in Appendix A.

<
>

Figure 1: Our model's predicted BLEU scores and true BLEU scores, on sampled TR→EN datasets (sizes 10k/50k/100k/200k/478k) and PT→EN datasets (sizes 100k/500k/1000k/2000k/2462k), achieving RMSE scores of 1.83 and 9.97, respectively.

5 What Datasets Should We Test On?

As shown in Table 1, it is common practice to test models on a subset of all available datasets. The reason for this is practical: it is computationally prohibitive to evaluate on all settings. However, if we pick test sets that are not representative of the data as a whole, we may mistakenly reach unfounded conclusions about how well models perform on other data with distinct properties. For example, models trained on a small-sized dataset may not scale well to a large-sized one, or models that perform well on languages with a particular linguistic characteristic may not do well on languages with other characteristics (Bender and Friedman, 2018).
Here we ask the following question: if we are only practically able to test on a small number of experimental settings, which ones should we test on to achieve maximally representative results? Answering the question could have practical implications: organizers of large shared tasks like SIGMORPHON (McCarthy et al., 2019) or UD (Zeman et al., 2018a) could create a minimal subset of settings upon which they would ask participants to test to get representative results; similarly, participants could possibly expedite the iteration of model development by testing on the representative subset only. A similar avenue for researchers and companies deploying systems over multiple languages could lead to not only financial savings, but potentially a significant cut-down of emissions from model training (Strubell et al., 2019).
We present an approximate explorative solution to the problem mentioned above. Formally, assume that we have a set N, comprising experimental records (both features and scores) of n datasets for one task. We set a number m (smaller than n) of datasets to select, and search for the size-m subset whose records, when used to train the predictor, lead to the lowest RMSE on the remaining datasets:

<>. (5)

Naturally, enumerating all possible subsets would be prohibitively costly, even though it would lead to the optimal solution. Instead, we employ a beam-search-like approach to efficiently search for an approximation to the best-performing subset of arbitrary size. Concretely, we start our approximate search with an exhaustive enumeration of all subsets of size 2. At each following step t, we only take the best k subsets <> into account and discard the rest. As shown in Equation 6, for each candidate
subset, we expand it with one more data point,

<>. (6)

<
>

Figure 2: Beam search results (beam size = 100) for up to the 5 most (and least) representative datasets for 4 NLP tasks. We also show random search results averaged over 100 random runs.

For tasks that involve multiple models, we take experimental records of the selected dataset from all models into account during expansion. Given all expanded subsets, we train a predictor for each, evaluate it on the rest of the datasets, and keep the best-performing k subsets <> with minimum RMSE scores for the next step. Furthermore, note that by simply changing the arg min to an arg max in Equation 5, we can also find the least representative datasets.
We present search results for four tasks as beam search progresses in Figure 2, with the corresponding RMSE scores over all remaining datasets on the y-axis. For comparison, we also conduct random searches by expanding the subset with a randomly selected experimental record. In all cases, the most representative sets are an aggregation of datasets with diverse characteristics such as languages and dataset sizes. For example, in the Wiki-MT task, the 5 most representative datasets include languages that fall into a diverse range of language families such as Romance, Turkic, Slavic, etc., while the least representative ones include duplicate pairs (opposite directions), mostly involving English. The phenomenon is more pronounced in the TED-MT task, where not only the 5 most representative source languages are diverse, but also the dataset sizes: specifically, Malay-English (msa-eng) is a tiny dataset (5k parallel sentences), and Hebrew-English (heb-eng) is a high-resource case (212k parallel sentences).
Notably, for the BLI task, to test how representative the commonly used datasets are, we select the 5 most frequent language pairs shown in Table 1, namely en-de, es-en, en-es, fr-en, and en-fr, for evaluation. Unsurprisingly, we get an RMSE score as high as 43.44, quite close to the performance of the least representative set found using beam search. This finding indicates that the standard practice of choosing datasets for evaluation is likely unrepresentative of results over the full dataset spectrum, well aligned with the claims in Anastasopoulos and Neubig (2020).
A particularly encouraging observation is that the predictor trained with only the 5 most representative datasets can achieve an RMSE score comparable to k-fold validation, which required using all of the datasets for training.5 This indicates that one would only need to train NLP models on a small set of representative datasets to obtain reasonably plausible predictions for the rest.

5 To be accurate, k−1 folds of all datasets.
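To make the search procedure above concrete, here is a compact sketch of the beam search over dataset subsets; `score_fn` is a placeholder that is assumed to train the predictor on a candidate subset's records and return the RMSE on the remaining datasets (Equation 5), and the greedy expansion mirrors Equation 6.

```python
from itertools import combinations

def beam_search_subsets(datasets, score_fn, max_size=5, beam=100):
    """Approximate search for the most representative subsets.

    datasets: list of dataset identifiers.
    score_fn(subset): assumed to train a predictor on the subset and return the
    RMSE on the remaining datasets (lower is better).
    """
    # step 1: exhaustively score all subsets of size 2
    candidates = [set(pair) for pair in combinations(datasets, 2)]
    candidates.sort(key=score_fn)
    candidates = candidates[:beam]

    # following steps: expand each kept subset by one dataset at a time
    while len(next(iter(candidates))) < max_size:
        expanded = {frozenset(s | {d}) for s in candidates for d in datasets if d not in s}
        expanded = sorted((set(s) for s in expanded), key=score_fn)
        candidates = expanded[:beam]          # keep the best `beam` subsets
    return candidates[0]
```

Searching for the least representative subsets only requires sorting in the opposite direction, matching the arg min to arg max change noted above.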
6 Can We Extrapolate Performance for New Models?

In another common scenario, researchers propose new models for an existing task. It is both time-consuming and computationally intensive to run experiments with all settings for a new model. In this section, we explore whether we can use past experimental records from other models and a minimal set of experiments from the new model to give a plausible prediction over the rest of the datasets, potentially reducing the time and resources needed for experimenting with the new model to a large extent. We use the task of UD parsing as our testbed,6 as it is the task with the most unique models (25 to be exact). Note that we still only use a single categorical feature for the model type.
To investigate how many experiments are needed to obtain a plausible prediction for a new model, we first split the experimental records equally into a sample set and a test set. Then we randomly sample <> experimental records from the sample set and add them into the collection of experimental records of past models. Each time, we re-train a predictor and evaluate on the test set. The random split repeats 50 times and the random sampling repeats 50 times, adding up to a total of 2500 experiments. We use the mean value of the results from other models, shown in Equation 7, as the prediction baseline for the left-out model; because experimental results of other models reveal significant information about the dataset, this serves as a relatively strong baseline:

<>. (7)

M denotes a collection of models and k denotes the left-out model.
We show the prediction performance (in RMSE) over 8 systems7 in Figure 3. Interestingly, the predictor trained with no records of the new model (0) outperforms the mean value baseline for the 4 best systems, while the opposite is the case for the 4 worst systems. Since there is no information provided about the new model, the predictions are solely based on dataset and language features. One reason might explain the phenomenon: the correlation between the features and the scores of the worse-performing systems is different from that of the better-performing systems, so the predictor is unable to generalize well.
In the following discussion, we use RMSE@n to denote the RMSE from the predictor trained with n data points of a new model. The relatively low RMSE@0 scores indicate that other models' features and scores are informative for predicting the performance of the new model even without any new-model information. Comparing RMSE@0 and RMSE@1, we observe a consistent improvement for almost all systems, indicating that NLPERF trained on even a single extra random example achieves more accurate estimates over the test sets. Adding more data points consistently leads to additional gains. However, predictions on worse-performing systems benefit more from this than those on better-performing systems, indicating that their feature-performance correlations might be considerably different. The findings here indicate that by extrapolating from past experiments, one can make plausible judgments for newly developed models.

6 MA and BLI task results are in Appendix C.
7 The best and worst 4 systems from the shared task.
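A minimal sketch of the RMSE@n protocol described above is given below; it assumes the feature matrices already contain the categorical model feature, and the predictor hyperparameters are those from Section 3. It illustrates the procedure rather than reproducing the paper's exact experimental code.

```python
import numpy as np
import xgboost as xgb

def rmse_at_n(past_X, past_y, new_X, new_y, n, seed=0):
    """Train on all past-model records plus n sampled records of the new model,
    then report RMSE on the new model's held-out test half (RMSE@n)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(new_X))
    half = len(new_X) // 2
    sample_set, test_set = idx[:half], idx[half:]          # equal split, as described above
    chosen = rng.choice(sample_set, size=n, replace=False)

    X_train = np.vstack([past_X, new_X[chosen]]) if n > 0 else past_X
    y_train = np.concatenate([past_y, new_y[chosen]]) if n > 0 else past_y

    predictor = xgb.XGBRegressor(objective="reg:squarederror",
                                 learning_rate=0.1, max_depth=10, n_estimators=100)
    predictor.fit(X_train, y_train)
    pred = predictor.predict(new_X[test_set])
    return float(np.sqrt(np.mean((pred - new_y[test_set]) ** 2)))
```

Averaging this quantity over repeated random splits and samplings gives the curves discussed in Figure 3.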
7 Related Work

As discussed in Domhan et al. (2015), there are two main threads of work focusing on predicting the performance of machine learning algorithms. The first thread is to predict the performance of a method as a function of its training time, while the second thread is to predict a method's performance as a function of the training dataset size. Our work belongs to the second thread, but could easily be extended to encompass training time/procedure.
In the first thread, Kolachina et al. (2012b) attempt to infer learning curves based on training data features and extrapolate the initial learning curves based on BLEU measurements for statistical machine translation (SMT). Extrapolating the initial learning curves allows for predictions on the remainder, enabling early termination of a bad run (Domhan et al., 2015).
In the second thread, Birch et al. (2008) adopt linear regression to capture the relationship between data features and SMT performance and find that the amount of reordering, the morphological complexity of the target language, and the relatedness of the two languages explain the majority of performance variability. More recently, Elsahar and Gallé (2019) use domain-shift metrics such as H-divergence based metrics to predict the drop in performance under domain shift. Rosenfeld et al. (2020) explore the functional form of the dependency of the generalization error of neural models on model and data size. We view our work as a generalization of such approaches, appropriate for application to any NLP task.

<
>

Figure 3: RMSE scores on the UD task from the dataset-wise mean value predictor (the dashed black line in each graph) and from predictors trained with experimental records of other models and <> records from a new model.

8 Conclusion and Future Work

In this work, we investigate whether the experimental setting itself is informative for predicting the evaluation scores of NLP tasks. Our findings promisingly show that, given a sufficient number of past training experimental records, our predictor can 1) outperform human experts; 2) make plausible predictions even for new models and languages; 3) extrapolate well on features like dataset size; and 4) provide a guide on how we should choose representative datasets for fast iteration.
While this discovery is a promising start, there are still several avenues for improvement in future work.
First, the dataset and language settings covered in our study are still limited. The experimental records we use are from relatively homogeneous settings, e.g. all datasets in the Wiki-MT task are sentence-pieced to have 5000 subwords, indicating that our predictor may fail for other subword settings. Our model also fails to generalize to cases where feature values are out of the range of the training experimental records. We attempted to apply the Wiki-MT predictor to a low-resource MT dataset, translating from Mapudungun (arn) to Spanish (spa) with the dataset from Duan et al. (2019), but ended up with a poor RMSE score. It turned out that the average sentence length of the arn/spa dataset is much lower than that of the training datasets, and our predictors fail to generalize to this different setting.
Second, using a categorical feature to denote model types constrains its expressive power for modeling performance. In reality, a slight change in model hyperparameters (Hoos and Leyton-Brown, 2014; Probst et al., 2019), optimization algorithms (Kingma and Ba, 2014), or even random seeds (Madhyastha and Jain, 2019) may give rise to a significant variation in performance, which our predictor is not able to capture. While investigating the systematic implications of model structures or hyperparameters is practically infeasible in this study, we may use additional information such as textual model descriptions for modeling NLP models and training procedures more elaborately in the future.
Lastly, we assume that the distribution of training and testing data is the same, which does not consider domain shift. On top of this, there might also be a domain shift between the datasets of training and testing experimental records. We believe that modeling domain shift is a promising future direction for improving performance prediction.

Acknowledgement

The authors sincerely thank all the reviewers for their insightful comments and suggestions; Philipp Koehn, Kevin Duh, Matt Post, Shuoyang Ding, Xuan Zhang, Adi Renduchintala, Paul McNamee, Toan Nguyen and Kenton Murray for conducting the human evaluation for the TED-MT task; Daniel Beck for discussions on Gaussian Processes; and Shruti Rijhwani, Xinyi Wang, and Paul Michel for discussions on this paper. This work is generously supported by the National Science Foundation under grant 1761548.

References

Antonios Anastasopoulos and Graham Neubig. 2020.
Should all cross-lingual embeddings speak English? In Proc. ACL. To appear.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019. Bilingual lexicon induction through unsupervised machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5002–5007, Florence, Italy. Association for Computational Linguistics.
Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.
Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 745–754. Association for Computational Linguistics.
Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM.
Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270, Brussels, Belgium. Association for Computational Linguistics.
Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
Mingjun Duan, Carlos Fasola, Sai Krishna Rallabandi, Rodolfo M. Vega, Antonios Anastasopoulos, Lori Levin, and Alan W. Black. 2019. A resource for computational experiments on Mapudungun. In Proc. LREC. To appear.
Hady Elsahar and Matthias Gallé. 2019. To annotate or not? Predicting performance drop under domain shift. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2163–2173.
Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.
Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902.
Holger Hoos and Kevin Leyton-Brown. 2014. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762.
Jiaji Huang, Qiang Qiu, and Kenneth Church. 2019. Hubless nearest neighbor search for bilingual lexicon induction.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4072–4080, Florence, Italy. Association for Computational Linguistics.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012a. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22–30, Jeju Island, Korea. Association for Computational Linguistics.
Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. 2012b. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 22–30. Association for Computational Linguistics.
Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.
Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14.
Pranava Madhyastha and Rishabh Jain. 2019. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 929–939, Hong Kong, China. Association for Computational Linguistics.
Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.
Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl. 2019. Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research, 20(53):1–32.
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, USA.
Brian Richards. 1987. Type/token ratios: What do they really tell us? Journal of Child Language, 14(2):201–209.
Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), Honolulu, Hawaii.
Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. 2020. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations.
Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy.
Appendix A Questionnaire

An example of the first questionnaire from our user case study is shown below. The second sheet also included the results in 44 more language pairs. We provide an answer key after the second sheet.

Please provide your prediction of the BLEU score based on the language pair and dataset features (the domain of the training and test sets is TED talks). After you finish, please go to sheet v2.

<
> + +Please provide your prediction of the BLEU score in the yellow area given all the information in this sheet. Note that all experiments are trained with the same model. + +<
>

B Representative datasets

In this section, we show the search results for the most and least representative subsets for the rest of the NLP tasks.

<
> + +Figure 4: Beam search results (beam size=100) for up to the 5 most (and least) representative datasets for the remaining NLP tasks. We also show random search results of corresponding sizes. + +C New Model + +In this section, we show the extrapolation performance for new models on BLI, MA and the remaining systems of UD. + +<
>

Figure 5: RMSE scores of BLI task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0-5 records from a new model (as indicated by the title of each graph).

<
>

Figure 6: RMSE scores of MA task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0-5 records from a new model (as indicated by the title of each graph).

<
>

Figure 7: RMSE scores of UD task from dataset-wise mean value predictor (the dashed black line in each graph) and predictors trained with experimental records of other models and 0-5 records from a new model (as indicated by the title of each graph).

D Feature importance

In this section, we show the plots of feature importance for all the tasks.
<> <> <>


<> <> <>
Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

Charles H. Martin, Tongsu (Serena) Peng, and Michael W. Mahoney

Abstract

In many practical applications, one works with deep neural network (DNN) models trained by someone else. For such pretrained models, one typically does not have access to training data or test data. Moreover, one does not know many details about the model, such as the specifics of the training data, the loss function, the hyperparameter values, etc. Given one or many pretrained models, can one say anything about the expected performance or quality of the models? Here, we present and evaluate empirical quality metrics for pretrained DNN models at scale. Using the open-source WeightWatcher tool, we analyze hundreds of publicly-available pretrained models, including older and current state-of-the-art models in computer vision (CV) and natural language processing (NLP). We examine both familiar norm-based capacity control metrics (Frobenius and Spectral norms) as well as newer Power Law (PL) based metrics (including fitted PL exponents, <>, and the Weighted Alpha metric, <>, from the recently-developed Theory of Heavy-Tailed Self Regularization (HT-SR)). We also introduce the α-Shatten Norm metric. We find that norm-based metrics correlate well with reported test accuracies for well-trained models across nearly all CV architecture series. On the other hand, we find that norm-based metrics cannot distinguish "good-versus-bad" models, which, arguably, is the point of needing quality metrics; indeed, they may give spurious results. We also find that PL-based metrics do much better: quantitatively better at discriminating among a series of "good-better-best" models, and qualitatively better at discriminating "good-versus-bad" models. PL-based metrics can also be used to characterize fine-scale properties of these models, and we introduce the layer-wise Correlation Flow as a new quality assessment. We show how poorly-trained (and/or poorly fine-tuned) models may exhibit both Scale Collapse and unusually large PL exponents (α > 6), in particular for recent NLP models. Our techniques, as implemented in the WeightWatcher tool, can be used to identify when a pretrained DNN has problems that cannot be detected simply by examining training/test accuracies.


1 Introduction

A common problem in machine learning (ML) is to evaluate the quality of a given model. A popular way to accomplish this is to train a model and then evaluate its training/testing error. There are many problems with this approach. The training/testing curves give very limited insight into the overall properties of the model; they do not take into account the (often large human and CPU/GPU) time for hyperparameter fiddling; they typically do not correlate with other properties of interest such as robustness or fairness or interpretability; and so on. A less well-
A less well- + known problem, but one that is increasingly important, in particular in industrial-scale artificial + intelligence (AI), arises when the model user is not the model developer. Here, one may not + have access to either the training data or the testing data. Instead, one may simply be given a + model that has already been trained a pretrained model|and need to use it as-is, or to fine-tune + and/or compress it and then use it. + Natively|but in our experience commonly, among ML practitioners and ML theorists|if one + does not have access to training or testing data, then one can say absolutely nothing about the + quality of a ML model. This may be true in worst-case theory, but models are used in practice, + and there is a need for a practical theory to guide that practice. Moreover, if ML is to become + an industrial process, then that process will become siloed: some groups will gather data, other + groups will develop models, and other groups will use those models. Users of models can not be + expected to know the precise details of how models were built, the specifics of data that were + used to train the model, what was the loss function or hyperparameter values, how precisely the + model was regularized, etc. + Moreover, for many large scale, practical applications, there is no obvious way to define an + ideal test metric. For example, models that generate fake text or conversational chatbots may + use a proxy, like perplexity, as a test metric. In the end, however, they really require human + evaluation. Alternatively, models that cluster user profiles, which are widely used in areas such + as marketing and advertising, are unsupervised and have no obvious labels for comparison and/or + evaluation. In these and other areas, ML objectives can be poor proxies for downstream goals. + Most importantly, in industry, one faces unique practical problems such as: do we have enough + data for this model? Indeed, high quality, labeled data can be very expensive to acquire, and this + cost can make or break a project. Methods that are developed and evaluated on any well-defined + publicly-available corpus of data, no matter how large or diverse or interesting, are clearly not + going to be well-suited to address problems such as this. It is of great practical interest to have + metrics to evaluate the quality of a trained model|in the absence of training/testing data and + without any detailed knowledge of the training/testing process. We seek a practical theory for + pretrained models which can predict how, when, and why such models can be expected to perform + well or poorly. + In this paper, we present and evaluate quality metrics for pretrained deep neural network + (DNN) models, and we do so at scale. We consider a large suite of hundreds of publicly-available + models, mostly from computer vision (CV) and natural language processing (NLP). By now, there + are many such state-of-the-art models that are publicly-available, e.g., there are now hundreds + of pretrained models in CV (500) and NLP (100). 1 These provide a large corpus of models + that by some community standard are state-of-the-art. 2 Importantly, all of these models have + been trained by someone else and have been viewed to be of sufficient interest/quality to be made + publicly-available; and, for all of these models, we have no access to training data or testing data, + and we have no knowledge of the training/testing protocols. 
The quality metrics we consider are based on the spectral properties of the layer weight matrices. They are based on norms of weight matrices (such norms have been used in traditional statistical learning theory to bound capacity and to construct regularizers) and/or on parameters of power law (PL) fits of the eigenvalues of weight matrices (such PL fits are based on statistical mechanics approaches to DNNs). Note that, while we use traditional norm-based and PL-based metrics, our goals are not the traditional goals. Unlike more common ML approaches, we do not seek a bound on the generalization (e.g., by evaluating training/test error during training), we do not seek a new regularizer, and we do not aim to evaluate a single model (e.g., as with hyperparameter optimization).3 Instead, we want to examine different models across common architecture series, and we want to compare models between different architectures themselves, and in both cases, we ask:
Can we predict trends in the quality of pretrained DNN models without access to training or testing data?
To answer this question, we analyze hundreds of publicly-available pretrained state-of-the-art CV and NLP models. Here is a summary of our main results.
Norm-based metrics and well-trained models. Norm-based metrics do a reasonably good job at predicting quality trends in well-trained CV/NLP models.
Norm-based metrics and poorly-trained models. Norm-based metrics may give spurious results when applied to poorly-trained models (e.g., models trained without enough data, etc.), exhibiting Scale Collapse for these models.
PL-based metrics and model quality. PL-based metrics do much better at predicting quality trends in pretrained CV/NLP models. They are quantitatively better at discriminating good-better-best trends, and qualitatively better at distinguishing "good-versus-bad" models.
PL-based metrics and model diagnostics. PL-based metrics can also be used to characterize fine-scale model properties (including layer-wise Correlation Flow) in well-trained and poorly-trained models, and they can be used to evaluate model enhancements (e.g., distillation, fine-tuning, etc.).
We emphasize that our goal is a practical theory to predict trends in the quality of state-of-the-art DNN models, i.e., not to make a statement about every publicly-available model. We have examined hundreds of models, and we identify general trends, but we also highlight interesting exceptions.

The WeightWatcher Tool. All of our computations were performed with the publicly-available WeightWatcher tool (version 0.2.7) [1]. To be fully reproducible, we only examine publicly-available, pretrained models, and we also provide all Jupyter and Google Colab notebooks used in an accompanying github repository [2]. See Appendix A for details on how to reproduce all results.

Organization of this paper. We start in Section 2 and Section 3 with background and an overview of our general approach.
In Section4, we study three well-known widely-available + DNN CV architectures (the VGG, ResNet, and DenseNet series of models); and we provide an + illustration of our basic methodology, both to evaluate the different metrics against reported test + accuracies and to use quality metrics to understand model properties. Then, in Section5, we + look at several variations of a popular NLP DNN architecture (the OpenAI GPT and GPT2 + models); and we show how model quality and properties vary between several variants of GPT + and GPT2, including how metrics behave similarly and differently. Then, in Section6, we present + results based on an analysis of hundreds of pretrained DNN models, showing how well each metric + predicts the reported test accuracies, and how the PL-based metrics perform remarkably well. + Finally, in Section7, we provide a brief discussion and conclusion. + 3 One could of course use these techniques to improve training, and we have been asked about that, but we are + not interested in that here. Our main goal here is to use these techniques to evaluate properties of state-of-the-art + pretrained DNN models. + + 2 Background and Related Work + + Most theory for DNNs is applied to small toy models and assumes access to data. There is very + little work asking how to predict, in a theoretically-principled manner, the quality of large-scale + state-of-the-art DNNs, and how to do so without access to training data or testing data or details + of the training protocol, etc. Our approach is, however, related to two other lines of work. + + Statistical mechanics theory for DNNs. Statistical mechanics ideas have long had influence + on DNN theory and practice [3,4,5]; and our best-performing metrics (those using fitted PL + exponents) are based on statistical mechanics [4,6,7,8,9], in particular the recently-developed + Theory of Heavy Tailed Self Regularization (HT-SR) [6,7,9]. We emphasize that the way in + which we (and HT-SR Theory) use statistical mechanics theory is quite different than the way + it is more commonly formulated. Several very good overviews of the more common approach are + available [3,5]. We use statistical mechanics in a broader sense, drawing upon techniques from + quantitative nance and random matrix theory. Thus, much more relevant for our methodological + approach is older work of Bouchaud, Potters, Sornette, and coworkers [10,11,12,13] on the + statistical mechanics of heavy tailed and strongly correlated systems. + + Norm-based capacity control theory. There is also a large body of work on using norm- + based metrics to bound generalization error [14,15,16]. In this area, theoretical work aims + to prove generalization bounds, and applied work uses these norms to construct regularizers to + improve training. While we do find that norms provide relatively good quality metrics, at least + for distinguishing good-better-best among well-trained models, we are not interested in proving + generalization bounds or developing new regularizers. + + + 3 Methods + + Let us write the Energy Landscape (or optimization function, parameterized by <> and <>) + for a DNN wit <> layers, activation functions <>, and <> weight matrices <> and biases + <>, as: + + <> (1) + + Each DNN layer contains one or more layer 2D <> weight matrices, <>, or pre-activation + maps, <>, extracted from 2D Convolutional layers, and whereN > M.4 (We may drop the i + and/or <> subscripts below.) 
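As a concrete illustration of where these 2D matrices come from in practice, the short PyTorch sketch below treats Linear layer weights as-is and slices each Conv2D kernel into its k x k pre-activation maps; this anticipates method (1) of Appendix A.1, and the helper name layer_matrices is ours, not part of any library.

import torch.nn as nn

def layer_matrices(model):
    # Yield (name, 2D weight matrix) pairs: Linear layers as-is, Conv2D layers sliced into
    # their k*k pre-activation maps (cf. method (1) in Appendix A.1).
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            yield name, module.weight.detach()
        elif isinstance(module, nn.Conv2d):
            W = module.weight.detach()              # shape (out_channels, in_channels, kh, kw)
            kh, kw = W.shape[2], W.shape[3]
            for i in range(kh):
                for j in range(kw):
                    yield f"{name}[{i},{j}]", W[:, :, i, j]   # one (out x in) map per kernel index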
See Appendix A for how we define the Conv2D layer matrixes and + for our choices of normalization. + Assume we are given several pretrained DNNs, e.g., as part of an architecture series. The + models have been trained and evaluated on labeled data fdi <>, using standard techniques. + The pretrained pytorch model files are publicly-available, and the test accuracies have been + reported online. In this study, we do not have access to this data, and we have not trained + any of the models ourselves, nor have we re-evaluated the test accuracies. We expect that most + well-trained, production-quality models will employ one or more forms of regularization, such as + Batch Normalization (BN), Dropout, etc., and many will also contain additional structure such + as Skip Connections, etc. Here, we will ignore these details, and will focus only on the pretrained + layer weight matrices Wl . + + 4 We do not use intra-layer information from the models in our quality metrics, but (as we will describe) our + metrics can be used to learn about intra-layer model properties. + + DNN Empirical Quality Metrics. The best performing empirical quality metrics depend + on the norms and/or spectral properties of each weight matrix,W, and/or, equivalently, it’s + Empirical Correlation Matrix:X=WT W. + Here, we consider the following metrics. + + <> + + Here, <> is the i th eigenvalue of the X, and <> is the maximum eigenvalue. Recall that the + eigenvalues are squares of the singular values <> of <>. Also, note that we do not i normalize + X by <>; see Appendix A for a discussion of this issue. + The first two norms are well-known in ML; the last two deserve special mention. The empirical + parameter is the Power Law (PL) exponent that arises in the recently-developed HT-SR + Theory [6,7,9]. Operationally, is determined by using the publicly-available Weight Watcher + tool [1] to fit the Empirical Spectral Density (ESD) of X, i.e., a histogram of the eigenvalues, call + it <>, to a truncated PL, + <> (2) + + Each of these quantities is defined for a given layer W matrix. + For norm-based metrics, we use the average of the log norm, and to the appropriate power. + Informally, this amounts to assuming that the layer weight matrices are statistically independent, + in which case we can estimate the model complexityC, or test accuracy, with a standard Product + Norm (which resembles a data dependent VC complexity), + + <>; (3) + + where <> is a matrix norm. The log complexity, + + <>; (4) + + takes the form of an average Log Norm. For the Frobenius Norm metric and Spectral Norm + metric, we can use Eqn. (4) directly. 6 + The Weighted Alpha metric is an average of <> over all layers <>, weighted by the + size, or scale, or each matrix, + + <>; (5) + + where L is the total number of layer weight matrices. The Weighted Alpha metric was introduced + previously [9], where it was shown to correlate well with trends in reported test accuracies of + pretrained DNNs, albeit on a limited set of models. + Based on this, in this paper, we introduce and evaluate the -Shatten Norm metric. Notice + for the -Shatten Norm metric, however,l varies from layer to layer, and so in Eqn. (6) it can + not be taken out of the sum: + + 5 Notice <>. + 6 When taking <>, the 2 comes down and out of the sum, and thus ignoring it only changes the metric F by a constant factor. 
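To fix ideas, the per-layer quantities defined above can be approximated for a single weight matrix W with numpy and the powerlaw package; this is a simplified sketch under our own conventions (the WeightWatcher tool handles the truncated fit and normalization details more carefully), and, as footnote 6 notes, squared-versus-unsquared norm conventions only shift the log metrics by constant factors.

import numpy as np
import powerlaw   # used here only to estimate the PL exponent; an assumption, not the tool's internals

def layer_metrics(W):
    # W: 2D array of shape (N, M) with N >= M. Returns simplified versions of the
    # per-layer quantities discussed above.
    W = np.asarray(W, dtype=np.float64)
    evals = np.linalg.svd(W, compute_uv=False) ** 2   # eigenvalues of X = W^T W (the ESD)
    lambda_max = evals.max()
    alpha = powerlaw.Fit(evals).power_law.alpha       # PL exponent of the ESD (untruncated fit)
    return {
        "log_frobenius_norm": np.log10(evals.sum()),        # log sum_i lambda_i
        "log_spectral_norm": np.log10(lambda_max),          # log lambda_max
        "alpha": alpha,
        "alpha_weighted": alpha * np.log10(lambda_max),     # one layer's term of Weighted Alpha
        "log_alpha_norm": np.log10(np.sum(evals ** alpha)), # log alpha-Shatten norm of X
    }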
+ + We use X to emphasize that <> depends on the ESD of X.2 + + <> (6) + + For small <>, the Weighted Alpha metric approximates the Log -Shatten norm, as can be shown + with a statistical mechanics and random matrix theory derivation [17]; and the Weighted Alpha + and -Shatten norm metrics often behave like an improved, weighted average Log Spectral Norm, + and may track this metric in some cases. + To avoid confusion, let us clarify the relationship between <> and <>. + We fit the ESD of the + correlation matrix X to a truncated PL, parameterized by 2 values: the PL exponent <>, and the + maximum eigenvalue <>. (Technically, we also need the minimum eigenvalue <>, but this + detail does not affect our analysis.) The PL exponent <> measures of the amount of correlation + in a DNN layer weight matrixW. It is valid for <>, and it is scale-invariant, i.e., it does + not depend on the normalization ofWorX. The <> is a measure of the size, or scale, of W. + Multiplying each <> by the corresponding log <> weighs \bigger" layers more, and averaging + this product leads to a balanced, Weighted Alpha metric for the entire DNN. + + Convolutional Layers and Normalization issues. There are several technical issues + (regarding spectral analysis of convolutional layers and normalization of empirical matrices) that + are important for reproducibility of our results. See Appendix A for a discussion. + + 4 Comparison of CV models + + In this section, we examine empirical quality metrics described in Section3for several CV model + architecture series. This includes the VGG, ResNet, and DenseNet series of models, each of which + consists of several pretrained DNN models, trained on the full ImageNet [18] dataset, and each + of which is distributed with the current open source pyTorch framework (version 1.4) [19]. This + also includes a larger set of ResNet models, trained on the ImageNet-1K dataset [18], provided + on the OSMR \Sandbox for training convolutional networks for computer vision" [20], which we + call the ResNet-1K series. + We perform coarse model analysis, comparing and contrasting the four model series, and + predicting trends in model quality. We also perform fine layer analysis, as a function of depth + for these models, illustrating that PL-based metrics can provide novel insights among the VGG, + ResNet/ResNet-1K, and DenseNet architectures. + + Average Quality Metrics versus Reported Test Accuracies. We have examined the + performance of the four quality metrics (Log Frobenius norm, Log Spectral norm, Weighted Alpha, + and Log -Norm) applied to each of the VGG, ResNet, ResNet-1K, and DenseNet series. To start, + Figure1considers the VGG series (in particular, the pretrained models VGG11, VGG13, VGG16, + and VGG19, with and without BN), and it plots the four quality metrics versus the reported test + accuracies [19], 7 as well as a basic linear regression line. All four metrics correlate quite well + with the reported Top1 accuracies, with smaller norms and smaller values of <> implying better + generalization (i.e., greater accuracy, lower error). While all four metrics perform well, notice + that the Log -Norm metric (<>) performs best (with an RMSE of 0:42, see Table 1); <> + and the Weighted Alpha metric (<>), which is an approximation to the Log -Norm + metric [17], performs second best (with an RMSE of 0:48, see Table1). + 7 That is, these test accuracies have been previously reported and made publicly-available by others. 
We take + them as given, and we do not attempt to reproduce/verify them, since we do not permit ourselves any access to + training/test data. + + <
> + + Figure 1: Comparison of Average Log Norm and Weighted Alpha quality metrics versus reported + test accuracy for pretrained VGG models (with and without BN), trained on ImageNet, available + in pyTorch (v1.4). Metrics fit by linear regression, RMSE reported. + + + See Table1for a summary of results for Top1 accuracies for all four metrics for the VGG, + ResNet, and DenseNet series. Similar results (not shown) are obtained for the Top5 accuracies. + Overall, for the the ResNet, ResNet-1K, and DenseNet series, all metrics perform relatively well, + the Log-Norm metric performs second best, and the Weighted Alpha metric performs best. + These model series are all well-trodden, and our results indicate that norm-based metrics and + PL-based metrics can both distinguish among a series of \good-better-best" models, with PL- + based metrics performing somewhat (i.e., quantitatively) better. + The DenseNet series has similar behavior to what we see in Figures1and2for the other + models. However, as noted in Table1, it has only 4 data points. In our larger analysis, in + Section6, we will only include series with 5 or more models. (Note that these and many other + such plots can be seen on our publicly-available repo.) + + Variation in Data Set Size. We are interested in how our four quality metrics depend on + data set size. To examine this, we look at results on ResNet versus ResNet-1K. See Figure2, + which plots and compares the Log-Norm metric for the full ResNet model, trained on the + full ImageNet dataset, against the ResNet-1K model, which has been trained on a much smaller + ImageNet-1K data set. The Log-Norm is much better than the Log Frobenius/Spectral norm + metrics (although, as Table1shows, it is actually slightly worse than the Weighted Alpha metric). + The ResNet series has strong correlation, with an RMSE of 0:66, whereas the ResNet-1K series + + <
> + + Table 1: RMSE (smaller is better) for linear fits of quality metrics to reported Top1 test error + for pretrained models in each architecture series. Column # refers to number of models. VGG, + ResNet, and DenseNet were pretrained on ImageNet, and ResNet-1K was pretrained on ImageNet- + 1K. + + also shows good correlation, but has a much larger RMSE of 1:9. (Other metrics exhibit similar + behavior.) As expected, the higher quality data set shows a better fit, even with fewer data points. + + Layer Analysis: Metrics as a Function of Depth. We can learn much more about a + pretrained model by going beyond average values of quality metrics to examining quality metrics + for each layer weight matrix,W, as a function of depth (or layer id). For example, we can + plot (just) the PL exponent, , for each layer, as a function of depth. See Figure3, which + plots for each layer (the first layer corresponds to data, the last layer to labels) for the least + accurate (shallowest) and most accurate (deepest) model in each of the VGG (no BN), ResNet, + and DenseNet series. (Again, a much more detailed set of plots is available at our repo; but note + that the corresponding layer-wise plots for Frobenius and Spectral norms are much less interesting + than the results we present here.) + In the VGG models, Figure3(a)shows that the PL exponent systematically increases as + we move down the network, from data to labels, in the Conv2D layers, starting with <> and + reaching all the way to <> and then, in the last three, large, fully-connected (FC) layers, + stabilizes back down to <>. This is seen for all the VGG models (again, only the shallowest + and deepest are shown in this figure), indicating that the main effect of increasing depth is to + increase the range over which increases, thus leading to larger values in later Conv2D layers + of the VGG models. This is quite different than the behavior of either the ResNet-1K models or + the DenseNet models. + For the ResNet-1K models, Figure 3 (b) shows that also increases in the last few layers + (more dramatically, in fact, than for VGG, observe the differing scales on the Y axes). However, + + <
>

Figure 3: PL exponent (α) versus layer id, for the least and the most accurate models in the VGG (a), ResNet (b), and DenseNet (c) series. (VGG is without BN; and note that the Y axes on each plot are different.) Subfigure (d) displays the ResNet models from (b), zoomed in to α ∈ [1, 5], and with the layer ids overlaid on the X-axis, from smallest to largest, to allow a more detailed analysis of the most strongly correlated layers. Notice that ResNet152 exhibits different and much more stable behavior of α across layers. This contrasts with how both VGG models gradually worsen in deeper layers and how the DenseNet models are much more erratic. In the text, this is interpreted in terms of Correlation Flow.

as the ResNet-1K models get deeper, there is a wide range over which α values tend to remain quite small. This is seen for other models in the ResNet-1K series, but it is most pronounced for the larger ResNet-1K (152) model, where α remains relatively stable at <>, from the earliest layers all the way until we reach close to the final layers.
For the DenseNet models, Figure 3 (c) shows that α tends to increase as the layer id increases, in particular for layers toward the end. While this is similar to what is seen in the VGG models, with the DenseNet models, α values increase almost immediately after the first few layers, and the variance is much larger (in particular for the earlier and middle layers, where it can range all the way to <>) and much less systematic throughout the network.

Comparison of VGG, ResNet, and DenseNet Architectures. We can interpret these observations by recalling the architectural differences between the VGG, ResNet, and DenseNet architectures and, in particular, the number of residual connections. VGG resembles the traditional convolutional architectures, such as LeNet5, and consists of several [Conv2D-Maxpool-ReLu] blocks, followed by 3 large Fully Connected (FC) layers. ResNet greatly improved on VGG by replacing the large FC layers, shrinking the Conv2D blocks, and introducing residual connections. This optimized approach allows for greater accuracy with far fewer parameters (and GPU memory requirements), and ResNet models of up to 1000 layers have been trained [21].
We conjecture that the efficiency and effectiveness of ResNet is reflected in the smaller and more stable <>, across nearly all layers, indicating that the inner layers are very well correlated and strongly optimized. Contrast this with the DenseNet models, which contain many connections between every layer. Our results (large α, meaning that even a PL model is probably a poor fit) suggest that DenseNet has too many connections, diluting high quality interactions across layers, and leaving many layers very poorly optimized.

<
>

Figure 4: ResNet20, distilled with Group Regularization, as implemented in the distiller (4D regularized 5L removed) pretrained models. Log Spectral Norm (<>) and PL exponent (<>) for individual layers, versus layer id, for both baseline (before distillation, green) and fine-tuned (after distillation, red) pretrained models.

Correlation Flow. More generally, we can understand the results presented in Figure 3 in terms of what we will call the Correlation Flow of the model. Recall that the average Log α-Norm metric and the Weighted Alpha metric are based on HT-SR Theory [6,7,9], which is in turn based on ideas from the statistical mechanics of heavy tailed and strongly correlated systems [10,11,12,13]. There, one expects that the weight matrices of well-trained DNNs will exhibit correlations over many size scales. Their ESDs can be well-fit by a (truncated) PL, with exponents <>. Much larger values (<>) may reflect poorer PL fits, whereas smaller values (<>) are associated with models that generalize better. Informally, one would expect a DNN model to perform well when it facilitates the propagation of information/features across layers. Previous work argues this by computing the gradients over the input data. In the absence of training/test data, one might hope that this leaves empirical signatures on weight matrices, and thus we can try to quantify this by measuring the PL properties of weight matrices. In this case, smaller α values correspond to layers in which correlations across multiple scales are better captured [6,11], and we expect that small α values that are stable across multiple layers enable better correlation flow through the network. We have seen this in many models, including those shown in Figure 3.

Scale Collapse; or How Distillation May Break Models. The similarity between norm-based metrics and PL-based metrics suggests a question: is the Weighted Alpha metric just a variation of the more familiar norm-based metrics? More generally, do fitted α values contain information not captured by norms? In examining hundreds of pretrained models, we have found several anomalies that demonstrate the power of our approach. In particular, to show that α does capture something different, consider the following example, which looks at a compressed/distilled DNN model [22]. In this example, we show that some distillation methods may actually break models unexpectedly by introducing what we call Scale Collapse, where several distilled layers have unexpectedly small Spectral Norms.
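A simple way to screen for this effect, before looking at any one example in detail, is to compare per-layer log Spectral Norms between two variants of the same architecture and flag layers whose scale changes sharply. The sketch below reuses the layer_matrices helper from the Section 3 sketch; both helpers are our own illustrations, not part of the distiller or WeightWatcher APIs, and the threshold is an arbitrary choice.

import numpy as np

def flag_scale_anomalies(model_a, model_b, threshold=1.0):
    # Flag layers whose log10 Spectral Norm (log10 lambda_max of X = W^T W) differs by more
    # than `threshold` orders of magnitude between two variants of the same architecture;
    # unexpectedly small values in one variant are candidate Scale Collapse layers.
    def log_spectral_norms(model):
        return {name: np.log10(np.linalg.svd(W.cpu().numpy(), compute_uv=False)[0] ** 2)
                for name, W in layer_matrices(model)}
    a, b = log_spectral_norms(model_a), log_spectral_norms(model_b)
    return [name for name in a if name in b and abs(a[name] - b[name]) > threshold]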
+ We consider ResNet20, trained on CIFAR10, before and after applying the Group Regularization + distillation technique, as implemented in the distiller package [23]. We analyze the + pretrained 4D regularized 5L removed baseline and fine-tuned models. The reported baseline test + accuracies (Top1= 91:45 and Top5= 99:75) are better than the reported fine-tuned test accuracies + (Top1= 91:02 and Top5= 99:67). Because the baseline accuracy is greater, the previous results + on ResNet (Table1and Figure2) suggest that the baseline Spectral Norms should be smaller on + average than the fine-tuned ones.The opposite is observed.Figure4presents the Spectral Norm + (here denoted <> ) and PL exponent () for each individual layer weight matrixW.8 On + the other hand, the values (in Figure 4 (b)) do not differ systematically between the baseline + and fine-tuned models. Also (not shown), the average (unweighted) baseline is smaller than + the fine-tuned average (as predicted by HT-SR Theory, the basis of <>). + That being said, Figure4(b)also depicts two very large 6 values for the baseline, + but not for the fine-tuned, model. This suggests the baseline model has at least two over- + parameterized/under-trained layers, and that the distillation method does, in fact, improve the + fine-tuned model by compressing these layers. + The pretrained models in the distiller package have passed some quality metric, but they + are much less well trodden than any of the VGG, ResNet, or DenseNet series. While norms + make good regularizers for a single model, there is no reason a priori to expect them correlate + so well with test accuracies across different models. We do expect, however, the PL fit o do so + because it effectively measures the amount of correlation in the model [6,7,9]. The reason for the + anomalous behavior shown in Figure4is that the distiller Group Regularization technique + causes the norms of the W pre-activation maps for two Conv2D layers to increase spuriously. + This is difficult to diagnose by analyzing training/test curves, but it is easy to diagnose with + our approach. + + 5 Comparison of NLP Models + + In this section, we examine empirical quality metrics described in Section3for several NLP + model architectures. Within the past two years, nearly 100 open source, pretrained NLP DNNs + based on the revolutionary Transformer architecture have emerged. These include variants of + BERT, Transformer-XML, GPT, etc. The Transformer architectures consist of blocks of so-called + Attention layers, containing two large, Feed Forward (Linear) weight matrices [24]. In contrast to + smaller pre-Activation maps arising in Cond2D layers, Attention matrices are significantly larger. + In general, we have found that they have larger PL exponents . Based on HT-SR Theory (in + particular, the interpretation of values of 2 as modeling systems with good correlations over + many size scales [10,11]), this suggests that these models fail to capture successfully many of the + correlations in the data (relative to their size) and thus are substantially under-trained. More + generally, compared to the CV models of Section4, modern NLP models have larger weight + matrices and display different spectral properties. Thus, they provide a very different test for our + empirical quality metrics. + While norm-based metrics perform reasonably well on well-trained NLP models, they often + behave anomalously on poorly-trained models. 
Indeed, for such \bad" models, weight matrices + may display rank collapse, decreased Frobenius mass, or unusually small Spectral norms. (This + may be misinterpreted as \smaller is better.") In contrast, PL-based metrics, including the Log + -Norm metric (<>) and the Weighted Alpha metric (<>) display consistent + behavior, even on poorly trained models. Indeed, we can use these metrics to help identify when + architectures need repair and when more and/or better data are needed. + + What do large values of mean? Many NLP models, such as GPT and BERT, have some + weight matrices with unusually large PL exponents (e.g.,6). This indicates these matrices + may be under-correlated (i.e., over-parameterized, relative to the amount of data). In this regime, + the truncated PL fit itself may not be very reliable because the MLE estimator it uses is unreliable + in this range (i.e., the specific values returned by the truncated PL fits are less reliable, but + having large versus small values of is reliable). Phenomenologically, if we examine the ESD + visually, we can usually describe theseWas in the Bulk-Decayor Bulk-plus-Spikes phase [6,7]. + Previous work [6,7] has conjectured that very well-trained DNNs would not have many outlier + 6; and improved versions of GPT (shown below) and BERT (not shown) confirm this. + + OpenAI GPT Models. The OpenAI GPT and GPT2 models provide us with the opportunity + to analyze two effects: training the same model with different data set sizes; and increasing + the sizes of both the data set and the architectures simultaneously. These models have the + remarkable ability to generate fake text that appears to the human to be real, and they have + generated significant media attention because of the potential for their misuse. For this reason, + the original GPT model released by OpenAI was trained on a deficient data set, rendering the + model interesting but not fully functional. Later, OpenAI released a much improved model, + GPT2-small, which has the same architecture and number of layers as GPT, but which has been + trained on a larger and better data set (and with other changes), making it remarkably good at + generating (near) human-quality fake text. By comparing the poorly-trained (i.e., \bad") GPT to + the well-trained (i.e., \good") GPT2-small, we can identify empirical indicators for when a model + has in fact been poorly-trained and thus may perform poorly when deployed. By comparing + GPT2-medium to GPT2-large to GPT2-xl, we can examine the effect of increasing data set and + model size simultaneously, an example of what we call a series of \good-better-best" models. + The GPT models we analyze are deployed with the popular HuggingFace PyTorch library [25]. + GPT has 12 layers, with 4 Multi-head Attention Blocks, giving 48 layer Weight Matrices,W. + Each Block has 2 components, the Self Attention (attn) and the Projection (proj) matrices. The + self-attention matrices are larger, of dimension (2304x768) or (3072x768). The projection + layer concatenates the self-attention results into a vector (of dimension 768). This gives 50 + large matrices. Because GPT and GPT2 are trained on different data sets, the initial Embedding + matrices differ in shape. GPT has an initial Token and Positional Embedding layers, of dimension + (40478x768) and (512x768), respectively, whereas GPT2 has input Embeddings of shape + (50257x768) and (1024x768), respectively. 
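For reference, these shapes can be checked directly from the public HuggingFace checkpoints. The sketch below (our own, using the transformers package) enumerates the 2D weight matrices of GPT2-small and separates out the embedding matrices, which are treated differently in the averages reported below (see also Appendix A).

from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")   # GPT2-small; "gpt2-medium", "gpt2-large", "gpt2-xl" are analogous

embeddings, layer_mats = [], []
for name, param in model.named_parameters():
    if param.ndim != 2:                     # skip biases, LayerNorm weights, etc.
        continue
    if name.startswith(("wte", "wpe")):     # token and positional embeddings (50257x768 and 1024x768)
        embeddings.append((name, tuple(param.shape)))
    else:                                   # Attention and Feed Forward weight matrices
        layer_mats.append((name, tuple(param.shape)))

print(len(layer_mats), "layer matrices and", len(embeddings), "embedding matrices")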
The OpenAI GPT2 (English) models are: GPT2-small, GPT2-medium, GPT2-large, and GPT2-xl, having 12, 24, 36, and 48 layers, respectively, with increasingly larger weight matrices.

Average Quality Metrics for GPT and GPT2. We have analyzed the four quality metrics described in Section 3 for the OpenAI GPT and GPT2 pretrained models. See Table 2 for a summary of results. We start by examining trends between GPT and GPT2-small. Observe that all four metrics increase when going from GPT to GPT2-small, i.e., they are smaller for the higher-quality model (higher quality since GPT was trained to better data), when the number of layers is held fixed. Notice that in the GPT model, being poorly trained, the norm metrics all exhibit Scale Collapse, compared to GPT2-small.

<
> + + Table 2: Average value for the average Log Norm and Weighted Alpha metrics for pretrained + OpenAI GPT and GPT2 models. Column # refers to number of layers treated. Note that the + averages do not include the first embedding layer(s) because they are not (implicitly) normalized. + + + We next examine trends between GPT2-medium to GPT2-large to GPT2-xl. Observe that + (with one minor exception involving the log Frobenius norm metric) all four metrics decrease as + one goes from medium to large to xl, indicating that the larger models indeed look better than + the smaller models. Notice that, for these well-trained models, the norm metrics now behave as + expected, decreasing with increasing accuracy. + Going beyond average values, Figure5(a)shows the histogram (empirical density), for all + layers, of for GPT and GPT2-small. These two histograms are very different. The older + deficient GPT has numerous unusually large exponents meaning they are not really well- + described by a PL fit. Indeed, we expect that a poorly-trained model will lack good (i.e., small) + PL behavior in many/most layers. On the other hand, as expected, the newer improved GPT2- + small model has, on average, smaller values than the older GPT, with all 6 and with + smaller mean/median. It also has far fewer unusually-large outlying values than GPT. From + this (and other results not shown), we see that provides a good quality metric for comparing + these two models, the \bad" GPT versus the \good" GPT2-small. This should be contrasted + with the behavior displayed by the Frobenius norm (not shown) and the Spectral norm. + + Scale Collapse in Poorly Trained Models. We next describe the behavior of the Spectral + norm in GPT versus GPT2-small. In Figure5(b), the \bad" GPT model has a smaller + mean/median Spectral norm as well as, spuriously, many much smaller Spectral norms, com- + pared to the \good" GPT2-small, violating the conventional wisdom that smaller Spectral norms + are better. Indeed, because there are so many anonymously small Spectral norms, it appears that + the GPT model may be exhibiting a kind ofScale Collapse, like that observed in the distilled + CV models (in Figure4). This is important because it demonstrates that, while the Spectral + (or Frobenius) norm may correlate well with predicted test error, it is not a good indicator of + the overall model quality. It can mispredict good-versus-bad questions in ways not seen with + PL-based metrics. Using it as an empirical quality metric may give spurious results when applied + to poorly-trained or otherwise deficient models. + (Note that Figure5(b)also shows some unusually large Spectral Norms. Upon examination, + e.g., from Figure6(b)(below), we see that these correspond to the first embedding layer(s). + These layers have a different effective normalization, and therefore a different scale. We discuss + this further in AppendixA. Here, we do not include them in our computed average metrics in + Table2, and we do not include them in the histogram plot in Figure5(b).) + + Layer Analysis: Correlation Flow and Scale Collapse in GPT and GPT2. We also + examine in Figure 6 the PL exponent and Log Spectral Norm versus layer id, for GPT and + GPT2-small. Let’s start with Figure6(a), which plots versus the depth (i.e., layer id) for + each model. The deficient GPT model displays two trends in , one stable with 4, and one + + <
> + + Figure 5: Histogram of PL exponents (<>) and Log Spectral Norms (<>) for weight matrices + from the OpenAI GPT and GPT2-small pretrained models. + + increasing with layer id, with reaching as high as 12. In contrast, the well-trained GPT2-small + model shows consistent and stable patterns, again with one stable <> (and below the GPT + trend), and the other only slightly trending up, with 6. The scale-invariant metric lets us + identify potentially poorly-trained models. These results show that the Correlation Flow differs + significantly between GPT and GPT2-small (with the better GPT2-small looking more like the + better ResNet-1K from Figure3(b)). + These results should be contrasted with the corresponding results for Spectral Norms, shown + in Figure6(b). Attention models have two types of layers, one small and large; and the Spectral + Norm, in particular, displays unusually small values for some of these layers for GPT. This Scale + Collapse for the poorly-trained GPT is similar to what we observed for the distilled ResNet20 + model in Figure4(b). Because of the anomalous scale collapse that is frequently observed in + poorly-trained models, these results suggest that scale-dependent norm metrics should not be + directly applied to distinguish good-versus-bad models. + + <
> + + Figure 6: PL exponents (<>) (in (a)) and Log Spectral Norms (<>) (in (b)) for weight + matrices from the OpenAI GPT and GPT2-small pretrained models. (Note that the quantities + being shown on each Y axis are different.) In the text, this is interpreted in terms ofCorrelation + Flow and Scale Collapse. + + + GPT2: medium, large, xl. We now look across series of increasingly improving GPT2 models + (i.e., we consider good-better-best questions), by examining both the PL exponent as well as + the Log Norm metrics. In general, as we move from GPT2-medium to GPT2-xl, histograms + for both exponents and the Log Norm metrics downshift from larger to smaller values. For + example, see Figure7, which shows the histograms over the layer weight matrices for fitted PL + exponent (<>) and the Log Alpha Norm (<>) metric. We see that the average decreases + with increasing model size, although the differences + are less noticeable between the differing good-better-best GTP2 models than between the good- + versus-bad GPT and GPT2-small models. Unlike GPT, however, the layer Log Alpha Norms + behave more as expected for GPT2 layers, with the larger models consistently having smaller + norms. Similarly, the Log Spectral Norm also decreases on average with the larger models (not + shown). As expected, the norm metrics can indeed distinguish among good-better-best models + among a series well-trained models. + We do notice, however, that while the peaks of the are getting smaller, towards 2:0, the tails + of the distribution shifts right, with larger GPT2 models having more usually large (also not + shown). We suspect this indicates that these larger GPT2 models are still under-optimized/over- + parameterized (relative to the data on which they were trained) and that they have capacity to + support datasets even larger than the recent XL 1.5B release [26]. + + <
> + + Figure 7: Histogram of PL exponents (<>) and Log Alpha Norm (<>) for weight matrices + from models of different sizes in the GPT2 architecture series. (Plots omit the first 2 (embedding) + layers, because they are normalized differently giving anomalously large values.) + + + 6 Comparing Hundreds of CV Models + + In this section, we summarize results from a large-scale analysis of hundreds of CV models, + including models developed for image classification, segmentation, and a range of related tasks. Our + aim is to complement the detailed results from Sections4and5by providing broader conclusions. + The models we consider have been pretrained on nine datasets. We provide full details about + how to reproduce these results in AppendixA. + We choose ordinary least squares (OLS) regression to quantify the relationship between quality + metrics (computed with the Weight Watcher tool ) and the reported test error and/or accuracy + metrics. We regress the metrics on the Top1 (and Top5) reported errors (as dependent variables). + These include Top5 errors for the ImageNet-1K model, percent error for the CIFAR-10/100, + SVHN, CUB-200-2011 models, and Pixel accuracy (Pix.Acc.) and Intersection-Over-Union (IOU) + for other models. We regress them individually on each of the norm-based and PL-based metrics, + as described in Section4. + Our results are summarized in Table3. For the mean, largerR2 and smaller MSE are + desirable; and for the standard deviation, smaller values are desirable. Taken as a whole, over the + entire corpus of data, PL-based metrics are somewhat better for both theR2 mean and standard + deviation; and PL-based metrics are much better for MSE mean and standard deviation. These + + <
> + + Table 3: Comparison of linear regression fits for different average Log Norm and Weighted Alpha + metrics across 5 CV datasets, 17 architectures, covering 108 (out of over 400) different pretrained + DNNs. We include regressions only for architectures with five or more data points, and which are + positively correlated with test error. These results can be readily reproduced using the Google + Colab notebooks (see AppendixA). + + + (and other) results suggest our conclusions from Sections4and5hold much more generally, and + they suggest obvious questions for future work. + + 7 Conclusion + + We have developed (based on strong theory) and evaluated (on a large corpus of publicly-available + pretrained models from CV and NLP) methods to predict trends in the quality of state-of-the-art + neural networks|without access to training or testing data. Prior to our work, it was not obvious + that norm-based metrics would perform well to predict trends in quality across models (as they + are usually used within a given model or parameterized model class, e.g., to bound generalization + error or to construct regularizers). Our results are the first to demonstrate that they can be used + for this important practical problem. That PL-based metrics perform better (than norm-based + metrics) should not be surprising|at least to those familiar with the statistical mechanics of + heavy tailed and strongly correlated systems [10,11,12,13] (since our use of PL exponents is + designed to capture the idea that well-trained models capture correlations over many size scales + in the data). Again, though, our results are the first to demonstrate this. It is also gratifying + that our approach can be used to provide fine-scale insight (such as rationalizing the flow of + correlations or the collapse of size scale) throughout a network. + We conclude with a few comments on what a practical theory of DNNs should look like. To do + so, we distinguish between two types of theories:non-empirical or analogical theories, in which one + creates, often from general principles, a very simple toy model that can be analyzed rigorously, + and one then argues that the model is relevant to the system of interest; and semi-empirical + theories, in which there exists a rigorous asymptotic theory, which comes with parameters, for + the system of interest, and one then adjusts or fits those parameters to the finite non-asymptotic + data. A drawback of the former approach is that it typically makes very strong assumptions + on the data, and the strength of those assumptions can limit the practical applicability of the + theory. Nearly all of the work on the theory of DNNs focuses on the former type of theory. Our + approach focuses on the latter type of theory. Our results, which are based on using sophisticated + statistical mechanics theory and solving important practical DNN problems, suggests that the + latter approach should be of interest more generally for those interested in developing a practical + DNN theory. + + Acknowledgements. MWM would like to acknowledge ARO, DARPA, NSF, and ONR as well + as the UC Berkeley BDD project and a gift from Intel for providing partial support of this work. + We would also like to thank Amir Khosrowshahi and colleagues at Intel for helpful discussion + regarding the Group Regularization distillation technique. + + + References + + [1]WeightWatcher, 2018.https://pypi.org/project/WeightWatcher/. + [2]https://github.com/CalculatedContent/ww-trends-2020. + [3]A. Engel and C. P. L. 
Van den Broeck.Statistical mechanics of learning. Cambridge University Press, + New York, NY, USA, 2001. + [4]C. H. Martin and M. W. Mahoney. Rethinking generalization requires revisiting old ideas: statistical + mechanics approaches and complex learning behavior. Technical Report Preprint:arXiv:1710.09553, + 2017. + [5]Y. Bahri, J. Kadmon, J. Pennington, S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli. Statistical + mechanics of deep learning.Annual Review of Condensed Matter Physics, pages 000{000, 2020. + [6]C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from + random matrix theory and implications for learning. Technical Report Preprint:arXiv:1810.01075, + 2018. + [7]C. H. Martin and M. W. Mahoney. Traditional and heavy-tailed self regularization in neural network + models. InProceedings of the 36th International Conference on Machine Learning, pages 4284{4293, + 2019. + [8]C. H. Martin and M. W. Mahoney. Statistical mechanics methods for discovering knowledge from + modern production quality neural networks. InProceedings of the 25th Annual ACM SIGKDD Con- + ference, pages 3239{3240, 2019. + [9]C. H. Martin and M. W. Mahoney. Heavy-tailed Universality predicts trends in test accuracies for very + large pre-trained deep neural networks. InProceedings of the 20th SIAM International Conference on + Data Mining, 2020. + [10]J. P. Bouchaud and M. Potters.Theory of Financial Risk and Derivative Pricing: From Statistical + Physics to Risk Management. Cambridge University Press, 2003. + [11]D. Sornette.Critical phenomena in natural sciences: chaos, fractals, selforganization and disorder: + concepts and tools. Springer-Verlag, Berlin, 2006. + [12]J. P. Bouchaud and M. Potters. Financial applications of random matrix theory: a short review. In + G. Akemann, J. Baik, and P. Di Francesco, editors,The Oxford Handbook of Random Matrix Theory. + Oxford University Press, 2011. + [13]J. Bun, J.-P. Bouchaud, and M. Potters. Cleaning large correlation matrices: tools from random + matrix theory.Physics Reports, 666:1{109, 2017. + [14]B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In + Proceedings of the 28th Annual Conference on Learning Theory, pages 1376{1401, 2015. + [15]P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. + Technical Report Preprint:arXiv:1706.08498, 2017. + [16]Q. Liao, B. Miranda, A. Banburski, J. Hidary, and T. Poggio. A surprising linear relationship predicts + test performance in deep networks. Technical Report Preprint:arXiv:1807.09659, 2018. + [17]C. H. Martin and M. W. Mahoney. Unpublished results, 2020. + [18]O. Russakovsky et al. Imagenet large scale visual recognition challenge. International Journal of + Computer Vision, 115(3):211{252, 2015. + [19]A. Paszke et al. Pytorch: An imperative style, high-performance deep learning library. InAnnual + Advances in Neural Information Processing Systems 32: Proceedings of the 2019 Conference, pages + 8024{8035, 2019. + [20]Sandbox for training convolutional networks for computer vision. https://github.com/osmr/ + imgclsmob. + [21]K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. Technical Report + Preprint:arXiv:1603.05027, 2016. + [22]Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for + deep neural networks. Technical Report Preprint:arXiv:1710.09282, 2017. 
+ [23]Intel Distiller package.https://nervanasystems.github.io/distiller. + [24]A. Vaswani et al. Attention is all you need. Technical Report Preprint:arXiv:1706.03762, 2017. + [25]T. Wolf et al. Huggingface’s transformers: State-of-the-art natural language processing. Technical + Report Preprint:arXiv:1910.03771, 2019. + [26]OpenAI GPT-2: 1.5B Release.https://openai.com/blog/gpt-2-1-5b-release/. + [27]H. Sedghi, V. Gupta, and P. M. Long. The singular values of convolutional layers. Technical Report + Preprint:arXiv:1805.10408, 2018. + [28]X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. + InProceedings of the 13th International Workshop on artificial Intelligence and Statistics, pages + 249{256, 2010. + + + A Appendix + + In this appendix, we provide more details on several issues that are important for the reproducibility + of our results. All of our computations were performed with the Weight Watcher tool (version + 0.2.7) [1]. More details and more results are available in an accompanying github repository [2]. + + A.1 Reproducibility Considerations + SVD of Convolutional 2D Layers. There is some ambiguity in performing spectral analysis + on Conv2D layers. Each layer is a 4-index tensor of dimension (<>), with an (wxh) + filter (or kernel) and (in;out) channels. When w=h=k, it gives (kxk) tensor slices, or + pre-Activation Maps Wi;L of dimension (in/out) each. We identify 3 different approaches for + running SVD on a Conv2D layer: + + 1.run SVD on each pre-Activation MapWi;L , yielding (kxk) sets of M singular values; + + 2.stack the maps into a single matrix of, say, dimension (<>), and run SVD to + get in singular values; + + 3.compute the 2D Fourier Transform (FFT) for each of the (in;out) pairs, and run SVD on + the Fourier coefficients [27], leading to <> non-zero singular values. + + Each method has tradeoffs. Method (3) is mathematically sound, but computationally expensive. + Method (2) is ambiguous. For our analysis, because we need thousands of runs, we select method + (1), which is the fastest (and is easiest to reproduce). + + Normalization of Empirical Matrices. Normalization is an important, if underappreciated, + practical issue. Importantly, the normalization of weight matrices does not affect the PL fits + because is scale-invariant. Norm-based metrics, however, do depend strongly on the scale of the + weight matrix|that is the point. To apply RMT, we usually define X with a <> normalization, + assuming variance of <>. Pretrained DNNs are typically initialized with random weight + matrices <>, with <>, or some variant, e.g., the Glorot/Xavier normalization [28], + or <> normalization for Convolutional 2D Layers. With this implicit scale, we do not + renormalize the empirical weight matrices, i.e., we use them as-is. The only exception is that + + <> + + Table 4: Jupyter notebooks used to reproduce all results in Sections 4 and 5. + pwe do rescale the Conv2D pre-activation mapsWi;L byk= 2 so that they are on the same scale + as the Linear / Fully Connected (FC) layers. + + Special consideration for NLP models. NLP models, and other models with large initial p + embeddings, require special care because the embedding layers frequently lack the implicit <> + normalization present in other layers. For example, in GPT, for most layers, the maximum + eigenvalue <>, but in the first embedding layer, the maximum eigenvalue is of + orderN(the number of words in the embedding), or <>). 
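One way to guard against this in practice is to flag layers whose maximum eigenvalue is wildly out of scale with the rest of the network before averaging. The helper below is our own illustration of that filtering step (it is not part of the WeightWatcher API, and the factor of 100 is an arbitrary choice).

import numpy as np

def flag_unnormalized_layers(named_matrices, factor=100.0):
    # named_matrices: iterable of (name, 2D numpy array). Flags layers whose lambda_max
    # exceeds `factor` times the median lambda_max across layers -- typically the large,
    # implicitly un-normalized embedding layers, which are excluded from averaged metrics.
    lam = {name: np.linalg.svd(np.asarray(W, dtype=np.float64), compute_uv=False)[0] ** 2
           for name, W in named_matrices}
    median = np.median(list(lam.values()))
    return [name for name, value in lam.items() if value > factor * median]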
For GPT and GPT2, we
 + treat all layers as-is (although one may want to normalize the first 2 layers X by <>, or to treat
 + them as outliers).
 + 
 + A.2 Reproducing Sections 4 and 5
 + 
 + We provide a github repository for this paper that includes Jupyter notebooks that fully reproduce
 + all results (as well as many other results) [2]. All results have been produced using the
 + WeightWatcher tool (v0.2.7) [1]. The ImageNet and OpenAI GPT pretrained models are provided in the
 + current PyTorch [19] and Huggingface [25] distributions, as specified in the requirements.txt file.
 + 
 + A.3 Reproducing Figure 4, for the Distiller Model
 + 
 + In the distiller folder of our github repo, we provide the original Jupyter Notebooks, which use
 + the Intel distiller framework [23]. Figure 4 is from the "...-Distiller-ResNet20.ipynb"
 + notebook (see Table 4). For completeness, we provide both the results described here, as well as
 + additional results on other pretrained and distilled models using the WeightWatcher tool.
 + 
 + A.4 Reproducing Table 3 in Section 6
 + 
 + In the ww-colab folder of our github repo, we provide several Google Colab notebooks which can
 + be used to reproduce the results of Section 6. The ImageNet-1K and other pretrained models are
 + taken from the PyTorch models in the osmr/imgclsmob "Sandbox for training convolutional
 + networks for computer vision" github repository [20]. The data for each regression can be generated
 + in parallel by running each Google Colab notebook (i.e., wwcolab0100.ipynb) simultaneously
 + on the same account. The data generated are analyzed with ww colabresults.ipynb, which
 + runs all regressions and which tabulates the results presented in Table 3.
 + We attempt to run linear regressions for all PyTorch models for each architecture series for
 + all datasets provided. There are over 450 models in all, and we note that the osmr/imgclsmob
 + repository is constantly being updated with new models.
 + 
 + <
> + + Table 5: Datasets used + + <
>
 + 
 + Table 6: Architectures used
 + 
 + We omit the results for the CUB-200-2011, Pascal-VOC2012, ADE20K, and COCO datasets, as there are
 + fewer than 15 models for those datasets. Also, we filter out regressions with fewer than 5 datapoints.
 + We remove the following outliers, as identified by visual inspection: efficientb0, b2. We
 + also remove the entire cifar100 ResNeXT series, which is the only example to show no trends
 + with the norm metrics. The final datasets used are shown in Table 5. The final architecture series
 + used are shown in Table 6, with the number of models in each.
 + To explain further how to reproduce our analysis, we run three batches of linear regressions.
 + First, at the global level, we divide models by datasets and run regressions separately on all
 + models of a certain dataset, regardless of the architecture. At this level, the plots are quite
 + noisy and clustered, as each architecture has its own accuracy trend; but one can still see that
 + most plots show a positive relationship with positive coefficients. Example regressions are shown
 + in Figure 8, as available in the results notebook.
 + To generate the results in Table 3, we run linear regressions for each architecture series in
 + Table 6, regressing each empirical Log Norm metric against the reported Top1 (and Top5) errors
 + (as listed on the osmr/imgclsmob github repository README file [20], with the relevant data
 + extracted and provided in our github repo as pytorchcv.html). We record the R2 and MSE
 + for each metric, averaged over all regressions for all architectures and datasets. See Table 7 and
 + Table 8. In the repo, plots are provided for every regression, and more fine-grained results may
 + be computed by the reader by analyzing the data in the df all.xlsx file. The final analysis
 + includes 108 regressions in all, those with 4 or more models, with a positive R2.
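 + As an illustration of one such per-architecture-series regression, the short sketch below (not the repository's
 + own code) fits reported Top-1 error against a single log norm metric and records the R2 and MSE per series;
 + the dataframe column names here (series, log_norm, top1_error) are placeholders chosen for the example.
 + 
 + import numpy as np
 + import pandas as pd
 + from sklearn.linear_model import LinearRegression
 + from sklearn.metrics import mean_squared_error, r2_score
 + 
 + def regress_series(df):
 +     """Run one linear regression per architecture series; return R2 and MSE for each."""
 +     results = {}
 +     for series, group in df.groupby("series"):
 +         if len(group) < 5:                       # filter out regressions with fewer than 5 datapoints
 +             continue
 +         X = group[["log_norm"]].to_numpy()
 +         y = group["top1_error"].to_numpy()
 +         fit = LinearRegression().fit(X, y)
 +         pred = fit.predict(X)
 +         results[series] = {"R2": r2_score(y, pred), "MSE": mean_squared_error(y, pred)}
 +     return pd.DataFrame(results).T
 + 
 + if __name__ == "__main__":
 +     rng = np.random.default_rng(0)
 +     toy = pd.DataFrame({
 +         "series": ["resnet"] * 8 + ["vgg"] * 8,
 +         "log_norm": np.concatenate([np.linspace(1, 2, 8), np.linspace(1, 2, 8)]),
 +     })
 +     toy["top1_error"] = 20 + 5 * toy["log_norm"] + rng.normal(0, 0.2, len(toy))
 +     print(regress_series(toy))
 + 
 + Averaging the R2 and MSE of such fits over architectures and datasets gives the kind of summary reported
 + in Tables 7 and 8.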
<>
 + 
 + Table 7: MSE results for all CV model regressions.
 + 
 + <
> + + Table 8: R2 Results for all CV model regressions. + + <
> + + Figure 8: PL exponentfiversus reported Top1 Test Accuracies for pretrained DNNs available + for five different data sets. +<> <> <> + + +<> <> <> + Pruning neural networks without any data by iteratively conserving synaptic flow + + Hidenori Tanaka Daniel Kunin + Physics & Informatics Laboratories Institute for Computational and + NTT Reserach, Inc. Mathematical Engineering + Department of Applied Physics Stanford University + + Stanford University + + Daniel L. K. Yamins Surya Ganguli + Department of Psychology Department of Applied Physics + Department of Computer Science Stanford University + Stanford University + + Abstract + + Pruning the parameters of deep neural networks has generated intense interest due to potential + savings in time, memory and energy both during training and at test time. Recent works + have identified, through an expensive sequence of training and pruning cycles, the existence + of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a + foundational question: can we identify highly sparse trainable subnetworks at initialization, + without ever training, or indeed without ever looking at the data? We provide an affirmative + answer to this question through theory driven algorithm design. We first mathematically + formulate and experimentally verify a conservation law that explains why existing gradient- + based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of + an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse + can be entirely avoided, motivating a novel pruning algorithmIterative Synaptic Flow Pruning + (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths + through the network at initialization subject to a sparsity constraint. Notably, this algorithm + makes no reference to the training data and consistently outperforms existing state-of-the-art + pruning algorithms at initialization over a range of models (VGG and ResNet), datasets + (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to99:9percent). Thus our + data-agnostic pruning algorithm challenges the existing paradigm that data must be used to + quantify which synapses are important. + + + 1 Introduction + + Network pruning, or the compression of neural networks by removing parameters, has been an important subject + both for reasons of practical deployment [1,2,3,4,5,6,7] and for theoretical understanding of artificial [8] and + biological [9] neural networks. Conventionally, pruning algorithms have focused on compressing pre-trained + models [1,2,3,5,6]. However, recent works [10,11] have identified through iterative training and pruning + cycles (iterative magnitude pruning) that there exist sparse subnetworks (winning tickets) in randomly-initialized + neural networks that, when trained in isolation, can match the test accuracy of the original network. Moreover, + its been shown that some of these winning ticket subnetworks can generalize across datasets and optimizers + [12]. While these results suggest training can be made more efficient by identifying winning ticket subnetworks + at initialization, they do not provide efficient algorithms to find them. Typically, it requires significantly more + computational costs to identify winning tickets through iterative training and pruning cycles than simply training + the original network from scratch [10,11]. 
Thus, the fundamental unanswered question is: can we identify
 + highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the
 + data? Towards this goal, we start by investigating the limitations of existing pruning algorithms at initialization
 + [13,14], determine simple strategies for avoiding these limitations, and provide a novel data-agnostic algorithm
 + that improves upon state-of-the-art results. Our main contributions are:
 + 1. We study layer-collapse, the premature pruning of an entire layer making a network untrainable, and
 + formulate the axiom Maximal Critical Compression, which posits that a pruning algorithm should avoid
 + layer-collapse whenever possible (Sec. 3).
 + 2. We demonstrate theoretically and empirically that synaptic saliency, a general class of gradient-based
 + scores for pruning, is conserved at every hidden unit and layer of a neural network (Sec. 4).
 + 3. We show that these conservation laws imply that parameters in large layers receive lower scores than
 + parameters in small layers, which elucidates why single-shot pruning disproportionately prunes the
 + largest layer, leading to layer-collapse (Sec. 4).
 + 4. We hypothesize that iterative magnitude pruning [10] avoids layer-collapse because gradient descent
 + effectively encourages the magnitude scores to observe a conservation law, which, combined with
 + iteration, results in the relative scores for the largest layers increasing during pruning (Sec. 5).
 + 5. We prove that a pruning algorithm avoids layer-collapse entirely and satisfies Maximal Critical
 + Compression if it uses iterative, positive synaptic saliency scores (Sec. 6).
 + 6. We introduce a new data-agnostic algorithm, Iterative Synaptic Flow Pruning (SynFlow), that satisfies
 + Maximal Critical Compression (Sec. 6) and demonstrate empirically that this algorithm achieves
 + state-of-the-art pruning performance on 12 distinct combinations of models and datasets (Sec. 7).
 + 
 + 2 Related work
 + 
 + While there are a variety of approaches to compressing neural networks, such as novel design of micro-
 + architectures [15,16,17], dimensionality reduction of network parameters [18,19], and training of dynamic
 + sparse networks [20, 21], in this work we will focus on neural network pruning.
 + Pruning after training. Conventional pruning algorithms assign scores to parameters in neural networks after
 + training and remove the parameters with the lowest scores [5,22,23]. Popular scoring metrics include weight
 + magnitudes [4,6], their generalization to multiple layers [24], first- [1,25,26,27] and second-order [2,3,27] Taylor
 + coefficients of the training loss with respect to the parameters, and more sophisticated variants [28,29,30].
 + While these pruning algorithms can indeed compress neural networks at test time, there is no reduction in the
 + cost of training.
 + Pruning before training. Recent works demonstrated that randomly initialized neural networks can be pruned
 + before training with little or no loss in the final test accuracy [10,13,31]. In particular, the Iterative Magnitude
 + Pruning (IMP) algorithm [10,11] repeats multiple cycles of training, pruning, and weight rewinding to identify
 + extremely sparse neural networks at initialization that can be trained to match the test accuracy of the original
 + network. While IMP is powerful, it requires multiple cycles of expensive training and pruning with very specific
 + sets of hyperparameters.
Avoiding these difficulties, a different approach uses the gradients of the training + loss at initialization to prune the network in a single-shot [13,14]. While these single-shot pruning algorithms + at initialization are much more efficient, and work as well as IMP at moderate levels of sparsity, they suffer + from layer-collapse, or the premature pruning of an entire layer rendering a network untrainable [32,33]. + Understanding and circumventing this layer-collapse issue is the fundamental motivation for our study. + + 3 Layer-collapse: the key obstacle to pruning at initialization + + Broadly speaking, a pruning algorithm at initialization is defined by two steps. The first step scores the + parameters of a network according to some metric and the second step masks the parameters (removes or + keeps the parameter) according to their scores. The pruning algorithms we consider will always mask the + parameters by simply removing the parameters with the smallest scores. This ranking process can be applied + globally across the network, or layer-wise. Empirically, its been shown that global-masking performs far better + than layer-masking, in part because it introduces fewer hyperparameters and allows for flexible pruning rates + across the network [23]. However, recent works [32,14,33] have identified a key failure mode,layer-collapse, + for existing pruning algorithms using global-masking. Layer-collapse occurs when an algorithm prunes all + parameters in a single weight layer even when prunable parameters remain elsewhere in the network. This + renders the network untrainable, evident by sudden drops in the achievable accuracy for the network as shown in + + Fig. 1. To gain insight into the phenomenon of layer-collapse we will define some useful terms inspired by a + recent paper studying the failure mode [33]. + + Given a network,compression ratio (FORMULA) is the number of parameters + in the original network divided by the number of parameters + remaining after pruning. For example, when the compression + ratio <>, then only one out of a thousand of the parameters + remain after pruning.Max compression (<>) is the maximal <
>
 + possible compression ratio for a network that does not lead to layer-collapse. For example, for a network with
 + L layers and N parameters, <>, which is the compression ratio associated with pruning all but one parameter
 + per layer. Critical compression (<>) is the maximal compression ratio a given algorithm can achieve without
 + inducing layer-collapse. In particular, the critical compression of an algorithm is always upper bounded by the
 + max compression of the network: <>. This inequality motivates the following axiom, which we postulate any
 + successful pruning algorithm should satisfy.
 + 
 + Axiom. Maximal Critical Compression. The critical compression of a pruning algorithm applied to a network
 + should always equal the max compression of that network.
 + 
 + In other words, this axiom implies a pruning algorithm should never prune a set of parameters that results in
 + layer-collapse if there exists another set of the same cardinality that will keep the network trainable. To the best
 + of our knowledge, no existing pruning algorithm with global-masking satisfies this simple axiom. Of course any
 + pruning algorithm could be modified to satisfy the axiom by introducing specialized layer-wise pruning rates.
 + However, to retain the benefits of global-masking [23], we will formulate an algorithm, Iterative Synaptic Flow
 + Pruning (SynFlow), which satisfies this property by construction. SynFlow is a natural extension of magnitude
 + pruning that preserves the total flow of synaptic strengths from input to output rather than the individual
 + synaptic strengths themselves. We will demonstrate that not only does the SynFlow algorithm achieve Maximal
 + Critical Compression, but it consistently outperforms existing state-of-the-art pruning algorithms (as shown in
 + Fig. 1 and in Sec. 7), all while not using the data.
 + 
 + Figure 1: Layer-collapse leads to a sudden drop in accuracy. Top-1 test accuracy as a function of the
 + compression ratio for a VGG-16 model pruned at initialization and trained on CIFAR-100. Colored arrows
 + represent the critical compression of the corresponding pruning algorithm. Only our algorithm, SynFlow,
 + reaches the theoretical limit of max compression (black dashed line) without collapsing the network. See
 + Sec. 7 for more details on the experiments.
 + 
 + Throughout this work, we benchmark our algorithm, SynFlow, against two simple baselines, random scoring
 + and scoring based on weight magnitudes, as well as two state-of-the-art single-shot pruning algorithms, Single-
 + shot Network Pruning based on Connection Sensitivity (SNIP) [13] and Gradient Signal Preservation (GraSP)
 + [14]. SNIP [13] is a pioneering algorithm to prune neural networks at initialization by scoring weights based
 + on the gradients of the training loss. GraSP [14] is a more recent algorithm that aims to preserve gradient
 + flow at initialization by scoring weights based on the Hessian-gradient product. Both SNIP and GraSP have
 + been thoroughly benchmarked by [14] against other state-of-the-art pruning algorithms that involve training
 + [2, 34, 10, 11, 35, 21, 20], demonstrating competitive performance.
 + 
 + 4 Conservation laws of synaptic saliency
 + 
 + In this section, we will further verify that layer-collapse is a key obstacle to effective pruning at initialization
 + and explore what is causing this failure mode. As shown in Fig. 2, with increasing compression ratios, existing
 + random, magnitude, and gradient-based pruning algorithms will prematurely prune an entire layer, making the
 + network untrainable.
Understanding why certain score metrics lead to layer-collapse is essential to improve the + design of pruning algorithms. + Random pruning prunes every layer in a network by the same amount, evident by the horizontal lines in + Fig. 2. With random pruning the smallest layer, the layer with the least parameters, is the first to be fully + pruned. Conversely, magnitude pruning prunes layers at different rates, evident by the staircase pattern in Fig. 2. + Magnitude pruning effectively prunes parameters based on the variance of their initialization, which for common + network initializations, such as Xavier [36] or Kaiming [37], are inversely proportional to the width of a layer + [33]. With magnitude pruning the widest layers, the layers with largest input or output dimensions, are the + + <
> + + Figure 2:Where does layer-collapse occur Fraction of parameters remaining at each layer of a VGG-19 + model pruned at initialization with ImageNet over a range of compression ratios (<>). A + higher transparency represents a higher compression ratio. A dashed line indicates that there is at least one layer + with no parameters, implying layer-collapse has occurred. + + + first to be fully pruned. Gradient-based pruning algorithms SNIP [13] and GraSP [14] also prune layers at + different rates, but it is less clear what the root cause for this preference is. In particular, both SNIP and GraSP + aggressively prune the largest layer, the layer with the most trainable parameters, evident by the sharp peaks + in Fig. 2. Based on this observation, we hypothesize that gradient-based scores averaged within a layer are + inversely proportional to the layer size. We examine this hypothesis by constructing a theoretical framework + grounded in flow networks. We first define a general class of gradient-based scores, prove a conservation law for + these scores, and then use this law to prove that our hypothesis of inverse proportionality between layer size and + average layer score holds exactly. + A general class of gradient-based scores.Synaptic saliency is any score metric that can be expressed as the + Hadamard product + <> (1) + + whereRis a scalar loss function of the output of a feed-forward network parameterized by . When R is the + training loss L, the resulting synaptic saliency metric is equivalent (modulo sign) to <>, the score metric + used in Skeletonization [1], one of the first network pruning algorithms. The resulting metric is also closely + related to <> the score used in SNIP [13], <> the score used in GraSP, and <> the + score used in the pruning after training algorithm Taylor-FO [27]. This general class of score metrics, while not + encompassing, exposes key properties of gradient-based scores used for pruning. + The conservation of synaptic saliency.All synaptic saliency metrics respect two surprising conservation laws + that hold at any initialization and step in training. + Theorem 1.Neuron-wise Conservation of Synaptic Saliency.For a feedforward neural network with homogenous + activation functions, <>, (e.g. ReLU, Leaky ReLU, linear), the sum of the synaptic saliency for + the incoming parameters to a hidden neuron (Sin ) is equal to the sum of the synaptic saliency for the outgoing + parameters from the hidden neuron (S_out). + + Proof.Consider the jth hidden neuron of a network with outgoing parameters out and incoming parameters P + <>, such that <> and <>. The sum of the synaptic saliency for the outgoing + parameters is + <> (2) + + The sum of the synaptic saliency for the incoming parameters is + + <> (3) + + When is homogeneous, then <> + + <
> + + Figure 3: Total score in Neuron-wise conservation of score.Each dot represents a hidden unit from the feature-extractor of a + VGG-19 model pruned at initialization with ImageNet. The location of each dot corresponds to the total score + for the unit’s incoming and outgoing parameters, <>. The black dotted line represents exact neuron-wise + conservation of score. + + <> + + Figure 4:Inverse relationship between layer size and average layer score.Each dot represents a layer from + a VGG-19 model pruned at initialization with ImageNet. The location of each dot corresponds to the layer’s + average score 4 and inverse number of elements. The black dotted line represents a perfect linear relationship. + + The neuron-wise conservation of synaptic saliency implies network conservation as well. + Theorem 2.Network-wise Conservation of Synaptic Saliency.The sum of the synaptic saliency across any + set of parameters that exactly 3 separates the input neurons x from the output neurons y of a feedforward neural + network with homogenous activation functions equals <> + We prove this theorem in Appendix 10 by applying the neuron-wise conservation law recursively. Similar + conservation properties have been noted in the neural network interpretability literature and have motivated the + construction of interpretability methods such as Conductance [38] and Layer-wise Relevance Propagation [39], + which have recently been modified for network pruning [9,40]. While the interpretability literature has focused + on attribution to the input pixels and hidden neuron activations, we have formulated conservation laws that are + more general and applicable to any parameter and neuron in a network. Remarkably, these conservation laws of + synaptic saliency apply to modern neural network architectures and a wide variety of neural network layers (e.g. + dense, convolutional, batchnorm, pooling, residual) as visually demonstrated in Fig. 3. + Conservation and single-shot pruning leads to layer-collapse.The conservation laws of synaptic saliency + provide us with the theoretical tools to validate our earlier hypothesis of inverse proportionality between layer + size and average layer score as a root cause for layer-collapse of gradient-based pruning methods. Consider the + set of parameters in a layer of a simple, fully connected neural network. This set would exactly separate the input + neurons from the output neurons. Thus, by the network-wise conservation of synaptic saliency (theorem 2), the + total score for this set is constant for all layers, implying the average is inversely proportional to the layer size. + We can empirically evaluate this relationship at scale for existing pruning methods by computing the total score + for each layer of a model, as shown in Fig. 4. While this inverse relationship is exact for synaptic saliency, other + closely related gradient-based scores, such as the scores used in SNIP and GraSP, also respect this relationship. + This validates the empirical observation that for a given compression ratio, gradient-based pruning methods will + disproportionately prune the largest layers. Thus, if the compression ratio is large enough and the pruning score + is only evaluated once, then a gradient-based pruning method will completely prune the largest layer leading to + layer-collapse. + + 3 Every element of the set is needed to separate the input neurons from the output neurons. + 4 For GraSP we negated the average layer score so that we could plot on a log-log plot. 
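 + To make the neuron-wise conservation law of this section concrete, the short check below builds a toy two-layer
 + ReLU network, takes R to be the sum of its outputs, and compares the summed synaptic saliency of each hidden
 + unit's incoming and outgoing weights. The architecture, batch size, and choice of R are arbitrary illustrations,
 + not settings used in the paper.
 + 
 + import torch
 + 
 + torch.manual_seed(0)
 + W1 = torch.randn(20, 10, requires_grad=True)    # first layer: 10 inputs -> 20 hidden units
 + W2 = torch.randn(5, 20, requires_grad=True)     # second layer: 20 hidden units -> 5 outputs
 + 
 + x = torch.randn(32, 10)                         # a random input batch
 + h = torch.relu(x @ W1.t())                      # hidden activations; ReLU is homogeneous
 + y = h @ W2.t()
 + R = y.sum()                                     # any scalar function of the output can play the role of R
 + R.backward()
 + 
 + S_in = (W1 * W1.grad).sum(dim=1)                # summed saliency of each hidden unit's incoming weights
 + S_out = (W2 * W2.grad).sum(dim=0)               # summed saliency of each hidden unit's outgoing weights
 + print(torch.allclose(S_in, S_out, atol=1e-4))   # prints True: conservation holds unit by unit
 + 
 + The two per-unit sums agree up to floating-point error, as Theorem 1 predicts for homogeneous activations.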
+ 5 Magnitude pruning avoids layer-collapse with conservation and iteration + + Having demonstrated and investigated the cause of layer-collapse + in single-shot pruning methods at initialization, we now explore + an iterative pruning method that appears to avoid the issue + entirely. Iterative Magnitude Pruning (IMP) is a recently proposed + pruning algorithm that has proven to be successful in finding + extremely sparse trainable neural networks at initialization + (winning lottery tickets) [10,11,12,41,42,43,44]. The algorithm + follows three simple steps. First train a network, second prune + parameters with the smallest magnitude, third reset the unpruned + parameters to their initialization and repeat until the desired + compression ratio. While simple and powerful, IMP is impractical as + it involves training the network several times, essentially defeating <
>
 + the purpose of constructing a sparse initialization. That being said, it does not suffer from the same catastrophic
 + layer-collapse that other pruning-at-initialization methods are susceptible to. Thus, understanding better how
 + IMP avoids layer-collapse might shed light on how to improve pruning at initialization.
 + As has been noted previously [10,11], iteration is essential for stabilizing IMP. In fact, without sufficient
 + pruning iterations, IMP will suffer from layer-collapse, evident in the sudden accuracy drops for the darker
 + curves in Fig. 5a. However, the number of pruning iterations alone does not explain how IMP avoids
 + layer-collapse. Notice that if IMP didn't train the network during each prune cycle, then, no matter the number
 + of pruning iterations, it would be equivalent to single-shot magnitude pruning. Thus, something very critical
 + must happen to the magnitude of the parameters during training that, when coupled with sufficient pruning
 + iterations, allows IMP to avoid layer-collapse. We hypothesize that gradient descent training effectively
 + encourages the scores to observe an approximate layer-wise conservation law, which, when coupled with
 + sufficient pruning iterations, allows IMP to avoid layer-collapse.
 + Gradient descent encourages conservation. To better understand the dynamics of the IMP algorithm during
 + training, we will consider a differentiable score <> algorithmically equivalent to the magnitude score.
 + Consider these scores throughout training with gradient descent on a loss function L using an infinitesimal step
 + size (i.e. gradient flow). In this setting, the temporal derivative of the parameters is equivalent to <>,
 + and thus the temporal derivative of the score is
 + 
 + <> (4)
 + 
 + Surprisingly, this is a form of synaptic saliency and thus the neuron-wise and layer-wise conservation laws
 + from Sec. 4 apply. In particular, this implies that for any two layers l and k of a simple, fully connected
 + network, <>. This invariance has been noticed before by [45] as a form of implicit
 + regularization and used to explain the empirical phenomenon that trained multi-layer models can have similar
 + layer-wise magnitudes. In the context of pruning, this phenomenon implies that gradient descent training, with a
 + small enough learning rate, encourages the squared magnitude scores to converge to an approximate layer-wise
 + conservation, as shown in Fig. 5b.
 + Conservation and iterative pruning avoid layer-collapse. As explained in Sec. 4, conservation alone
 + leads to layer-collapse by assigning parameters in the largest layers lower scores relative to parameters in
 + smaller layers. However, if conservation is coupled with iterative pruning, then when the largest layer is pruned
 + and becomes smaller, the remaining parameters of this layer will be assigned higher relative scores in subsequent
 + iterations. With sufficient iterations, conservation coupled with iteration leads to a self-balancing pruning
 + strategy allowing IMP to avoid layer-collapse. This insight on the importance of conservation and iteration
 + applies more broadly to other algorithms with exact or approximate conservation properties (e.g. Skeletonization,
 + SNIP, and GraSP as demonstrated in Sec. 3). Indeed, very recent work empirically confirms that iteration
 + improves the performance of SNIP [46].
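 + As a schematic companion to the three IMP steps described above (train, prune the smallest surviving
 + magnitudes, rewind, repeat), the sketch below shows one common way the loop is written. The
 + train_to_convergence callback and the per-round exponential keep fraction are placeholders chosen for
 + illustration; they are not specifications from this paper.
 + 
 + import copy
 + import torch
 + 
 + def imp(model, train_to_convergence, final_compression, n_rounds):
 +     init_state = copy.deepcopy(model.state_dict())           # weights to rewind to after each round
 +     masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
 +     for k in range(1, n_rounds + 1):
 +         train_to_convergence(model, masks)                   # step 1: train the masked network
 +         keep_frac = final_compression ** (-k / n_rounds)     # fraction of weights kept after round k
 +         scores = torch.cat([(p.detach().abs() * masks[n]).flatten()
 +                             for n, p in model.named_parameters()])
 +         n_keep = max(int(keep_frac * scores.numel()), 1)
 +         threshold = torch.topk(scores, n_keep).values.min()
 +         for n, p in model.named_parameters():                # step 2: prune smallest magnitudes globally
 +             masks[n] = (p.detach().abs() * masks[n] >= threshold).float()
 +         model.load_state_dict(init_state)                    # step 3: rewind unpruned weights to init
 +     return masks
 + 
 + # Usage sketch: masks = imp(my_model, my_train_fn, final_compression=100, n_rounds=5)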
 + 
 + 6 A data-agnostic algorithm satisfying Maximal Critical Compression
 + 
 + In the previous section we identified two key ingredients of IMP's ability to avoid layer-collapse: (i) approximate
 + layer-wise conservation of the pruning scores, and (ii) the iterative re-evaluation of these scores. While these
 + properties allow the IMP algorithm to identify high-performing and highly sparse, trainable neural networks,
 + it requires an impractical amount of computation to obtain them. Thus, we aim to construct a more efficient
 + pruning algorithm while still inheriting the key aspects of IMP's success. So what are the essential ingredients
 + for a pruning algorithm to avoid layer-collapse and provably attain Maximal Critical Compression? We prove
 + the following theorem in Appendix 10.
 + Theorem 3. Iterative, positive, conservative scoring achieves Maximal Critical Compression. If a pruning
 + algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the algorithm
 + re-evaluates the scores every time a parameter is pruned, then the algorithm satisfies the Maximal Critical
 + Compression axiom.
 + 
 + The Iterative Synaptic Flow Pruning (SynFlow) algorithm
 + 
 + Theorem 3 directly motivates the design of our novel pruning algorithm, SynFlow, that provably reaches Maximal
 + Critical Compression. First, the necessity for iterative score evaluation discourages algorithms that involve
 + backpropagation on batches of data, and instead motivates the development of an efficient data-independent
 + scoring procedure. Second, positivity and conservation motivate the construction of a loss function that yields
 + positive synaptic saliency scores. We combine these insights to introduce a new loss function (where 1 is the
 + all-ones vector and <> is the element-wise absolute value of the parameters in the l-th layer),
 + 
 + <> (5)
 + 
 + that yields the positive synaptic saliency scores (the element-wise product of dR_SF/dtheta with theta) that
 + we term Synaptic Flow. For a simple, fully connected network (i.e. <>), we can factor the Synaptic Flow
 + score for a parameter <> as
 + 
 + <> (6)
 + 
 + This perspective demonstrates that the Synaptic Flow score is a generalization of the magnitude score
 + (|w_ij^[l]|), where the scores consider the product of synaptic strengths flowing through each parameter, taking
 + the inter-layer interactions of parameters into account. We use the Synaptic Flow score in the Iterative Synaptic
 + Flow Pruning (SynFlow) algorithm summarized in the pseudocode below.
 + 
 + Algorithm 1: Iterative Synaptic Flow Pruning (SynFlow).
 + 
 + <>
 + 
 + Given a network <> and a specified compression ratio, the SynFlow algorithm requires only one additional
 + hyperparameter, the number of pruning iterations n. We demonstrate in Appendix 11 that an exponential pruning
 + schedule <> with n = 100 pruning iterations essentially prevents layer-collapse whenever avoidable (Fig. 1),
 + while remaining computationally feasible, even for large networks.
 + 
 + 7 Experiments
 + 
 + We empirically benchmark the performance of our algorithm, SynFlow (red), against the baselines random
 + pruning and magnitude pruning, as well as the state-of-the-art algorithms SNIP [13] and GraSP [14]. In Fig. 6,
 + we test the five algorithms on 12 distinct combinations of modern architectures (VGG-11, VGG-16, ResNet-
 + 18, WideResNet-18) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) over an exponential sweep of
 + compression ratios (<>). See Appendix 12 for more details and hyperparameters
 + of the experiments.
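 + Before turning to the results, the following is a minimal, data-free sketch of the SynFlow scoring loop as we
 + read it from Sec. 6 and Algorithm 1. It is a re-implementation from the description above, not the authors'
 + released code; the helper names and the exponential keep-fraction schedule compression ** (-k / n_iters) are
 + our own choices.
 + 
 + import torch
 + 
 + @torch.no_grad()
 + def apply_masks(params, masks):
 +     # zero out pruned entries in place
 +     for p, m in zip(params, masks):
 +         p.mul_(m)
 + 
 + def synflow_prune(model, input_shape, compression, n_iters=100):
 +     params = [p for p in model.parameters() if p.requires_grad]
 +     masks = [torch.ones_like(p) for p in params]
 +     with torch.no_grad():
 +         signs = [p.sign() for p in params]
 +         for p in params:
 +             p.abs_()                                  # linearize the network: work with |theta|
 +     ones = torch.ones(1, *input_shape)                # all-ones input; no training data is used
 +     for k in range(1, n_iters + 1):
 +         apply_masks(params, masks)
 +         model.zero_grad()
 +         R = model(ones).sum()                         # a stand-in for the SynFlow objective R_SF
 +         R.backward()                                  # (for very deep models, consider double precision)
 +         scores = torch.cat([(p * p.grad).flatten() for p in params])
 +         keep = max(int(scores.numel() * compression ** (-k / n_iters)), 1)
 +         threshold = torch.topk(scores, keep).values.min()
 +         masks = [((p * p.grad) >= threshold).float() for p in params]
 +     with torch.no_grad():
 +         for p, s, m in zip(params, signs, masks):
 +             p.copy_(p * s * m)                        # restore the original signs and apply the final mask
 +     return masks
 + 
 + Only the all-ones input and the parameter signs enter the loop, so the procedure never touches training data,
 + consistent with the data-agnostic claim above.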
Consistently, SynFlow outperforms the other algorithms in the high compression regime + (10 1:5 < ) and demonstrates significantly more stability, as indicated by its tight intervals. Furthermore, + SynFlow is the only algorithm that reliably shows better performance to the random pruning baseline: SNIP and + GraSP perform significantly worse than random pruning with ResNet-18 and WideResNet-18 trained on Tiny + ImageNet. SynFlow is also quite competitive in the low compression regime (<>). Although magnitude + pruning can partially outperform SynFlow in this regime with models trained on Tiny ImageNet, it suffers from + catastrophic layer-collapse as indicated by the sharp drops in accuracy. + + <> + + Figure 6:SynFlow consistently outperforms other pruning methods.Top-1 test accuracy as a function of + different compression ratios over 12 distinct combinations of models and datasets. We performed three runs + with the same hyperparameter conditions and different random seeds. The solid line represents the mean, the + shaded region represents area between minimum and maximum performance of the three runs. + + + 8 Conclusion + + In this paper, we developed a unifying theoretical framework that explains why existing single-shot pruning + algorithms at initialization suffer from layer-collapse. We applied our framework to elucidate how iterative + magnitude pruning [10] overcomes layer-collapse to identify winning lottery tickets at initialization. Building + on the theory, we designed a new data-agnostic pruning algorithm, SynFlow, that provably avoids layer-collapse + and reaches Maximal Critical Compression. Finally, we empirically confirmed that our SynFlow algorithm + consistently performs better than existing algorithms across 12 distinct combinations of models and datasets, + despite the fact that our algorithm is data-agnostic and requires no pre-training. Promising future directions + for this work are to (i) explore a larger space of potential pruning algorithms that satisfy Maximal Critical + Compression, (ii) harness SynFlow as an efficient way to compute appropriate per-layer compression ratios to + combine with existing scoring metrics, and (iii) incorporate pruning as a part of neural network initialization + schemes. Overall, our data-agnostic pruning algorithm challenges the existing paradigm that data must be used + to quantify which synapses of a neural network are important. + + + 9 Acknowledgements + + We thank Jonathan M. Bloom, Weihua Hu, Javier Sagastuy-Brena, Chengxu Zhuang, and members of the + Stanford Neuroscience and Artificial Intelligence Laboratory for helpful discussions. We thank the Stanford + Data Science Scholars program (DK), the Burroughs Welcome, Simons and James S. McDonnell foundations, + and an NSF career award (SG) for support. + + References + [1]Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network + via relevance assessment. InAdvances in neural information processing systems, pages 107–115, 1989. + + [2]Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. InAdvances in neural information + processing systems, pages 598–605, 1990. + + [3]Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. + InAdvances in neural information processing systems, pages 164–171, 1993. + + [4]Steven A Janowsky. Pruning versus clipping in neural networks.Physical Review A, 39(12):6600, 1989. + + [5]Russell Reed. 
Pruning algorithms-a survey.IEEE transactions on Neural Networks, 4(5):740–747, 1993. + + [6]Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient + neural network. InAdvances in neural information processing systems, pages 1135–1143, 2015. + + [7]Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: + A tutorial and survey.Proceedings of the IEEE, 105(12):2295–2329, 2017. + + [8]Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep + nets via a compression approach. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th + International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, + pages 254–263. PMLR, 2018. + + [9]Hidenori Tanaka, Aran Nayebi, Niru Maheswaranathan, Lane McIntosh, Stephen Baccus, and Surya + Ganguli. From deep learning to mechanistic understanding in neuroscience: the structure of retinal + prediction. InAdvances in Neural Information Processing Systems, pages 8535–8545, 2019. + + [10]Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural + networks. InInternational Conference on Learning Representations, 2019. + + [11]Jonathan Frankle, G Karolina Dziugaite, DM Roy, and M Carbin. Stabilizing the lottery ticket hypothesis. + arXiv, page. + + [12] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing + lottery ticket initializations across datasets and optimizers. InAdvances in Neural Information Processing + Systems, pages 4933–4943, 2019. + + [13]Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: SINGLE-SHOT NETWORK PRUNING + BASED ON CONNECTION SENSITIVITY. InInternational Conference on Learning Representations, + 2019. + + [14]Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving + gradient flow. InInternational Conference on Learning Representations, 2020. + + [15]Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. + Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size.arXiv preprint + arXiv:1602.07360, 2016. + + [16]Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco + Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision + applications.arXiv preprint arXiv:1704.04861, 2017. + + [17]Ameya Prabhu, Girish Varma, and Anoop Namboodiri. Deep expander networks: Efficient deep networks + from graph theory. InProceedings of the European Conference on Computer Vision (ECCV), pages 20–35, + 2018. + [18]Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with + low rank expansions. InProceedings of the British Machine Vision Conference. BMVA Press, 2014. doi: + http://dx.doi.org/10.5244/C.28.88. + [19]Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. + InAdvances in neural information processing systems, pages 442–450, 2015. + [20]Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very + sparse deep networks. InInternational Conference on Learning Representations, 2018. + [21]Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and + Antonio Liotta. 
Scalable training of artificial neural networks with adaptive sparse connectivity inspired by + network science.Nature communications, 9(1):1–12, 2018. + [22]Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks.arXiv preprint + arXiv:1902.09574, 2019. + [23]Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural + network pruning?arXiv preprint arXiv:2003.03033, 2020. + [24]Sejun Park*, Jaeho Lee*, Sangwoo Mo, and Jinwoo Shin. Lookahead: A far-sighted alternative of + magnitude-based pruning. InInternational Conference on Learning Representations, 2020. + [25]Ehud D Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE + transactions on neural networks, 1(2):239–242, 1990. + [26]Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural + networks for resource efficient inference.arXiv preprint arXiv:1611.06440, 2016. + [27]Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation + for neural network pruning. InProceedings of the IEEE Conference on Computer Vision and Pattern + Recognition, pages 11264–11272, 2019. + [28]Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. InAdvances in + neural information processing systems, pages 1379–1387, 2016. + [29]Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal + brain surgeon. InAdvances in Neural Information Processing Systems, pages 4857–4867, 2017. + [30]Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching- + Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In + Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, + 2018. + [31]Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network + pruning. InInternational Conference on Learning Representations, 2019. + [32]Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A signal propaga- + tion perspective for pruning neural networks at initialization. InInternational Conference on Learning + Representations, 2020. + [33]Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, + Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep + networks. InInternational Conference on Learning Representations, 2020. + [34]Wenyuan Zeng and Raquel Urtasun. Mlprune: Multi-layer pruning for automated neural network compres- + sion. 2018. + [35]Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by + dynamic sparse reparameterization. InProceedings of the 36th International Conference on Machine + Learning, volume 97 ofProceedings of Machine Learning Research, pages 4646–4655. PMLR, 2019. + [36]Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural + networks. InProceedings of the thirteenth international conference on artificial intelligence and statistics, + pages 249–256, 2010. + [37]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. + InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. + [38]Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron. 
InInternational + Conference on Learning Representations, 2019. + [39]Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and + Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance + propagation.PloS one, 10(7), 2015. + [40]Seul-Ki Yeom, Philipp Seegerer, Sebastian Lapuschkin, Simon Wiedemann, Klaus-Robert Müller, and + Wojciech Samek. Pruning by explaining: A novel criterion for deep neural network pruning.arXiv preprint + arXiv:1912.08881, 2019. + [41]Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, + and the supermask. InAdvances in Neural Information Processing Systems, pages 3592–3602, 2019. + [42]Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, + Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep + networks. InInternational Conference on Learning Representations, 2020. + [43]Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Linear mode connectiv- + ity and the lottery ticket hypothesis.arXiv preprint arXiv:1912.05671, 2019. + [44]Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and + multiple languages: lottery tickets in rl and nlp. InInternational Conference on Learning Representations, + 2020. + [45]Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: + Layers are automatically balanced. InAdvances in Neural Information Processing Systems, pages 384–395, + 2018. + [46]Stijn Verdenius, Maarten Stol, and Patrick Forré. Pruning via iterative ranking of sensitivity statistics. + arXiv preprint arXiv:2006.00896, 2020. + + Appendix + + 10 Proofs + + We provide a proof for Theorem 2 which we rewrite below. + Theorem 2.Network-wise Conservation of Synaptic Saliency.The sum of the synaptic saliency across any set + of parameters that exactly separates the input neuronsxfrom the output neuronsyof a feedforward neural + network with homogenous activation functions equals <> + + Proof.We begin by defining the set of neurons (V) and the set of prunable parameters (E) for a neural network. + Consider a subset of the neurons <>, such that all output neuronsyc 2Sand all input neuronsxi 2VnS. + Consider the set of parameters cut by this partition + + <> (7) + + By theorem 1, we know that that sum of the synaptic saliency over C(S) is equal to the sum of the synaptic + saliency over the set of parameters adjacent toC(S)and between neurons in <>. + Continuing this argument, then eventually we get that this sum must be equal to the sum of the synaptic saliency + over the set of parameters incident to the output neuronsy, which is + + <> (8) + + We can repeat this argument iterating through the setVnStill we reach the input neuronsxto show that this + sum is also equal to <> + We provide a proof for Theorem 3 which we rewrite below. + + Theorem 3. Iterative, positive, conservative scoring achieves Maximal Critical Compression.If a pruning + algorithm, with global-masking, assigns positive scores that respect layer-wise conservation and if the algorithm + re-evaluates the scores every time a parameter is pruned, then the algorithm satisfies the Maximal Critical + Compression axiom. + + Proof.We prove this theorem by contradiction. 
Assume that a pruning algorithm with global-masking and
 + iterative, positive, conservative scoring does not satisfy the Maximal Critical Compression axiom. This implies
 + that at some iteration, the algorithm will prune the last parameter in a layer (layer l), despite there being more
 + than one parameter <> in another layer (layer k). Because the algorithm uses global-masking, the score for
 + the last parameter in layer l, S^[l], is less than or equal to the score of each parameter, S_i^[k], in layer k,
 + 
 + <> (9)
 + 
 + Because the scores respect layer-wise conservation, S^[l] = sum_{i=1}^{N^[k]} S_i^[k]. This implies, by the
 + positivity of the scores and because N^[k] > 1, that for all i,
 + 
 + <> (10)
 + 
 + This contradicts the previous inequality.
 + 
 + 11 Hyperparameter choices for the SynFlow algorithm
 + 
 + Theorem 3 required that an algorithm re-evaluate the scores every time a parameter is pruned. However,
 + Theorem 2 provides a theoretical insight to drastically reduce the number of iterations needed to practically
 + attain Maximal Critical Compression. We now introduce a modification to Theorem 3 that motivates practical
 + hyperparameter choices used in the SynFlow algorithm.
 + Theorem 4. Achieving Maximal Critical Compression practically. If a pruning algorithm, with global-
 + masking, assigns positive scores that respect layer-wise conservation and if the prune size, the total score for the
 + parameters pruned at any iteration, is strictly less than the cut size, the total score for an entire layer, whenever
 + possible, then the algorithm satisfies the Maximal Critical Compression axiom.
 + 
 + Proof. We prove this theorem by contradiction. Assume there is an iterative pruning algorithm that uses positive,
 + layer-wise conserved scores and maintains that the prune size at any iteration is less than the cut size whenever
 + possible, but doesn't satisfy the Maximal Critical Compression axiom. At some iteration the algorithm will
 + prune a set of parameters containing a subset separating the input neurons from the output neurons, despite
 + there existing a set of the same cardinality that does not lead to layer-collapse. By Theorem 2, the total score
 + for the separating subset is <>, which implies, by the positivity of the scores, that the total prune size is at
 + least <>. This contradicts the assumption that the algorithm maintains that the prune size at any iteration is
 + always strictly less than the cut size whenever possible.
 + 
 + Motivated by Theorem 4, we can now choose a practical, yet effective, number of pruning iterations (n) and
 + schedule for the compression ratios <> applied at each iteration (k) for the SynFlow algorithm. Two natural
 + candidates for a compression schedule would be either linear <> or exponential <>. Empirically
 + we find that the SynFlow algorithm with 100 pruning iterations and an exponential compression schedule
 + satisfies the conditions of Theorem 4 over a reasonable range of compression ratios <>, as
 + shown in Fig. 7b. This is not true if we use a linear schedule for the compression ratios, as shown in Fig. 7a.
 + Interestingly, Iterative Magnitude Pruning also uses an exponential compression schedule, but does not provide
 + a thorough explanation for this hyperparameter choice [10].
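 + As a small worked example of these two candidate schedules (the exact formulas are elided above, so the forms
 + below, a linearly growing ratio and the exponential rho^(k/n), are our reading rather than a quotation), the
 + snippet prints the largest single-step prune fraction for rho = 10^3 and n = 100:
 + 
 + import numpy as np
 + 
 + rho, n = 1e3, 100                                  # overall compression ratio and number of pruning iterations
 + k = np.arange(1, n + 1)
 + rho_linear = 1 + (rho - 1) * k / n                 # assumed linear schedule: ratio grows linearly to rho
 + rho_exp = rho ** (k / n)                           # assumed exponential schedule: ratio grows geometrically
 + for sched, name in [(rho_linear, "linear"), (rho_exp, "exponential")]:
 +     kept = 1.0 / sched                             # fraction of parameters kept after iteration k
 +     pruned_per_step = -np.diff(np.concatenate([[1.0], kept]))
 +     print(name, "largest single-step prune fraction:", pruned_per_step.max().round(4))
 + 
 + With these assumed forms, the linear schedule removes roughly 0.91 of all weights in its first step, while the
 + exponential schedule never removes more than about 0.07 per step, which is consistent with the prune-size
 + condition of Theorem 4 and the comparison in Fig. 7.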
<>
 + 
 + Figure 7: Choosing the number of pruning iterations and compression schedule for SynFlow. Maximum
 + ratio of prune size to cut size for an increasing number of pruning iterations for SynFlow with a linear (left) or
 + exponential (right) compression schedule. Higher transparency represents higher compression ratios. The black
 + dotted line represents the maximal prune size ratio that can be obtained while still satisfying the conditions of
 + Theorem 4. All data is from a VGG-19 model at initialization using ImageNet.
 + 
 + Potential numerical instability. The SynFlow algorithm involves computing the SynFlow objective, <>,
 + whose singular values may vanish or explode exponentially with depth L. This may lead to potential numerical
 + instability for very deep networks, although we did not observe this for the models presented in this paper.
 + One way to address this potential challenge would be to appropriately scale network parameters at each layer
 + to maintain stability. Because the SynFlow algorithm is scale invariant at each layer <>, this modification
 + will not affect the performance of the algorithm.
 + 
 + 12 Experimental details
 + 
 + An open source version of our code and the data used to generate all the figures in this paper are available at
 + github.com/ganguli-lab/Synaptic-Flow.
 + 
 + 12.1 Pruning algorithms
 + 
 + All pruning algorithms we considered in our experiments use the following two steps: (i) scoring parameters,
 + and (ii) masking parameters globally across the network with the lowest scores. Here we describe details of how
 + we computed scores used in each of the pruning algorithms.
 + Random: We sampled independently from a standard Gaussian.
 + Magnitude: We computed the absolute value of the parameters.
 + SNIP: We computed the score <> using a random subset of the training dataset with a size ten times the
 + number of classes, namely 100 for CIFAR-10, 1000 for CIFAR-100, 2000 for Tiny ImageNet, and 10000 for
 + ImageNet. The score was computed on a batch of size 256 for CIFAR-10/100, 64 for Tiny ImageNet, and 16 for
 + ImageNet, then summed across batches to obtain the score used for pruning.
 + GraSP: We computed the score <> using a random subset of the training dataset with a size ten times the
 + number of classes, namely 100 for CIFAR-10, 1000 for CIFAR-100, 2000 for Tiny ImageNet, and 10000 for
 + ImageNet. The score was computed on a batch of size 256 for CIFAR-10/100, 64 for Tiny ImageNet, and 16 for
 + ImageNet, then summed across batches to obtain the score used for pruning.
 + SynFlow: We applied Algorithm 1 with 100 pruning iterations, motivated by the theoretical and empirical
 + results discussed in Sec. 11.
 + 
 + 12.2 Model architectures
 + 
 + We adapted standard implementations of VGG-11 and VGG-16 from OpenLTH, and ResNet-18 and WideResNet-
 + 18 from PyTorch models. We considered all weights from convolutional and linear layers of these models as
 + prunable parameters, but did not prune biases nor the parameters involved in batchnorm layers. For convolutional
 + and linear layers, the weights were initialized with a Kaiming normal strategy and the biases were set to zero.
 + 
 + 12.3 Training hyperparameters
 + 
 + Here we provide hyperparameters that we used to train the models presented in Fig. 1 and Fig. 6. These
 + hyperparameters were chosen for the performance of the original model and were not optimized for the
 + performance of the pruned networks.
 + 
 + <>
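 + To make the two-step procedure of Sec. 12.1 concrete, the sketch below scores parameters and applies a single
 + global mask. The SNIP-style score follows the usual |theta * dL/dtheta| form reported in the literature; the
 + helper names and single-batch handling are simplifications of ours, not the exact experimental code.
 + 
 + import torch
 + import torch.nn.functional as F
 + 
 + def snip_style_scores(model, batch):
 +     # one labelled batch (inputs, integer class targets) is enough for this illustration
 +     x, y = batch
 +     model.zero_grad()
 +     loss = F.cross_entropy(model(x), y)
 +     loss.backward()
 +     return {n: (p.detach() * p.grad).abs() for n, p in model.named_parameters() if p.requires_grad}
 + 
 + def magnitude_scores(model):
 +     return {n: p.detach().abs() for n, p in model.named_parameters() if p.requires_grad}
 + 
 + def global_mask(scores, compression):
 +     # keep the top 1/compression fraction of scores across the whole network
 +     flat = torch.cat([s.flatten() for s in scores.values()])
 +     n_keep = max(int(flat.numel() / compression), 1)
 +     threshold = torch.topk(flat, n_keep).values.min()
 +     return {n: (s >= threshold).float() for n, s in scores.items()}
 + 
 + # Usage sketch:
 + # masks = global_mask(magnitude_scores(model), compression=100)
 + # for n, p in model.named_parameters():
 + #     p.data.mul_(masks[n])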
 + <> <> <>
\ No newline at end of file
diff --git a/Corpus/MOGRIFIER LSTM.txt b/Corpus/MOGRIFIER LSTM.txt
deleted file mode 100644
index c75f02e556613ac0ba17a26c4bcd7665b7a421f9..0000000000000000000000000000000000000000
GIT binary patch
diff --git a/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt b/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt
deleted file mode 100644
index 5741d6c..0000000
--- a/Corpus/Model Compression and Acceleration for Deep Neural Networks - Yu Cheng.txt
+++ /dev/null
@@ -1,1145 +0,0 @@
Deep Learning for Visual Understanding: Part 2

Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang

Model Compression and Acceleration for Deep Neural Networks
The principles, progress,
and challenges - - - - - - - - - - - - - In recent years, deep neural networks (DNNs) have received - increased attention, have been applied to different applica- - tions, and achieved dramatic accuracy improvements in many - tasks. These works rely on deep networks with millions or even - billions of parameters, and the availability of graphics process- - ing units (GPUs) with very high computation capability plays - a key role in their success. For example, Krizhevsky et al. [1] - achieved breakthrough results in the 2012 ImageNet Challenge - using a network containing 60 million parameters with five - convolutional layers and three fully connected layers. Usu- - ally, it takes two to three days to train the whole model on the - ImagetNet data set with an NVIDIA K40 machine. In another - example, the top face-verification results from the Labeled - Faces in the Wild (LFW) data set were obtained with networks - containing hundreds of millions of parameters, using a mix - of convolutional, locally connected, and fully connected layers - [2], [3]. It is also very time-consuming to train such a model - to obtain a reasonable performance. In architectures that only - rely on fully connected layers, the number of parameters can - grow to billions [4]. - - - Introduction - As larger neural networks with more layers and nodes are - considered, reducing their storage and computational cost - becomes critical, especially for some real-time applications ©Istockphoto.com/zapp2photo - such as online learning and incremental learning. In addition, - recent years witnessed significant progress in virtual real- - ity, augmented reality, and smart wearable devices, creating - unprecedented opportunities for researchers to tackle fun- - damental challenges in deploying deep-learning systems to - portable devices with limited resources [e.g., memory, central - processing units (CPUs), energy, bandwidth]. Efficient deep- - learning methods can have a significant impact on distributed - systems, embedded devices, and field-programmable gate ar- - ray (FPGA) for artificial intelligence (AI). For example, the - residual network-50 (ResNet-50) [5], which has 50 convolu- - tional layers, needs more than 95 megabytes of memory for Digital Object Identifier 10.1109/MSP.2017.2765695 - Date of publication: 9 January 2018 storage, and numerous floating number multiplications for - - - 126 IEEE SIgnal ProcESSIng MagazInE | January 2018 | 1053-5888/18©2018IEEE calculating each image. After discarding As larger neural networks volutional layers only. Low-rank factoriza- - some redundant weights, the network still with more layers and tion and transferred/compact filters-based - works as usual but saved more than 75% of nodes are considered, approaches provide an end-to-end pipeline - parameters and 50% computational time. reducing their storage and can be easily implemented in a CPU/ - For devices like cell phones and FPGAs GPU environment, which is straightfor- - with only several megabyte resources, how and computational ward, while parameter pruning and sharing - to compact the models used on them is cost becomes critical, use different methods such as vector quan- - also important. especially for some real- tization, binary coding, and sparse con- - Achieving these goals calls for joint time applications such straints to perform the task. Usually, it will - solutions from many disciplines, including as online learning and take several steps to achieve the goal. 
- but not limited to machine learning, opti- incremental learning. Regarding training protocols, models - mization, computer architecture, data com- based on parameter pruning/sharing low- - pression, indexing, and hardware design. rank factorization can be extracted from - In this article, we review recent works on compressing and pretrained ones or trained from scratch, while the transferred/ - accelerating DNNs, which attracted much attention from the compact filter and KD models can only support training from - deep-learning community and has already achieved signifi- scratch. These methods are independently designed and com- - cant progress in past years. plement each other. For example, transferred layers and pa- - We classify these approaches into four categories: rameter pruning and sharing can be used together, and model - 1) Parameter pruning and sharing: The parameter pruning quantization and binarization can be used together with low- - and sharing-based methods explore the redundancy in the rank approximations to achieve further speedup. We will de- - model parameters and try to remove the redundant and scribe the details of each theme and their properties, strengths, - noncritical ones. and drawbacks in the following sections. - 2) Low-rank factorization: Low-rank factorization-based - techniques use matrix/tensor decomposition to estimate the Parameter pruning and sharing - informative parameters of the deep convolutional neural An early work that showed that network pruning is effective in - networks (CNNs). reducing the network complexity and addressed the overfitting - 3) Transferred/compact convolutional filters: The trans- problem is [6]. Since then, it has been widely studied to compress - ferred/compact convolutional filters-based approaches DNN models, trying to remove parameters that are not crucial to - design special structural convolutional filters to reduce the the model performance. These techniques can be further classi- - storage and computation complexity. fied into three categories: model quantization and binarization, - 4) Knowledge distillation (KD): The KD methods learn a dis- parameter sharing, and structural matrix. - tilled model and train a more compact neural network to - reproduce the output of a larger network. Quantization and binarization - In Table 1, we briefly summarize these four types of meth- Network quantization compresses the original network by - ods. Generally, the parameter pruning and sharing, low-rank reducing the number of bits required to represent each weight. - factorization, and KD approaches can be used in DNNs with Gong et al. [6] and Wu et al. [7] applied k-means scalar quanti- - fully connected layers and convolutional layers, achieving zation to the parameter values. Vanhoucke et al. [8] showed that - comparable performances. On the other hand, methods using 8-bit quantization of the parameters can result in significant - transferred/compact filters are designed for models with con- speedup with minimal loss of accuracy. The work in [9] used - - - - - Table 1. A summary of different approaches for network compression. 
Theme Name | Description | Applications | More Details
Parameter pruning and sharing | Reducing redundant parameters that are not sensitive to the performance | Convolutional layer and fully connected layer | Robust to various settings, can achieve good performance, can support both training from scratch and pretrained model
Low-rank factorization | Using matrix/tensor decomposition to estimate the informative parameters | Convolutional layer and fully connected layer | Standardized pipeline, easily implemented, can support both training from scratch and pretrained model
Transferred/compact convolutional filters | Designing special structural convolutional filters to save parameters | Only for convolutional layer | Algorithms are dependent on applications, usually achieve good performance, only support training from scratch
KD | Training a compact neural network with distilled knowledge of a large model | Convolutional layer and fully connected layer | Model performances are sensitive to applications and network structure, only support training from scratch

16-bit fixed-point representation in stochastic rounding-based CNN training, which significantly reduced memory usage and float-point operations with little loss in classification accuracy. The method proposed in [10] first pruned the unimportant connections and retrained the sparsely connected networks. Then it quantized the link weights using weight-sharing, and then applied Huffman coding to the quantized weights as well as the codebook to further reduce the rate. As shown in Figure 1, it starts by learning the connectivity via normal network training, followed by pruning the small-weight connections. Finally, the network is retrained to learn the final weights for the remaining sparse connections. This work achieves the state-of-the-art performance among all parameter quantization-based methods. It was shown in [11] that Hessian weight could be used to measure the importance of network parameters, and the authors proposed to minimize Hessian-weighted quantization errors on average for clustering network parameters. A novel quantization framework was introduced in [12], which reduced the precision of network weights to ternary values.
In the extreme case of 1-bit representation of each weight, i.e., binary weight neural networks, there are also many works that directly train CNNs with binary weights; for instance, BinaryConnect [13], BinaryNet [14], and XNORNetworks [15]. The main idea is to directly learn binary weights or activations during the model training. The systematic study in [16] showed that networks trained with backpropagation could be robust against, or resilient to, specific weight distortions, including binary weights.

Drawbacks
However, the accuracy of such binary nets is significantly lowered when dealing with large CNNs such as GoogleNet. Another drawback of these binary nets is that existing binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the accuracy loss. To address this issue, the work in [17] proposed a proximal Newton algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. The work in [18] significantly reduced the time on float-point multiplication in the training stage by stochastically binarizing weights and converting multiplications in the hidden state computation to sign changes.
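To make the scalar-quantization idea concrete, the following is a minimal NumPy sketch of k-means weight quantization: weights are clustered into a small shared codebook and replaced by centroid indices, so only the codebook plus low-bit indices need to be stored. It is only an illustration of the general technique, not the exact procedure of [6], [7], or [10]; the function name, the 16-cluster setting, and the random example layer are assumptions made here.

# Illustrative k-means scalar quantization of a weight matrix (a sketch, not the
# exact procedure of any cited work). Weights are replaced by the nearest of K
# shared centroids, so only K float values plus per-weight integer indices are stored.
import numpy as np

def kmeans_quantize(weights, n_clusters=16, n_iter=20, seed=0):
    """Return (codebook, indices) so that codebook[indices] approximates weights."""
    rng = np.random.default_rng(seed)
    flat = weights.ravel()
    # Initialize centroids by sampling existing weight values.
    codebook = rng.choice(flat, size=n_clusters, replace=False)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        indices = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        # Recompute each centroid as the mean of its assigned weights.
        for k in range(n_clusters):
            members = flat[indices == k]
            if members.size > 0:
                codebook[k] = members.mean()
    indices = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
    return codebook, indices.reshape(weights.shape).astype(np.uint8)

W = np.random.randn(256, 128).astype(np.float32)   # a stand-in fully connected layer
codebook, idx = kmeans_quantize(W, n_clusters=16)  # 16 centroids -> 4-bit indices
W_quantized = codebook[idx]
print("mean absolute quantization error:", np.abs(W - W_quantized).mean())

With 16 centroids each index fits in 4 bits; the sketch stores them as uint8 only for simplicity, and a real implementation would pack the indices and, as in [10], follow with entropy coding of indices and codebook.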
Pruning and sharing
Network pruning and sharing has been used both to reduce network complexity and to address the overfitting issue. An early approach to pruning was biased weight decay [19]. The optimal brain damage [20] and the optimal brain surgeon [21] methods reduced the number of connections based on the Hessian of the loss function, and their works suggested that such pruning gave higher accuracy than magnitude-based pruning such as the weight decay method. Those methods supported training from scratch.
A recent trend in this direction is to prune redundant, noninformative weights in a pretrained CNN model. For example, Srinivas and Babu [22] explored the redundancy among neurons and proposed a data-free pruning method to remove redundant neurons. Han et al. [23] proposed to reduce the total number of parameters and operations in the entire network. Chen et al. [24] proposed a HashedNets model that used a low-cost hash function to group weights into hash buckets for parameter sharing. The deep compression method in [10] removed the redundant connections and quantized the weights and then used Huffman coding to encode the quantized weights. In [25], a simple regularization method based on soft weight-sharing was proposed, which included both quantization and pruning in one simple (re)training procedure. It is worth noting that the aforementioned pruning schemes typically produce connection pruning in CNNs.

Figure 1. The three-stage compression method proposed in [10]: pruning, quantization, and encoding. The input is the original model, and the output is the compression model. (Figure blocks: Original Network; Train Connectivity; Prune Connections; Train Weights; Cluster the Weights; Generate Codebook; Quantize the Weights with Codebook; Retrain Codebook; Encode Weights; Encode Index; Compressed Network.)

There is also growing interest in training compact CNNs with sparsity constraints. Those sparsity constraints are typically introduced in the optimization problem as l0- or l1-norm regularizers. The work in [26] imposed group sparsity constraints on the convolutional filters to achieve structured brain damage, i.e., pruning entries of the convolution kernels in a group-wise fashion. In [27], a group-sparse regularizer on neurons was introduced during the training stage to learn compact CNNs with reduced filters. Wen et al. [28] added a structured sparsity regularizer on each layer to reduce trivial filters, channels, or even layers. In filter-level pruning, all of the aforementioned works used l2,1-norm regularizers. The work in [29] used l1-norm to select and prune unimportant filters.
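The prune-then-retrain recipe described above can be sketched as follows. This is a generic magnitude-based connection-pruning illustration, not the specific criterion of any cited work; the helper name, the 75% sparsity level, and the random example weights are illustrative assumptions.

# A minimal sketch of magnitude-based connection pruning: weights whose absolute
# value falls below a percentile threshold are zeroed out, and the returned mask
# is reapplied during retraining so pruned connections stay at zero.
# Names and the sparsity level are illustrative, not taken from any cited paper.
import numpy as np

def prune_by_magnitude(weights, sparsity=0.75):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

W = np.random.randn(512, 512).astype(np.float32)
W_pruned, mask = prune_by_magnitude(W, sparsity=0.75)
print("remaining connections:", int(mask.sum()), "of", mask.size)

# During fine-tuning, the mask would be reapplied after each gradient update, e.g.:
#   W = (W - learning_rate * grad) * mask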
Drawbacks
There are some potential issues of the pruning and sharing works. First, pruning with l1 or l2 regularization requires more iterations to converge. Furthermore, all pruning criteria require manual setup of sensitivity for layers, which demands fine-tuning of the parameters and could be cumbersome for some applications.

Designing the structural matrix
In architectures that contain only fully connected layers, the number of parameters can grow up to billions [4]. Thus, it is critical to explore this redundancy of parameters in fully connected layers, which is often the bottleneck in terms of memory consumption. These network layers use the nonlinear transforms f(x, M) = σ(Mx), where σ(·) is an element-wise nonlinear operator, x is the input vector, and M is the m×n matrix of parameters. When M is a large general dense matrix, the cost of storing the mn parameters and computing matrix-vector products is O(mn). Thus, an intuitive way to prune parameters is to impose M as a parameterized structural matrix. An m×n matrix that can be described using much fewer parameters than mn is called a structured matrix. Typically, the structure should not only reduce the memory cost but also dramatically accelerate the inference and training stage via fast matrix-vector multiplication and gradient computations.
Following this direction, the work in [30] proposed a simple and efficient approach based on circulant projections, while maintaining competitive error rates. Thus the memory cost becomes O(d) instead of O(d^2). This circulant structure also enables the use of fast Fourier transform (FFT) to speed up the computation. Given a d-dimensional vector r, the 1-layer circulant neural network in (1) has time complexity of O(d log d).
In [31], a novel adaptive fastfood transform was introduced to reparameterize the matrix-vector multiplication of fully connected layers. The adaptive fastfood transform matrix R ∈ R^(n×d) was defined as

R = SHGPHB,   (2)

where S, G, and B are random diagonal matrices, P ∈ {0,1}^(d×d) is a random permutation matrix, and H denotes the Walsh–Hadamard matrix. Reparameterizing a fully connected layer with d inputs and n outputs using the adaptive fastfood transform reduces the storage and the computational costs from O(nd) to O(n) and from O(nd) to O(n log d), respectively.
The work in [32] showed the effectiveness of the new notion of parsimony in the theory of structured matrices. Their proposed method can be extended to various other structured matrix classes, including block and multilevel Toeplitz-like [33] matrices related to multidimensional convolution [34].

Drawbacks
One potential problem of this kind of approach is that the structural constraint will cause loss in accuracy since the constraint might bring bias to the model. On the other hand, how to find a proper structural matrix is difficult. There is no theoretical way from which to derive it.

Low-rank factorization and sparsity
As convolution operations constitute the bulk of all computations in CNNs, simplifying the convolution layer would have a direct impact on the overall speedup. The convolution kernels in a typical CNN form a four-dimensional tensor. The key observation is that there might be a significant amount of redundancy in the tensor. Ideas based on tensor decomposition seem to be a particularly promising way to remove the redundancy. Regarding the fully connected layer, it can be viewed as a two-dimensional (2-D) matrix and the low-rankness can also help.
Using low-rank filters to accelerate convolution has a long history. Typical examples include high-dimensional discrete cosine transform (DCT) and wavelet systems constructed from one-dimensional (1-D) DCT transform and 1-D wave-
Given a vector lets, respectively, using tensor products. In the context of - r=(,rr 01 ,,frd-1 ), a circulant matrix RR! dd# is defined as dictionary learning, Rigamonti et al. [35] suggested learning - separable 1-D filters. In [36], a few low-rank approximation Rr0 rd 1 g r VS - 2 r1 W and clustering schemes for the convolutional kernels were - Sr1 r0 rd 1 r W proposed. They achieved 2# speedup for a single convolu- - Rr (circ ): S - 2 - ==r WS h 1 r0 j h W. (1) tional layer with 1% drop in classification accuracy. The - Srd-2 j jrd-1 W work in [37] suggested using different tensor decomposition Sr WTd-1 rd-2 g r1 r0 X schemes, reporting a 45.# speedup with 1% drop in accuracy - - - IEEE SIgnal ProcESSIng MagazInE | January 2018 | 129 case. For the scheme in [39], the decom- - position always exists and can achieve - better performance than general CP. - Table 2 lists a performance comparison - of both methods. The actual speedup - and compression rates are used to mea- - sure the performances. We can see that - the BN version can achieve slightly bet- - ter performance while the CP version - gives higher compression rates. Original Framework Low-Rank Note that the fully connected layers Factorization Framework can be viewed as a 2-D matrix and thus - (a) (b) the aforementioned methods can also - be applied there. There are several clas- - sical works on exploiting low-rankness Figure 2. A typical framework of the low-rank regularization method. (a) is theoriginal convolutional - layer, and (b) is the low-rank constraint convolutional layer with rank-K. in fully connected layers. For instance, - Misha et al. [40] reduced the number - of dynamic parameters in deep models - in text recognition. In both works, the approximation was using the low-rank method. Reference [41] explored a low-rank - done layer by layer. After one layer was approximated by matrix factorization of the final weight layer in a DNN for - the low-rank filters, the parameters of that layer were fixed, acoustic modeling. - and the layers above were fine-tuned based on a reconstruc- - tion error criterion. These are typical low-rank methods for Drawbacks - compressing 2-D convolutional layers, which is described in Low-rank approaches are straightforward for model compres- - Figure 2. In [38], canonical polyadic (CP) decomposition of sion and acceleration. The idea complements recent advances - the kernel tensors was proposed. Their work used nonlinear in deep learning such as dropout, rectified units, and maxout. - least squares to compute the CP decomposition, which was However, the implementation is not that easy since it involves - also based on the tensor decomposition idea. In [39], a new a decomposition operation, which is computationally expen- - algorithm for computing the low-rank tensor decomposition sive. Another issue is that current methods perform low-rank - and a new method for training low-rank constrained CNNs approximation layer by layer, and thus cannot perform global - from scratch were proposed. It used batch normalization (BN) parameter compression, which is important as different lay- - to transform the activations of the internal hidden units, and it ers hold different information. Finally, factorization requires - was shown to be an effective way to deal with the exploding extensive model retraining to achieve convergence when com- - or vanishing gradients. pared to the original model. 
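For the fully connected case mentioned above, the low-rank idea can be illustrated with a plain truncated SVD: the m×n weight matrix is replaced by two rank-K factors, cutting storage and multiply cost from mn to K(m+n). This is only a sketch of the 2-D case, not the CP or BN decompositions used for convolution kernels in [38] and [39]; the layer sizes, the rank K, and the random example matrix are assumptions made for illustration.

# Sketch of the low-rank idea for a fully connected layer: an m x n weight matrix
# is replaced by two factors of rank K, so storage and multiply cost drop from
# m*n to K*(m + n). This is plain truncated SVD, shown only to illustrate the
# 2-D case discussed above; the rank K and layer sizes are arbitrary.
import numpy as np

def low_rank_factorize(W, rank):
    """Return (A, B) with A @ B approximating W; A: (m, rank), B: (rank, n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

m, n, K = 1024, 512, 32
W = np.random.randn(m, n).astype(np.float32)    # a random matrix is not truly low rank,
A, B = low_rank_factorize(W, K)                  # so the error below is only indicative
x = np.random.randn(n).astype(np.float32)
y_full = W @ x
y_low = A @ (B @ x)                              # two thin matrix-vector products
print("params:", m * n, "->", K * (m + n))
print("relative error:", np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))

Real weight matrices tend to be far closer to low rank than a random Gaussian matrix, which is why the cited works can recover the original accuracy after fine-tuning the factors.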
- In principle, both the CP decomposition scheme and the - decomposition scheme in [39] (BN low-rank) can be used to Transferred/compact convolutional filters - train CNNs from scratch. For the CP decomposition, finding CNNs are parameter-efficient due to exploring the transla- - the best low-rank approximation is an ill-posed problem, and tion invariant property of the representations to input image, - the best rank-K approximation may not exist in the general which is the key to the success of training very deep models - without severe overfitting. Although a strong theory is cur- - rently missing, a large amount of empirical evidence sup- - ports the notion that both the translation invariant property Table 2. Comparisons between the low-rank models and their baselines - on ILSVRC-2012. and convolutional weight-sharing are important for good - predictive performance. The idea of using transferred con-Model TOP-5 Accuracy Speedup Compression Rate volutional filters to compress CNN models is motivated by - AlexNet 80.03% 1 1 recent works in [42], which introduced the equivariant group - BN low-rank 80.56% 1.09 4.94 theory. Let x be an input, U()$ be a network or layer, and - T()$ be the transform matrix. The concept of equivariance CP low-rank 79.66% 1.82 5 is defined as VGG-16 90.60% 1 1 - BN low-rank 90.47% 1.53 2.72 TTlUU ^^ xx hh = , (3) - CP low-rank 90.31% 2.05 2.75 - GoogleNet 92.21% 1 1 which says that transforming the input x by the transform - T()$ and then passing it through the network or layer U(·) BN low-rank 91.88% 1.08 2.79 should give the same result as first mapping x through the CP low-rank 91.79% 1.20 2.84 network and then transforming the representation. Note that, - - - 130 IEEE SIgnal ProcESSIng MagazInE | January 2018 | in [42], the transforms T()$ and Tl()$ are not necessarily where Tx(·,,y) denoted the translation of the first oper- - the same as they operate on different objects. According to and by (,xy) along its spatial dimensions, with proper zero - this theory, it is reasonable to apply the transform to layers padding at borders to maintain the shape. The proposed - or filters U()$ to compress the whole network models. From framework can be used to 1) improve the classification accu- - empirical observation, deep CNNs also benefit from using a racy as a regularized version of maxout networks and 2) - large set of convolutional filters by applying a certain trans- to achieve parameter efficiency by flexibly varying their - form T()$ to a small set of base filters since it acts as a regu- architectures to compress networks. - larizer for the model. Table 3 briefly compares the performance of different - Following this trend, there are many recent works proposed methods with transferred convolutional filters, using VGG- - to build a convolutional layer from a set of base filters [42]– Net (16 layers) as the baseline model. The results are report- - [45]. What they have in common is that the transform T()$ ed on the CIFAR-10 and CIFAR-100 data sets with top-five - lies in the family of functions that only operate in the spatial error rates. It is observed that they can achieve reduction in - domain of the convolutional filters. For parameters with little or no drop in clas- - example, the work in [44] found that the sification accuracy. 
- lower convolution layers of CNNs learned The basic idea of KD is to - redundant filters to extract both positive and distill knowledge from a Drawbacks - negative phase information of an input sig- large teacher model into There are several issues that need to be - nal, and defined T()$ to be the simple nega- a small one by learning addressed for approaches that apply transfer - tion function the class distributions information to convolutional filters. First, - output by the teacher these methods can achieve competitive per- - T^h WW x = -x . (4) formance for wide/flat architectures (like via softened softmax. VGGNet) but not narrow/special ones (like - Here, Wx is the basis convolutional filter GoogleNet and ResNet). Second, the trans- - and W-x is the filter consisting of the shifts whose activation is fer assumptions sometimes are too strong to guide the algo- - opposite to that of Wx and selected after max-pooling opera- rithm, making the results unstable on some data sets. - tion. By doing this, the work in [44] can easily achieve 2# com- Using a compact filter for convolution can directly reduce - pression rate on all the convolutional layers. It is also shown that the computation cost. The key idea is to replace the loose and - the negation transform acts as a strong regularizer to improve overparametric filters with compact blocks to improve the - the classification accuracy. The intuition is that the learning speed, which significantly accelerate CNNs on several bench- - algorithm with pair-wise positive-negative constraint can lead marks. Decomposing 33# convolution into two 11# con- - to useful convolutional filters instead of redundant ones. volutions was used in [47], which achieved state-of-the-art - In [45], it was observed that magnitudes of the responses acceleration performance on object recognition. SqueezeNet - from convolutional kernels had a wide diversity of pattern rep- [48] was proposed to replace 33# convolution with 11# - resentations in the network, and it was not proper to discard convolution, which created a compact neural network with - weaker signals with a single threshold. Thus, a multibias non- approximately 50 fewer parameters and comparable accuracy - linearity activation function was proposed to generate more when compared to AlexNet. - patterns in the feature space at low computational cost. The - transform T()$ was define as KD - To the best of our knowledge, exploiting knowledge transfer to - TlU^h xW=+ x d , (5) compress model was first proposed by Caruana et al. [49]. They - trained a compressed model with pseudo-data labeled by an - where d were the multibias factors. The work in [46] consid- ensemble of strong classifiers and reproduced the output of the - ered a combination of rotation by a multiple of 90° and hori- original larger network. However, their work is limited to shal- - zontal/vertical flipping with low models. The idea has been recently adopted in [50] as KD - to compress deep and wide networks into shallower ones, where - TlU^h xW= Ti , (6) - Table 3. Comparisons of different approaches based on transferred where WTi was the transformation matrix that rotated the orig- convolutional filters on CIFAR-10 and CIFAR-100. - inal filters with angle i !{90,,}180270. In [42], the transform Model CIFAR-100 CIFAR-10 Compression Rate was generalized to any angle learned from data, and i was - directly obtained from data. Both [46] and [42] can achieve VGG-16 34.26% 9.85% 1 - good classification performance. 
MBA [45] 33.66% 9.76% 2 - Reference [43] defined T()$ as the set of translation func- CRELU [44] 34.57% 9.92% 2 - tions applied to 2-D filters CIRC [42] 35.15% 10.23% 4 - T lU^^ xhh =Tx·,,y , (7) DCNN [43] 33.57% 9.65% 1.62 xy,,!" -kkf,, ,^ xy,( h !00,) - - - IEEE SIgnal ProcESSIng MagazInE | January 2018 | 131 the compressed model mimicked the function learned by the Other types of approaches - complex model. The basic idea of KD is to distill knowledge We first summarize the works utilizing attention-based - from a large teacher model into a small one by learning the methods. Note that attention-based systems [57] can reduce - class distributions output by the teacher via softened softmax. computations significantly by learning to selectively focus or - The work in [51] introduced a KD compression framework, “attend to” a few, task-relevant input regions. The work in [57] - which eased the training of deep networks by following a student- introduced the dynamic capacity network that combined two - teacher paradigm, in which the student was penalized according types of modules: the small subnetworks with low capacity, and - to a softened version of the teacher’s output. The framework the large ones with high capacity. The low-capacity subnetworks - compressed an ensemble of deep networks (teacher) into a stu- were active on the whole input to first find the task-relevant areas - dent network of similar depth. To do so, the student was trained in the input, and then the attention mechanism was used to di- - to predict the output of the teacher, as well as the true classifica- rect the high-capacity subnetworks to focus on the task-relevant - tion labels. Despite its simplicity, KD demonstrates promising regions in the input. By doing this, the size of the CNN model - results in various image classification tasks. The work in [52] could be significantly reduced. - aimed to address the network compression Following this direction, the work in - problem by taking advantage of depth neural The standard criteria [58] introduced the conditional computation - networks. It proposed an approach to train to measure the quality idea, which only computes the gradient for - thin and deep networks, called FitNets, to of model compression some important neurons. It proposed a new - compress wide and shallower (but still deep) and acceleration are the type of general-purpose neural network com- - networks. The method was rooted in KD and ponent: a sparsely gated mixture-of-experts - extended the idea to allow for thinner and compression and the (MoE) layer. The MoE consisted of a number - deeper student models. To learn from the speedup rates. of experts, each a simple feed-forward neural - intermediate representations of the teacher network, and a trainable gating network that - network, FitNet made the student mimic the full feature maps of selected a sparse combination of the experts to process each input. - the teacher. However, such assumptions are too strict since the In [59], dynamic DNNs (D2NNs) were introduced, which were a - capacities of teacher and student may differ greatly. In certain type of feed-forward DNN that selected and executed a subset of - circumstances, FitNet may adversely affect the performance and D2NN neurons based on the input. - convergence. 
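The softened-softmax objective used in the KD framework can be sketched roughly as below: the student minimizes a weighted sum of the usual hard-label cross-entropy and a cross-entropy against the teacher's temperature-scaled class distribution. The temperature, the mixing weight, the T^2 scaling of the soft term, and the random example logits are illustrative assumptions rather than settings taken from the cited papers.

# A sketch of the distillation objective: the student is trained on a weighted sum
# of (i) cross-entropy with the true labels and (ii) a softened cross-entropy
# against the teacher's temperature-scaled class distribution.
# The temperature T and mixing weight alpha are illustrative choices.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)      # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target loss (scaled by T^2, a common convention) plus hard-label loss."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean() * (T * T)
    p_hard = softmax(student_logits, 1.0)
    hard_loss = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

teacher_logits = np.random.randn(8, 10)   # stand-ins for a batch of 8 examples, 10 classes
student_logits = np.random.randn(8, 10)
labels = np.random.randint(0, 10, size=8)
print("distillation loss:", distillation_loss(student_logits, teacher_logits, labels))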
All the aforementioned methods are validated on There have been other attempts to reduce the number of - the MNIST, CIFAR-10, CIFAR-100, SVHN, and AFLW bench- parameters of neural networks by replacing the fully con- - mark data sets, and simulation results show that these methods nected layer with global average pooling [43], [60]. Network - match or outperform the teacher’s performance, while requiring architectures, such as GoogleNet or network in network, - notably fewer parameters and multiplications. can achieve state-of-the-art results on several benchmarks - There are several extensions along this direction of distilla- by adopting this idea. However, transfer learning, i.e., reus- - tion knowledge. The work in [53] trained a parametric student ing features learned on the ImageNet data set and applying - model to approximate a Monte Carlo teacher. The proposed them to new tasks, is more difficult with this approach. This - framework used online training and used DNNs for the student problem was noted by Szegedy et al. [60] and motivated - model. Different from previous works, which represented the them to add a linear layer on top of their networks to enable - knowledge using the softened label probabilities, [54] repre- transfer learning. - sented the knowledge by using the neurons in the higher hidden The work in [61] targeted the ResNet-based model with a - layer, which preserved as much information as the label prob- spatially varying computation time, called stochastic depth, - abilities, but are more compact. The work in [55] accelerated which enabled the seemingly contradictory setup to train short - the experimentation process by instantaneously transferring networks and used deep networks at test time. It started with - the knowledge from a previous network to each new deeper very deep networks and, while during training, for each mini- - or wider network. The techniques are based on the concept batch, randomly dropped a subset of layers and bypassed them - of function-preserving transformations between neural net- with the identity function. This model is end-to-end trainable, - work specifications. Zagoruyko et al. [56] proposed attention deterministic, and can be viewed as a black-box feature extrac- - transfer to relax the assumption of FitNet. They transferred the tor. Following this direction, the work in [62] proposed a pyra- - attention maps that are summaries of the full activations. midal residual network with stochastic depth. - Other approaches to reduce the convolutional overheads - Drawbacks include using FFT-based convolutions [63] and fast convolution - KD-based approaches can make deeper models thinner and using the Winograd algorithm [64]. Those works only aim to - help significantly reduce the computational cost. However, speedup the computation but not reduce the memory storage. - there are a few disadvantages. One of them is that KD can only - be applied to classification tasks with softmax loss function, Benchmarks, evaluation, and databases - which hinders its usage. Another drawback is that the model In the past five years, the deep-learning community has made - assumptions sometimes are too strict to make the performance great efforts in benchmark models. One of the most well- - competitive with other types of approaches. 
known models used in compression and acceleration for CNNs is Alexnet [1], which occasionally has been used for assessing the performance of compression. Other popular standard models include LeNets [65], All-CNN-nets [66], and many others. LeNet-300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each. LeNet-5 is a convolutional network that has two convolutional layers and two fully connected layers. Recently, more state-of-the-art architectures are used as baseline models in many works, including network in networks [67], VGGNets [68], and ResNets [69]. Table 4 summarizes the baseline models commonly used in several typical compression methods.
The standard criteria to measure the quality of model compression and acceleration are the compression and the speedup rates. Assume that a is the number of the parameters in the original model M and a* is that of the compressed model M*; then the compression rate α(M, M*) of M* over M is

α(M, M*) = a / a*.   (8)

Another widely used measurement is the index space saving, defined in several papers [70], [71] as

β(M, M*) = (a − a*) / a*,   (9)

where a and a* are the number of the dimension of the index space in the original model and that of the compressed model, respectively. Similarly, given the running time s of M and s* of M*, the speedup rate δ(M, M*) is defined as

δ(M, M*) = s / s*.   (10)

We next discuss how to choose different compression approaches and possible challenges/solutions in this area.

General suggestions
There is no golden rule to measure which one of the four kinds of approaches is the best. How to choose the proper approaches is really dependent on the applications and requirements. Here, we provide some general suggestions.
■ If the application needs compacted models from pretrained models, one can choose either pruning and sharing or low-rank factorization-based methods. If end-to-end solutions are needed for the problem, the low-rank and transferred convolutional filters approaches are preferred.
■ For applications in some specific domains, methods with human prior (like the transferred convolutional filters and structural matrix) sometimes have benefits. For example, when conducting medical images classification, transferred convolutional filters should work well as medical images (like organs) do have the rotation transformation property.
■ Usually, the approaches of pruning and sharing could give a reasonable compression rate while not hurting the accuracy. Thus, for applications that require stable model accuracy, it is better to utilize pruning and sharing.
■ If a problem involves small- or medium-size data sets, one can try the KD approaches. The compressed student model can take the benefit of transferring knowledge from the teacher model, making it robust on data sets that are not large.
■ As we mentioned in the "Introduction," techniques of the four themes are orthogonal. It makes sense to combine two or three of them to maximize the compression/speedup rates.
For some specific applications, like object detection, * which requires both convolutional and fully connected lay- - Most work used the average training time per epoch to mea- ers, one can compress the convolutional layers with low- - sure the running time, while in [70] and [71], the average rank factorization and the fully connected layers with a - testing time was used. Generally, the compression rate and pruning method. - speedup rate are highly correlated, as smaller models often - results in faster computation for both the training and the - testing stages. - Good compression methods are expected to achieve almost Table 4. A summary of baseline models used in - the same performance as the original model with much smaller different representative works of network compression. - parameters and less computational time. However, for differ- Baseline Models Representative Works - ent applications with varying CNN designs, the correlation Alexnet [1] Structural matrix [30]–[32] between parameter size and computational time may be dif- - ferent. For example, it is observed that, for deep CNNs with Low-rank factorization [39] - fully connected layers, most of the parameters are in the fully Network in network [67] Low-rank factorization [39] - connected layers; while for image classification tasks, float- VGGNets [68] Transferred filters [43] - point operations are mainly in the first few convolutional lay- Low-rank factorization [39] ers since each filter is convolved with the whole image, which ResNets [69] Compact filters [48], stochastic depth [61] is usually very large at the beginning. Different applications - should focus on different layers. Parameter sharing [25] - All-CNN-nets [66] Transferred filters [44] - Discussion and challenges LeNets [65] Parameter sharing [25] - In this article, we summarized recent works on compress- Parameter pruning [21], [23] ing and accelerating DNNs. Here we discuss more details - - - IEEE SIgnal ProcESSIng MagazInE | January 2018 | 133 Technique challenges good compression approaches. Instead of directly reducing - Techniques for deep model compression methods are expected and transferring parameters from the teach- - and acceleration are still in the early stages, to achieve almost the er models, passing selectivity knowledge of - and the following challenges still need to same performance as the neurons could be helpful. One can derive - be addressed. a way to select essential neurons related to original model with much ■ Most of the current state-of-the-art ap - the task. The intuition is that, if a neuron - proaches are built on well-designed smaller parameters and is activated in certain regions or samples, - CNN models, which have limited free- less computational time. this implies these regions or samples share - dom to change the configuration (e.g., some common properties that may relate - network structural, hyperparameters). to the task. Performing such steps is time- - To handle more complicated tasks, it should provide more consuming, thus efficient implementation is important. - plausible ways to configure the compressed models. For methods with convolutional filters and the structural - ■ Pruning is an effective way to compress and accelerate matrix, we can conclude that the transformation lies in the - CNNs. Current pruning techniques are mostly designed to family of functions that only operations on the spatial dimen- - eliminate connections between neurons. On the other hand, sions. 
Hence, to address the imposed prior issue, one solution - a pruning channel can directly reduce the feature map is to provide a generalization of the aforementioned approach- - width and shrink the model into a thinner one. It is efficient es in two aspects: 1) instead of limiting the transformation - but also challenging because removing channels might dra- to belong to a set of predefined transformations, let it be the - matically change the input of the following layer. It is whole family of spatial transformations applied to 2-D filters - important to focus on how to address this issue. or the matrix, and 2) learn the transformation jointly with all - ■ As we mentioned previously, methods of structural matrix of the model parameters. - and transferred convolutional filters impose prior human Proposing some general/unified approaches is one direction - knowledge to the model, which could significantly affect that can be taken regarding the use of CNNs in small platforms. - the performance and stability. It is critical to investigate Yuhen et al. [75] presented a feature map dimensionality reduc- - how to control the impact of the imposed prior knowledge. tion method by excavating and removing redundancy in feature - ■ The methods of KD provide many benefits such as directly maps generated by different filters, which could also preserve - accelerating the model without special hardware or imple- intrinsic information of the original network. The idea can be - mentations. It is still worth it to develop KD-based extended to make CNNs more applicable for different platforms. - approaches and explore how to improve the performance. The work in [76] proposed a one-shot whole network compres- - ■ Hardware constraints in various of small platforms (e.g., sion scheme consisting of three components: rank selection, low- - mobile, robotic, self-driving cars) are still a major problem rank tensor decomposition, and fine-tuning to make deep CNNs - that hinder the extension of deep CNNs. How to make full work in mobile devices. From the systematic side, Facebook - use of the limited computational source available and how released the platform Caffe2 [77], which employed a particularly - to design special compression methods for such platforms lightweight and modular framework and included mobile-specif- - are still challenges that need to be addressed. ic optimizations based on the hardware design. Caffe2 can help - developers and researchers train large machine-learning models - Possible solutions and deliver AI on mobile devices. - To solve the hyperparameters configuration problem, we can - rely on the recent learning-to-learn strategy [72], [73]. This Acknowledgments - framework provides a mechanism, allowing the algorithm to We would like to thank the reviewers and broader community - automatically learn how to exploit structure in the problem of for their feedback on this survey. In particular, we would like - interest. There are two different ways to combine the learning- to thank Hong Zhao from the Department of Automation of - to-learn module with the model compression. The first designs Tsinghua University for her help on modifying this article. - compression and learning-to-learn simultaneously, while the This research is supported by National Science Foundation of - second way first configures the model with learn-to-learning China, grant number 61401169. The corresponding author of - and then prunes the parameters. this article is Pan Zhou. 
Acknowledgments
We would like to thank the reviewers and broader community for their feedback on this survey. In particular, we would like to thank Hong Zhao from the Department of Automation of Tsinghua University for her help on modifying this article. This research is supported by National Science Foundation of China, grant number 61401169. The corresponding author of this article is Pan Zhou.

Authors
Yu Cheng (chengyu@us.ibm.com) received his bachelor's degree in automation from Tsinghua University, Beijing, China, in 2010 and his Ph.D. degree in computer science from Northwestern University, Evanston, Illinois, in 2015. Currently, he is a research staff member at the AI Foundations Lab, IBM T.J. Watson Research Center, Yorktown Heights, New York. His research is focused on deep learning in general, with specific interests in deep generative models and deep model compression. He also has published many works regarding the applications of deep learning in computer vision and natural language processing.
Duo Wang (d-wang15@mails.tsinghua.edu.cn) received the B.S. degree in automation from the Harbin Institute of Technology, China, in 2015. He is currently pursuing his Ph.D. degree in the Department of Automation, Tsinghua University. His research interests are deep/machine learning and their applications in computer vision and robotics vision.
Pan Zhou (panzhou@hust.edu.cn) received his B.S. degree in the Advanced Class of Huazhong University of Science and Technology (HUST), Wuhan, China, and his M.S. degree in electronics and information engineering from the same university, in 2006 and 2008, respectively. He received his Ph.D. degree from the School of Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta, in 2011. Currently, he is an associate professor with the School of Electronic Information and Communications, HUST. His research interests include big data analytics and machine learning, security and privacy, and information networks.
Tao Zhang (taozhang@mail.tsinghua.edu.cn) received his B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, in 1993, 1995, and 1999, respectively, and his Ph.D. degree from Saga University, Japan, in 2002, all in control engineering. He is a professor with the Department of Automation, Tsinghua University. His current research interests include artificial intelligence, robotics, image processing, control theory, and control of spacecraft.

References
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2014, pp. 1701–1708.
[3] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015, pp. 2892–2900.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1223–1231.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Computing Res. Repository, vol. abs/1512.03385, 2015. [Online]. Available: https://arxiv.org/pdf/1512.03385.pdf
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, "Compressing deep convolutional networks using vector quantization," Computing Res. Repository, vol. abs/1412.6115, 2014. [Online]. Available: https://arxiv.org/pdf/1412.6115.pdf
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4820–4828.
[8] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Conf. Neural Information Processing Systems Deep Learning and Unsupervised Feature Learning Workshop, 2011.
[9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. 32nd Int. Conf. Machine Learning, 2015, vol. 37, pp. 1737–1746.
[10] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. Int. Conf. Learning Representations, 2016.
[11] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," Computing Res. Repository, vol. abs/1612.01543, 2016. [Online]. Available: https://arxiv.org/abs/1612.01543
[12] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv Preprint, arXiv:1612.01064, 2016.
[13] M. Courbariaux, Y. Bengio, and J. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Proc. Advances Neural Information Processing Systems Annu. Conf., 2015, pp. 3123–3131.
[14] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or −1," Computing Res. Repository, vol. abs/1602.02830, 2016. [Online]. Available: https://arxiv.org/abs/1602.02830
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in Proc. European Conf. Computer Vision, 2016, pp. 525–542.
[16] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," Computing Res. Repository, vol. abs/1606.01981, 2016. [Online]. Available: https://arxiv.org/abs/1606.01981
[17] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," Computing Res. Repository, vol. abs/1611.01600, 2016. [Online]. Available: https://arxiv.org/abs/1611.01600
[18] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," Computing Res. Repository, vol. abs/1510.03009, 2015. [Online]. Available: https://arxiv.org/abs/1510.03009
[19] S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," Adv. Neural Inform. Process. Syst. 1, 1989, pp. 177–185.
[20] Y. L. Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 598–605.
[21] B. Hassibi, D. G. Stork, and S. C. R. Com, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, vol. 5. San Mateo, CA: Morgan Kaufmann, 1993, pp. 164–171.
[22] S. Srinivas and R. V. Babu, "Data-free parameter pruning for deep neural networks," in Proc. British Machine Vision Conf., 2015, pp. 31.1–31.12.
[23] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proc. 28th Int. Conf. Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in Proc. Machine Learning Research Workshop Conf., 2015, pp. 2285–2294.
[25] K. Ullrich, E. Meeds, and M. Welling, "Soft weight-sharing for neural network compression," Computing Res. Repository, vol. abs/1702.04008, 2017. [Online]. Available: https://arxiv.org/abs/1702.04008
[26] V. Lebedev and V. S. Lempitsky, "Fast convnets using group-wise brain damage," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 2554–2564.
[27] H. Zhou, J. M. Alvarez, and F. Porikli, "Less is more: Towards compact CNNs," in Proc. European Conf. Computer Vision, 2016, pp. 662–677.
[28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," Adv. Neural Inform. Process. Syst., vol. 29, pp. 2074–2082, 2016.
[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," Computing Res. Repository, vol. abs/1608.08710, 2016. [Online]. Available: https://arxiv.org/abs/1608.08710
[30] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. Int. Conf. Computer Vision, 2015, pp. 2857–2865.
[31] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, "Deep fried convnets," in Proc. Int. Conf. Computer Vision, 2015, pp. 1476–1483.
[32] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," Advances in Neural Information Processing Systems, vol. 28, pp. 3088–3096, 2015. [Online]. Available: http://papers.nips.cc/paper/5869-structured-transforms-for-small-footprint-deep-learning.pdf
[33] J. Chun and T. Kailath, Generalized Displacement Structure for Block-Toeplitz, Toeplitz-Block, and Toeplitz-Derived Matrices. Berlin, Germany: Springer, 1991, pp. 215–236.
[34] M. V. Rakhuba and I. V. Oseledets, "Fast multidimensional convolution in low-rank tensor formats via cross approximation," SIAM J. Sci. Comput., vol. 37, no. 2, 2015. [Online]. Available: http://dx.doi.org/10.1137/140958529
[35] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, "Learning separable filters," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2013, pp. 2754–2761.
[36] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," Adv. Neural Inform. Process. Syst., vol. 27, pp. 1269–1277, 2014.
[37] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proc. British Machine Vision Conf., 2014, pp. 1–13.
[38] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," Computing Res. Repository, vol. abs/1412.6553, 2014. [Online]. Available: https://arxiv.org/abs/1412.6553
[39] C. Tai, T. Xiao, X. Wang, and E. Weinan, "Convolutional neural networks with low-rank regularization," Computing Res. Repository, vol. abs/1511.06067, 2015.
[40] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. D. Freitas, "Predicting parameters in deep learning," Advances in Neural Information Processing Systems, vol. 26, pp. 2148–2156, 2013. [Online]. Available: http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1053.pdf
[41] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, 2013, pp. 6655–6659.
[42] T. S. Cohen and M. Welling, "Group equivariant convolutional networks," arXiv Preprint, arXiv:1602.07576, 2016.
[43] S. Zhai, Y. Cheng, and Z. M. Zhang, "Doubly convolutional neural networks," in Proc. Advances Neural Information Processing Systems, 2016, pp. 1082–1090.
[44] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv Preprint, arXiv:1603.05201, 2016.
[45] H. Li, W. Ouyang, and X. Wang, "Multi-bias non-linear activation in deep neural networks," arXiv Preprint, arXiv:1604.00676, 2016.
[46] S. Dieleman, J. De Fauw, and K. Kavukcuoglu, "Exploiting cyclic symmetry in convolutional neural networks," in Proc. 33rd Int. Conf. Machine Learning, 2016, vol. 48, pp. 1889–1898.
[47] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," Computing Res. Repository, vol. abs/1602.07261, 2016. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1602.html#SzegedyIV16
[48] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," Computing Res. Repository, vol. abs/1612.01051, 2016. [Online]. Available: https://arxiv.org/abs/1612.01051
[49] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, 2006, pp. 535–541. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150464
[50] J. Ba and R. Caruana, "Do deep nets really need to be deep?" Adv. Neural Inform. Process. Syst., vol. 27, pp. 2654–2662, 2014.
[51] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Computing Res. Repository, vol. abs/1503.02531, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," Computing Res. Repository, vol. abs/1412.6550, 2014. [Online]. Available: https://arxiv.org/abs/1412.6550
[53] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling, "Bayesian dark knowledge," Advances in Neural Information Processing Systems, vol. 28, pp. 3420–3428, 2015. [Online]. Available: http://papers.nips.cc/paper/5965-bayesian-dark-knowledge.pdf
[54] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang, "Face model compression by distilling knowledge from neurons," in Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 3560–3566.
[55] T. Chen, I. J. Goodfellow, and J. Shlens, "Net2net: Accelerating learning via knowledge transfer," Computing Res. Repository, vol. abs/1511.05641, 2015. [Online]. Available: https://arxiv.org/abs/1511.05641
[56] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," Computing Res. Repository, vol. abs/1612.03928, 2016. [Online]. Available: http://arxiv.org/abs/1612.03928
[57] A. Almahairi, N. Ballas, T. Cooijmans, Y. Zheng, H. Larochelle, and A. C. Courville, "Dynamic capacity networks," in Proc. 33rd Int. Conf. Machine Learning, 2016, pp. 2549–2558.
[58] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," 2017. [Online]. Available: https://openreview.net/pdf?id=B1ckMDqlg
[59] D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, and J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583–1597, 2016.
[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
[61] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," Computing Res. Repository, vol. arXiv:1603.09382, 2016.
[62] Y. Yamada, M. Iwamura, and K. Kise, "Deep pyramidal residual networks with separated stochastic depth," Computing Res. Repository, vol. abs/1612.01230, 2016. [Online]. Available: http://arxiv.org/abs/1612.01230
[63] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," Computing Res. Repository, vol. arXiv:1312.5851, 2014.
[64] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Computer Vision Pattern Recognition, 2016, pp. 4013–4021.
[65] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, pp. 2278–2324, 1998.
[66] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, "Striving for simplicity: The all convolutional net," Computing Res. Repository, vol. abs/1412.6806, 2014. [Online]. Available: https://arxiv.org/abs/1412.6806
[67] M. Lin, Q. Chen, and S. Yan, "Network in network," in Proc. Int. Conf. Learning Representations, 2014. [Online]. Available: https://arxiv.org/abs/1312.4400
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Computing Res. Repository, vol. abs/1409.1556, 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[69] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv Preprint, arXiv:1512.03385, 2015.
[70] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. N. Choudhary, and S. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2857–2865.
[71] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas, "ACDC: A structured efficient linear layer," in Proc. Int. Conf. Learning Representations, 2016.
[72] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Proc. Neural Information Processing Systems Conf., 2016, pp. 3981–3989.
[73] D. Ha, A. Dai, and Q. Le, "Hypernetworks," in Proc. Int. Conf. Learning Representations, 2016.
[74] J. M. Alvarez and M. Salzmann, "Learning the number of neurons in deep networks," in Proc. Neural Information Processing Systems Conf., 2016, pp. 2270–2278.
[75] Y. Wang, C. Xu, C. Xu, and D. Tao, "Beyond filters: Compact feature map for portable deep model," in Proc. 34th Int. Conf. Machine Learning, 2017, pp. 3703–3711.
[76] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," Computing Res. Repository, vol. abs/1511.06530, 2015. [Online]. Available: https://arxiv.org/abs/1511.06530
[77] Facebook, Inc., "Caffe2: A new lightweight, modular, and scalable deep learning framework," 2016. [Online]. Available: https://caffe2.ai/
\ No newline at end of file
diff --git a/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt b/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt
deleted file mode 100644
index 47f9152..0000000
--- a/Corpus/Movement Pruning Adaptive Sparsity by Fine-Tuning.txt
+++ /dev/null
@@ -1,662 +0,0 @@
Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu

arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in applications in natural language processing and related fields.
In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training them has high environmental costs [Strubell et al., 2019].
Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost in accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective method for compressing models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016] and in language processing [Gale et al., 2019], and has more recently been leveraged as a core component of the lottery ticket hypothesis [Frankle et al., 2019].
While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."
In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that both weights with low and high values can be pruned if they shrink during training. This strategy moves the selection criterion from the 0th to the 1st order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing greater ability to adapt to the end-task.

2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to convolutional networks.
Differing from our method, these approaches keep the weights of the model fixed (either from a randomly initialized network or a pretrained network) and the scores are updated to find a good sparse subnetwork.
Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained simply as a by-product of the standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the square value of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.
Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch. This differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates at little or no cost in performance. As shown in previous works [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.

3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let $W \in \mathbb{R}^{n \times n}$ refer to a generic weight matrix in the model (we consider square matrices, but they could be of any shape). To determine which weights are pruned, we introduce a parallel matrix of associated importance scores $S \in \mathbb{R}^{n \times n}$. Given importance scores, each pruning strategy computes a mask $M \in \{0, 1\}^{n \times n}$. Inference for an input $x$ becomes $a = (W \odot M)x$, where $\odot$ is the Hadamard product. A common strategy is to keep the top-$v$ percent of weights by importance. We define $\mathrm{Top}_v$ as a function which selects the $v\%$ highest values in $S$:

$$\mathrm{Top}_v(S)_{i,j} = \begin{cases} 1, & \text{if } S_{i,j} \text{ is in the top } v\% \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

Magnitude-based weight pruning determines the mask based on the absolute value of each weight as a measure of importance. Formally, we have importance scores $S = \left(|W_{i,j}|\right)_{1 \le i,j \le n}$ and masks $M = \mathrm{Top}_v(S)$ (Eq (1)). There are several extensions to this base setup. Han et al. [2015] use iterative magnitude pruning: the model is first trained until convergence and weights with the lowest magnitudes are removed afterward. The sparsified model is then re-trained with the removed weights fixed to 0. This loop is repeated until the desired sparsity level is reached.
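As a concrete illustration of the score-based masking just described (Eq (1) with magnitude scores $S = |W|$), the following is a short, editor-added PyTorch sketch; the helper names and the 10% keep ratio in the example are illustrative assumptions, not code from the paper.

    import torch

    def top_v_mask(scores: torch.Tensor, v: float) -> torch.Tensor:
        # Eq (1): binary mask that keeps the v fraction of highest-scoring entries of S.
        k = max(1, int(v * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).to(scores.dtype)

    def magnitude_prune(weight: torch.Tensor, v: float) -> torch.Tensor:
        # Magnitude pruning: importance scores are |W|; inference uses W * M (Hadamard product).
        mask = top_v_mask(weight.abs(), v)
        return weight * mask

    W = torch.randn(256, 256)
    W_pruned = magnitude_prune(W, v=0.10)
    print((W_pruned != 0).float().mean())   # approximately 0.10 of the weights remain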
Table 1: Summary of the pruning methods considered in this work and their specificities. The expression of $f$ for $L_0$ regularization is detailed in Eq (3). Columns: Magnitude pruning | $L_0$ regularization | Movement pruning | Soft movement pruning.
Pruning Decision: 0th order | 1st order | 1st order | 1st order
Masking Function: $\mathrm{Top}_v$ | Continuous Hard-Concrete | $\mathrm{Top}_v$ | Thresholding
Pruning Structure: Local or Global | Global | Local or Global | Global
Learning Objective: $L$ | $L + \lambda_{l0} E(L_0)$ | $L$ | $L + \lambda_{mvp} R(S)$
Gradient Form: (not applicable) | Gumbel-Softmax | Straight-Through | Straight-Through
Scores $S$: $|W_{i,j}|$ | $\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)} f(S_{i,j}^{(t)})$ | $\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)}$ | $\sum_t (\partial L / \partial W_{i,j})^{(t)} W_{i,j}^{(t)}$

In this study, we focus on automated gradual pruning [Zhu and Gupta, 2018]. It supplements magnitude pruning by allowing masked weights to be updated such that they are not fixed for the entire duration of the training. Automated gradual pruning enables the model to recover from previous masking choices [Guo et al., 2016]. In addition, one can gradually increase the sparsity level $v$ during training using a cubic sparsity scheduler: $v^{(t)} = v_f + (v_i - v_f)\left(1 - \frac{t - t_i}{n \Delta t}\right)^3$. The sparsity level at time step $t$, $v^{(t)}$, is increased from an initial value $v_i$ (usually 0) to a final value $v_f$ in $n$ pruning steps after $t_i$ steps of warm-up. The model is thus pruned and trained jointly.

4 Movement Pruning

Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running model. In this work, we focus on movement pruning methods where importance is derived from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process. We consider two versions of movement pruning: hard and soft.
For (hard) movement pruning, masks are computed using the $\mathrm{Top}_v$ function: $M = \mathrm{Top}_v(S)$. Unlike magnitude pruning, during training we learn both the weights $W$ and their importance scores $S$. During the forward pass, we compute for all $i$: $a_i = \sum_{k=1}^{n} W_{i,k} M_{i,k} x_k$.
Since the gradient of $\mathrm{Top}_v$ is 0 everywhere it is defined, we follow Ramanujan et al. [2020], Mallya and Lazebnik [2018] and approximate its value with the straight-through estimator [Bengio et al., 2013]. In the backward pass, $\mathrm{Top}_v$ is ignored and the gradient goes "straight-through" to $S$. The approximation of the gradient of the loss $L$ with respect to $S_{i,j}$ is given by

$$\frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j \qquad (2)$$

This implies that the scores of weights are updated, even if these weights are masked in the forward pass. We prove in Appendix A.1 that movement pruning as an optimization problem will converge.
We also consider a relaxed (soft) version of movement pruning based on the binary mask function described by Mallya and Lazebnik [2018]. Here we replace the hyperparameter $v$ with a fixed global threshold value $\tau$ that controls the binary mask. The mask is calculated as $M = (S > \tau)$. In order to control the sparsity level, we add a regularization term $R(S) = \lambda_{mvp} \sum_{i,j} \sigma(S_{i,j})$, which encourages the importance scores to decrease over time.^1 The coefficient $\lambda_{mvp}$ controls the penalty intensity and thus the sparsity level.
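To make the hard movement pruning machinery above concrete, here is a minimal, editor-added PyTorch sketch of a $\mathrm{Top}_v$ mask with a straight-through backward pass, together with the cubic schedule of Zhu and Gupta [2018]; class and function names are illustrative assumptions, and this is not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    class TopVMask(torch.autograd.Function):
        # Forward: binary mask keeping the top-v fraction of scores (Eq (1)).
        # Backward: straight-through, the incoming gradient is passed to S unchanged (Eq (2)).
        @staticmethod
        def forward(ctx, scores, v):
            k = max(1, int(v * scores.numel()))
            threshold = torch.topk(scores.flatten(), k).values.min()
            return (scores >= threshold).to(scores.dtype)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None

    class MovementPrunedLinear(torch.nn.Module):
        # Computes a = (W * M) x with M = Top_v(S); both W and S are learned.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.empty(out_features, in_features))
            torch.nn.init.normal_(self.weight, std=0.02)
            # zero-initialized scores give a dense mask at the start of fine-tuning
            self.scores = torch.nn.Parameter(torch.zeros(out_features, in_features))

        def forward(self, x, keep_fraction):
            mask = TopVMask.apply(self.scores, keep_fraction)
            return F.linear(x, self.weight * mask)

    def cubic_keep_fraction(step, v_i, v_f, t_i, n, dt):
        # The cubic schedule described above, expressed here directly as the fraction of
        # weights kept: it ramps from v_i down to v_f over n pruning steps of length dt,
        # after t_i warm-up steps.
        progress = min(1.0, max(0.0, (step - t_i) / (n * dt)))
        return v_f + (v_i - v_f) * (1.0 - progress) ** 3

    # during fine-tuning, one would call, e.g.:
    #   layer = MovementPrunedLinear(768, 768)
    #   out = layer(x, cubic_keep_fraction(step, 1.0, 0.05, t_i=2000, n=10000, dt=1))
    # so the layer is gradually thinned while W and S are trained jointly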
Finally we note that these approaches yield a similar update as $L_0$ regularization based pruning, another movement based pruning approach [Louizos et al., 2017]. Instead of straight-through, $L_0$ uses the hard-concrete distribution, where the mask $M$ is sampled for all $i,j$ with hyperparameters $b > 0$, $l < 0$, and $r > 1$:

$$u \sim \mathcal{U}(0,1) \qquad \bar{S}_{i,j} = \sigma\big((\log(u) - \log(1-u) + S_{i,j})/b\big)$$
$$Z_{i,j} = (r - l)\,\bar{S}_{i,j} + l \qquad M_{i,j} = \min(1, \mathrm{ReLU}(Z_{i,j}))$$

The expected $L_0$ norm has a closed form involving the parameters of the hard-concrete: $E(L_0) = \sum_{i,j} \sigma\big(S_{i,j} - b \log(-l/r)\big)$. Thus, the weights and scores of the model can be optimized in an end-to-end fashion to minimize the sum of the training loss $L$ and the expected $L_0$ penalty. A coefficient $\lambda_{l0}$ controls the $L_0$ penalty and indirectly the sparsity level. Gradients take a similar form:

$$\frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial a_i} W_{i,j} x_j f(S_{i,j}) \quad \text{where} \quad f(S_{i,j}) = \frac{r - l}{b}\, \bar{S}_{i,j} (1 - \bar{S}_{i,j})\, \mathbf{1}_{\{0 \le Z_{i,j} \le 1\}} \qquad (3)$$

At test time, a non-stochastic estimation of the mask is used: $\hat{M} = \min\big(1, \mathrm{ReLU}\big((r - l)\sigma(S) + l\big)\big)$, and weights multiplied by 0 can simply be discarded.
Table 1 highlights the characteristics of each pruning method. The main differences are in the masking functions, pruning structure, and the final gradient form.

^1 We also experimented with $\sum_{i,j} |S_{i,j}|$ but it turned out to be harder to tune while giving similar results.

Figure 1: (a) Magnitude pruning. (b) Movement pruning. During fine-tuning (on MNLI), the weights stay close to their pre-trained values, which limits the adaptivity of magnitude pruning. We plot the identity line in black. Pruned weights are plotted in grey. Magnitude pruning selects weights that are far from 0 while movement pruning selects weights that are moving away from 0.

Method Interpretation. In movement pruning, the gradient of $L$ with respect to $W_{i,j}$ is given by the standard gradient derivation: $\frac{\partial L}{\partial W_{i,j}} = \frac{\partial L}{\partial a_i} M_{i,j} x_j$. By combining it to Eq (2), we have $\frac{\partial L}{\partial S_{i,j}} = \frac{\partial L}{\partial W_{i,j}} W_{i,j}$ (we omit the binary mask term $M_{i,j}$ for simplicity). From the gradient update in Eq (2), $S_{i,j}$ is increasing when $\frac{\partial L}{\partial S_{i,j}} < 0$, which happens in two cases:
(a) $\frac{\partial L}{\partial W_{i,j}} < 0$ and $W_{i,j} > 0$
(b) $\frac{\partial L}{\partial W_{i,j}} > 0$ and $W_{i,j} < 0$
It means that during training $W_{i,j}$ is increasing while being positive or is decreasing while being negative. It is equivalent to saying that $S_{i,j}$ is increasing when $W_{i,j}$ is moving away from 0. Inversely, $S_{i,j}$ is decreasing when $\frac{\partial L}{\partial S_{i,j}} > 0$, which means that $W_{i,j}$ is shrinking towards 0.
While magnitude pruning selects the most important weights as the ones which maximize their distance to 0 ($|W_{i,j}|$), movement pruning selects the weights which are moving the most away from 0 ($S_{i,j}$). For this reason, magnitude pruning can be seen as a 0th order method, whereas movement pruning is based on a 1st order signal. In fact, $S$ can be seen as an accumulator of movement: from equation (2), after $T$ gradient updates, we have

$$S_{i,j}^{(T)} = -\alpha_S \sum_{t < T} \left(\frac{\partial L}{\partial W_{i,j}}\right)^{(t)} W_{i,j}^{(t)} \qquad (4)$$
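To make the accumulator view of Eq (4) concrete, the following editor-added toy sketch (a single unmasked linear map trained with plain SGD; all sizes, learning rates, and names are illustrative assumptions, not the paper's setup) tracks the movement score as the negative running sum of gradient-times-weight:

    import torch

    torch.manual_seed(0)
    W = torch.randn(8, 8, requires_grad=True)   # weights being fine-tuned
    S = torch.zeros_like(W)                     # movement accumulator of Eq (4)
    alpha_S, alpha_W = 0.01, 0.1

    x = torch.randn(32, 8)
    y = torch.randn(32, 8)
    for _ in range(100):
        loss = ((x @ W.t() - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            S -= alpha_S * W.grad * W           # S^(T) = -alpha_S * sum_t (dL/dW)^(t) W^(t)
            W -= alpha_W * W.grad               # plain SGD step on the weights
            W.grad.zero_()

    # entries with large S are those that moved away from zero during training;
    # keeping the top-v% of S corresponds to the selection rule of (hard) movement pruning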